filled depth map completed using interpolation
methods and filtering (Senushkin et al., 2020;
de Queiroz Mendes et al., 2021). When compared to semi-dense depth maps provided by Kinect sensors, depth maps obtained by uniformly sampling LiDAR measurements do not present the same kind of missing areas observed in our work. The uniform sampling step may reproduce the same density of valid pixels, but it fails to concentrate them on object edges or on reflective, transparent, or irregular surfaces (Senushkin et al., 2020). Furthermore, while outdoor environments are the typical application scenario for LiDAR sensors, low-cost depth sensing devices are preferred for indoor tasks.
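For illustration only (this snippet is not from the cited works), uniformly sparsifying a dense depth map matches a target density of valid pixels but spreads the holes at random, unlike the edge- and surface-dependent gaps of structured-light sensors:

import numpy as np

def uniform_sparsify(dense_depth, valid_ratio, seed=0):
    # Keep a uniformly random subset of pixels so that the fraction of
    # valid (non-zero) measurements matches valid_ratio. Unlike Kinect
    # holes, the resulting gaps are not concentrated on object edges or
    # on reflective/transparent surfaces.
    rng = np.random.default_rng(seed)
    mask = rng.random(dense_depth.shape) < valid_ratio
    return np.where(mask, dense_depth, 0.0)  # 0 marks missing depth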
Finally, image inpainting comprises methods for repairing missing areas in an image, usually tackling small and thin regions or large flat areas (Xian et al., 2020). The focus is on obtaining plausible results consistent with the scene rather than on generating accurate pixel values, as required by depth completion (Huang et al., 2019). Nevertheless, Huang et al. (2019) used an image inpainting method based on a self-attention mechanism and gated convolutions for depth completion.
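For reference, a gated convolution in the sense of free-form image inpainting modulates a feature convolution with a learned soft gate, letting the network suppress activations inside missing regions. The PyTorch layer below is a minimal sketch of this idea, not the exact block used by Huang et al. (2019):

import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    # Minimal gated convolution: the feature branch is modulated by a
    # sigmoid gate computed from the same input.
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        return torch.sigmoid(self.gate(x)) * torch.relu(self.feature(x))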
In this paper, we assume the input is an RGB-D image of an indoor scene, provided by a low-cost sensor and containing missing depth regions, while the output is the complete depth map of the view. Our method aims to fill in all missing areas in the depth map using cues provided by the full RGB image. Therefore, the ground truth provided by the dataset must consist of fully completed depth maps. LiDAR data would not be appropriate in our context, since the difference in scale might not produce the same spatial distribution of holes. For example, while missing areas around object edges shrink as the distance to the sensor increases, outdoor scenes may present large holes at the top of the depth map corresponding to the sky. We investigate the use of an Encoder-Decoder Convolutional Neural Network (CNN) with a novel loss function that combines the depth estimation error with an inverted Euclidean Distance Transform (EDT). Our contribution is to show that the EDT can guide the training to better complete large missing regions, which represent the most difficult part of the problem, and to enhance the result with an edge-aware refinement module for smaller error rates.
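To make the idea concrete, the sketch below weights a per-pixel L1 depth error by a normalized EDT of the hole mask, so that pixels deep inside large holes contribute more to the loss. The exact normalization and inversion used in our loss are defined later; this snippet only illustrates the mechanism, and the weighting scheme shown is an assumption:

import torch
import numpy as np
from scipy.ndimage import distance_transform_edt

def edt_weighted_l1(pred, gt, hole_mask):
    # hole_mask: boolean HxW array, True where the raw depth is missing.
    # distance_transform_edt gives, for each hole pixel, the distance to
    # the nearest valid measurement; normalizing it yields larger weights
    # at the centre of large holes, the hardest pixels to complete.
    dist = distance_transform_edt(hole_mask)      # 0 on valid pixels
    weight = 1.0 + dist / (dist.max() + 1e-8)     # assumed weighting
    w = torch.from_numpy(weight).to(pred.device, pred.dtype)
    return (w * (pred - gt).abs()).mean()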
The proposed method has two modules: a depth completion step from the RGB-D input and a refinement module using GeoNet++ (Qi et al., 2020) to enhance the dense depth map obtained from the depth completion module. Figure 1 provides an overview of both modules and shows where the RGB image is used to guide the completion of the raw depth map.
1.1 Related Work
Although previous studies on depth map completion have addressed the task using traditional image processing techniques, such as bilateral filtering (Chen et al., 2012; Bapat et al., 2015) and the Fourier transform (Raviya et al., 2019), deep neural networks can be particularly useful for learning from existing data in an attempt to generalize to unseen maps (Ponti et al., 2021).
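As a concrete example of such a traditional baseline (a naive sketch, not the exact method of the works cited above), a cross bilateral filter can fill depth holes by averaging nearby valid depths, weighted by both spatial distance and RGB intensity similarity:

import numpy as np

def cross_bilateral_fill(depth, gray, radius=5, sigma_s=3.0, sigma_r=0.1):
    # Fill each missing pixel (depth == 0) with a weighted average of
    # nearby valid depths; weights combine spatial proximity and
    # intensity similarity (gray assumed in [0, 1]). O(H*W*radius^2),
    # unoptimized, and holes wider than the window stay unfilled.
    h, w = depth.shape
    out = depth.copy()
    for y, x in zip(*np.nonzero(depth == 0)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        patch_d, patch_g = depth[y0:y1, x0:x1], gray[y0:y1, x0:x1]
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
        w_r = np.exp(-((patch_g - gray[y, x]) ** 2) / (2 * sigma_r ** 2))
        wgt = w_s * w_r * (patch_d > 0)
        if wgt.sum() > 0:
            out[y, x] = (wgt * patch_d).sum() / wgt.sum()
    return out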
In terms of deep learning-based depth completion of semi-dense indoor depth maps, the method described by Zhang and Funkhouser (2018) was the first to define the problem of filling large missing areas in depth maps acquired with commodity-grade depth cameras. Their solution predicts local properties of the visible surface at each pixel (occlusion boundaries and surface normals) and then applies a global optimization step to solve these properties back to depth.
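In rough terms (our paraphrase, following the spirit rather than the exact formulation of Zhang and Funkhouser (2018)), the global step minimizes a sparse least-squares objective of the form

E(D) = \lambda_D \sum_{p \in \Omega_{\mathrm{obs}}} \big(D(p) - D_{\mathrm{raw}}(p)\big)^2 + \lambda_N \sum_{(p,q)} \big(v_{pq}(D) \cdot \hat{n}(p)\big)^2 + \lambda_S \sum_{(p,q)} w_{pq}\,\big(D(p) - D(q)\big)^2,

where the first term anchors the solution to the observed depth, the second enforces consistency between the tangent vectors v_{pq}(D) of neighbouring back-projected pixels and the predicted normals \hat{n}, and the third smooths neighbouring depths with weights w_{pq} that are reduced across predicted occlusion boundaries.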
Another problem addressed by Zhang and Funkhouser (2018) is the creation of a dataset containing RGB-D images paired with their respective completed depth images. Their solution consisted of using surface meshes reconstructed from existing multi-view RGB-D datasets: projecting these meshes from each image viewpoint fills the missing areas, providing accurate ground truth images.
Using the same approach to compute local properties of the visible surface as Zhang and Funkhouser (2018), the work presented by Huang et al. (2019) replaces the global optimization step with a U-Net with self-attention blocks designed for image inpainting. To preserve object edges throughout the depth reconstruction process, a boundary consistency module is applied after the attention module. Both methods, however, still rely on external data to train a surface normal estimation network.
Later, an adaptive convolution operator was proposed to fill in the depth map progressively (Xian et al., 2020). The depth completion module, whose input is only the raw depth map, is used along with a refinement network that considers patches of the RGB component and of the completed depth map. Furthermore, the authors provided a subset of the NYU-v2 dataset (Silberman et al., 2012) containing RGB-D images captured with a Microsoft Kinect and their respective ground truth depth maps.
Like Xian et al. (2020), Senushkin et al. (2020) exploited only the four-channel input provided by RGB-D data. Their method uses Spatially-Adaptive Denormalization blocks to control the decoding of a dense depth map, dealing with the statistical differences between regions with acquired data and regions corresponding to holes.
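Spatially-Adaptive Denormalization, originally proposed for semantic image synthesis, replaces the fixed affine parameters of a normalization layer with per-pixel scales and offsets predicted from a conditioning map. Below is a minimal PyTorch sketch, assuming the conditioning signal is the validity mask (the exact conditioning used by Senushkin et al. (2020) may differ):

import torch.nn as nn

class SPADE(nn.Module):
    # Normalized features are modulated by per-pixel gamma/beta maps
    # predicted from a conditioning input (here, the hole/validity mask).
    def __init__(self, channels, cond_channels=1, hidden=64):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, cond):
        # cond: (B, cond_channels, H, W), resized to x's spatial size
        h = self.shared(cond)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)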