Learning Depth for Scene Reconstruction using an Encoder-decoder Model

Xiaohan Tu, Cheng Xu, Siping Liu, Guoqi Xie, Jing Huang, Renfa Li, Junsong Yuan
2020 IEEE Access  
Depth estimation has received considerable attention and is often applied to visual simultaneous localization and mapping (SLAM) for scene reconstruction. To the best of our knowledge, previous monocular depth estimation methods fail to provide sufficiently reliable depth for SLAM because new image features are rarely re-exploited effectively, local features are easily lost, and relative depth relationships among depth pixels are readily ignored. Based on inaccurate monocular depth estimation, SLAM still faces scale ambiguity problems. To achieve accurate scene reconstruction based on monocular depth estimation, this paper makes three contributions. (1) We design a depth estimation model (DEM), consisting of a precise encoder that re-exploits new features and a decoder that learns local features effectively. (2) We propose a loss function that uses the depth relationships among pixels to guide the training of the DEM. (3) We design a modular SLAM system containing the DEM, feature detection, descriptor computation, feature matching, pose prediction, keyframe extraction, loop closure detection, and pose-graph optimization for pixel-level scene reconstruction. Extensive experiments demonstrate that the DEM and the DEM-based SLAM system are effective. (1) Our DEM predicts more reliable depth than the state of the art on public datasets when the inputs are RGB images, sparse depth, or a fusion of both. (2) The DEM-based SLAM system achieves accuracy comparable to that of well-known modular SLAM systems.

INDEX TERMS Convolutional neural networks, depth estimation, decoder, encoder, simultaneous localization and mapping.

Recently, most work uses convolutional neural networks (CNNs) to predict depth from sparse depth [3], monocular RGB images [1], [4], [5], [6], or a fusion of both [7], [8]. These methods tend to leverage CNN models such as ResNets [9] to learn visual features [5]-[8]. ResNets are often adopted as encoders [5], [8], and these encoders reuse image features in depth estimation. After the encoder, a decoder based on linear interpolation is commonly used to output high-resolution depth maps, yet such decoders easily lose local image features. In essence, recent encoders and decoders suffer from accuracy limitations imposed by their respective shortcomings: encoders do not effectively re-explore new features, and decoders do not learn local features in depth prediction.
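The decoders criticized above upsample coarse maps with fixed linear interpolation. A minimal pure-Python sketch of 2-D bilinear upsampling (an illustration of that decoding step, not the paper's DEM decoder; the function name and integer-scale restriction are our assumptions) shows why local detail suffers: every output pixel is a smooth blend of its four nearest inputs, so sharp local structure is averaged away rather than learned.

```python
def bilinear_upsample(depth, scale):
    """Upsample a 2-D depth map by an integer factor using bilinear
    interpolation, i.e. a fixed, non-learned decoding step.
    Illustrative sketch only."""
    h, w = len(depth), len(depth[0])
    H, W = h * scale, w * scale
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Map output coordinates back to fractional input coordinates.
            fy = min(y / scale, h - 1)
            fx = min(x / scale, w - 1)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            wy, wx = fy - y0, fx - x0
            # Blend the four nearest input pixels; any sharp edge in the
            # input becomes a smooth ramp in the output.
            out[y][x] = (depth[y0][x0] * (1 - wy) * (1 - wx)
                         + depth[y0][x1] * (1 - wy) * wx
                         + depth[y1][x0] * wy * (1 - wx)
                         + depth[y1][x1] * wy * wx)
    return out
```

For a 2x2 input with a hard vertical edge, e.g. `bilinear_upsample([[0.0, 2.0], [0.0, 2.0]], 2)`, the edge pixels are blended to the intermediate value 1.0, which is exactly the loss of local features the text describes.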
Additionally, researchers often estimate depth with CNNs driven by ground-truth metric depth, usually ignoring the relative depth relationships among pixels in RGB images. We find that CNNs are trained better

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 8, 2020
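One common way to exploit relative depth relationships among pixels, in the spirit of the paragraph above, is a pairwise ranking loss. The sketch below is a hypothetical helper following the standard logistic ranking formulation, not the paper's actual loss function: it penalizes the network whenever the predicted ordering of two pixels contradicts their ground-truth ordering, without requiring metric depth for the pair.

```python
import math

def pairwise_rank_loss(pred, gt_order_pairs):
    """Relative-depth ranking loss (illustrative sketch).

    pred           -- flat list of predicted depths, one per pixel.
    gt_order_pairs -- tuples (i, j, r) with r = +1 if pixel i is farther
                      than pixel j in the ground truth, -1 if nearer.
    """
    total = 0.0
    for i, j, r in gt_order_pairs:
        # Logistic loss on the signed predicted difference: small when
        # sign(pred[i] - pred[j]) agrees with r, large when it disagrees.
        total += math.log(1.0 + math.exp(-r * (pred[i] - pred[j])))
    return total / max(len(gt_order_pairs), 1)
```

With predictions `[2.0, 1.0]`, the pair `(0, 1, +1)` (pixel 0 correctly predicted farther) yields a small loss, while `(0, 1, -1)` yields a large one, so gradients push the network toward the ground-truth ordering even where absolute depth is uncertain.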
doi:10.1109/access.2020.2993494