Traditional self-supervised monocular depth estimation models have limitations in extracting and fusing shallow features, leading to problems such as missed detection of small objects and blurred object edges. To address these problems, this paper proposes a self-supervised monocular depth estimation model based on an improved dense network and wavelet decomposition. The overall framework follows the U-Net structure, in which the encoder adopts an improved DenseNet to strengthen feature extraction and fusion. A detail enhancement module is introduced in the skip connections to further refine and integrate the multi-scale features generated by the encoder. The decoder incorporates wavelet decomposition, allowing it to focus on high-frequency information during decoding and to achieve precise edge refinement. Experimental results demonstrate that the proposed method captures the depth of small objects more effectively and produces depth maps with clearer, more accurate edges.
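The decoder design is only summarized above; the following minimal Python sketch simply illustrates the underlying principle, namely that a wavelet decomposition splits a depth map into a low-frequency approximation and high-frequency bands that carry the edge detail the decoder is meant to focus on. The use of the PyWavelets package and the Haar basis are illustrative assumptions, not details taken from the model.

```python
import numpy as np
import pywt

# Toy depth map standing in for a coarse decoder prediction.
depth = np.random.rand(64, 64)

# A single-level 2D wavelet decomposition yields a low-frequency approximation
# and three high-frequency bands (horizontal, vertical, diagonal) that carry
# edge detail.
low, (lh, hl, hh) = pywt.dwt2(depth, "haar")

# A wavelet-based decoder can refine edges by operating on (or predicting) the
# high-frequency bands and reconstructing with the inverse transform.
reconstructed = pywt.idwt2((low, (lh, hl, hh)), "haar")

print(np.allclose(depth, reconstructed))  # the round trip is lossless
```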
Specifically, in the detail enhancement module, the feature map output by the encoder first passes through a 3×3 convolutional layer to extract the feature X. X is then compressed into a vector by a global pooling layer to capture contextual information, and two 1×1 convolutional layers followed by a Sigmoid activation produce the weight vector Y, which recalibrates the importance of the different channels. Next, X and Y are multiplied element-wise to regenerate the weighted features; through this operation, channels carrying key information receive larger weights, enhancing the edge details of the multi-scale feature maps. Finally, the regenerated weighted features are integrated with the feature X, which can be expressed mathematically as:
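A minimal PyTorch sketch of this channel re-weighting is given below. The channel counts, the reduction ratio, the activation between the two 1×1 convolutions, and the use of element-wise addition for the final integration with X are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn


class DetailEnhancementModule(nn.Module):
    """Sketch of the detail enhancement module described above."""

    def __init__(self, in_channels: int, reduction: int = 4):
        super().__init__()
        # 3x3 convolution that extracts the feature X from the encoder output.
        self.extract = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # Global pooling + two 1x1 convolutions + Sigmoid produce the weight vector Y.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),  # intermediate activation: an assumption
            nn.Conv2d(in_channels // reduction, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.extract(feat)      # feature X
        y = self.fc(self.pool(x))   # per-channel weights Y in [0, 1]
        weighted = x * y            # re-weight channels (broadcast multiply)
        return x + weighted         # integrate with X (residual addition: an assumption)
```

In this form the module behaves like a squeeze-and-excitation style channel attention block applied on top of the 3×3 convolution, with the re-weighted features fused back onto X.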