To address the low efficiency of binocular vision measurement based on feature point detection and the high computational complexity of neural networks, a binocular vision localization and measurement method based on a lightweight HRNet was proposed. The lightweight HRNet was built upon the original HRNet by replacing the convolutional modules to reduce the number of parameters, introducing a Transformer to extract global image features, and employing a multi-level upsampling fusion strategy to capture multi-scale feature information. Compared with the original HRNet, the lightweight HRNet reduces the model parameters by 95.40%, while the computational load and the normalized mean error are decreased by 94.27% and 6.25%, respectively. In terms of 3D measurement, the relative error of the method combining the lightweight HRNet with binocular vision reaches 0.256%, enabling high-precision detection on hardware with low computational power.
Convolution-based neural networks are limited in capturing long-range dependencies and global contextual information. Inspired by natural language processing (NLP), researchers introduced the Transformer model [15] into computer vision and achieved good results. CHEN et al. [16] combined the Transformer with U-Net and proposed TransUNet, which performs well in segmentation tasks. Hybrid models that combine convolution with the Transformer give the network both the inductive bias of convolution and the global modeling capability of the Transformer.
The input image X has a size of H×W and 3 channels, i.e., X ∈ R^{H×W×3}. In the preprocessing stage, the input image is downsampled by a factor of 4 to obtain a feature map X1 ∈ R^{H/4×W/4×C} with C channels. The backbone network has four stages, and one parallel branch is added at each stage. The feature map of each new parallel branch is downsampled by a factor of 2 and its number of channels is doubled. Each branch consists of two Shuffle blocks and one multi-resolution fusion unit, and each unit performs feature extraction and cross-branch interaction at its assigned resolution. The backbone finally outputs feature maps at four scales, whose resolutions are 1/4, 1/8, 1/16, and 1/32 of the original image and whose channel numbers are C, 2C, 4C, and 8C, respectively. A Transformer layer is added to the lowest-resolution feature stream to capture spatial relationships. In the neck of the network, a multi-level upsampling fusion strategy progressively fuses the low-resolution feature maps with the high-resolution feature maps: after upsampling, a low-resolution feature map is concatenated with the corresponding high-resolution feature map along the channel dimension, and a reconstruction module then reduces spatial and channel redundancy to obtain the fused feature map. Finally, the output head upsamples the feature map to the same resolution as the input image and generates heatmaps.
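To make the dataflow concrete, the following is a minimal PyTorch sketch of the skeleton described above: a Transformer encoder applied to the 1/32-resolution stream, followed by a bottom-up fusion neck and an output head. All class names, layer counts, and hyperparameters here are illustrative assumptions rather than the paper's actual implementation, and the 1×1 convolutions stand in for the SCR module described in the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResTransformer(nn.Module):
    """Transformer encoder on the lowest-resolution (1/32, 8C-channel) stream;
    each spatial location is treated as a token so attention models global
    relations. Head and layer counts are illustrative, not the paper's values."""
    def __init__(self, channels, num_heads=4, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                       # x: (B, 8C, H/32, W/32)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, h*w, 8C) token sequence
        tokens = self.encoder(tokens)           # global spatial interaction
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class FusionNeck(nn.Module):
    """Bottom-up multi-level fusion of the four backbone scales, followed by an
    output head that restores input resolution and predicts keypoint heatmaps.
    The 1x1 convolutions are placeholders for the SCR module (next section)."""
    def __init__(self, c, num_keypoints):
        super().__init__()
        self.reduce_16 = nn.Conv2d(8 * c + 4 * c, 4 * c, 1)  # fuse 1/32 into 1/16
        self.reduce_8 = nn.Conv2d(4 * c + 2 * c, 2 * c, 1)   # fuse 1/16 into 1/8
        self.reduce_4 = nn.Conv2d(2 * c + c, c, 1)           # fuse 1/8 into 1/4
        self.head = nn.Conv2d(c, num_keypoints, 1)

    def forward(self, feats):
        f4, f8, f16, f32 = feats                # channels: C, 2C, 4C, 8C
        x = self._fuse(f32, f16, self.reduce_16)
        x = self._fuse(x, f8, self.reduce_8)
        x = self._fuse(x, f4, self.reduce_4)
        x = F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)
        return self.head(x)                     # heatmaps at input resolution

    @staticmethod
    def _fuse(low, high, reduce):
        low = F.interpolate(low, size=high.shape[-2:], mode='bilinear',
                            align_corners=False)
        return reduce(torch.cat([low, high], dim=1))  # channel concatenation

# Example with C = 32 and a 256x256 input (shapes only; backbone omitted)
C = 32
feats = [torch.randn(1, C * m, 256 // s, 256 // s)
         for m, s in [(1, 4), (2, 8), (4, 16), (8, 32)]]
feats[3] = LowResTransformer(8 * C)(feats[3])
heatmaps = FusionNeck(C, num_keypoints=4)(feats)
print(heatmaps.shape)                           # torch.Size([1, 4, 256, 256])
```

Restricting attention to the 1/32-resolution stream keeps the token count small, so the global modeling step adds little computation relative to the convolutional backbone.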
To address the insufficient utilization of feature information from the low-resolution branches, the neck adopts multi-level upsampling fusion to fuse the multi-scale features progressively from bottom to top. The fusion of a low-resolution feature map into a high-resolution feature map is shown in Fig. 8. First, the low-resolution feature map is upsampled to the same resolution as the high-resolution feature map. Then, the upsampled feature map is concatenated with the corresponding high-resolution feature map along the channel dimension. Because the concatenated feature map carries a large amount of feature information, it is fed into a feature modeling module (SCR) to suppress redundant features and obtain the fused low- and high-resolution feature map. The SCR consists of two 1×1 convolutions and one spatial and channel reconstruction convolution (SCConv) [18]; SCConv adaptively suppresses feature redundancy and promotes the learning of representative features, effectively reducing the computational complexity and the number of parameters of the model.
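The sketch below, again in PyTorch, illustrates one plausible reading of the SCR block: a 1×1 projection, a redundancy-suppression step, and another 1×1 convolution. The `SimplifiedSCConv` class is only a hypothetical channel-gating stand-in for the actual SCConv operator cited above, which combines separate spatial and channel reconstruction units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedSCConv(nn.Module):
    """Hypothetical stand-in: suppresses redundant channels with a squeeze-and-gate
    step, loosely mimicking how SCConv down-weights uninformative features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)         # re-weight channels, damping redundancy

class SCRBlock(nn.Module):
    """Sketch of the SCR fusion block: 1x1 conv -> SCConv-like step -> 1x1 conv."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj_in = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.scconv = SimplifiedSCConv(out_channels)
        self.proj_out = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = F.relu(self.proj_in(x))     # compress the concatenated channels
        x = self.scconv(x)              # suppress redundant features
        return self.proj_out(x)

# Example: fuse an upsampled 1/32 map (8C channels) with a 1/16 map (4C), C = 32
low = torch.randn(1, 256, 20, 20)       # upsampled low-resolution features
high = torch.randn(1, 128, 20, 20)      # high-resolution features
fused = SCRBlock(256 + 128, 128)(torch.cat([low, high], dim=1))
print(fused.shape)                      # torch.Size([1, 128, 20, 20])
```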
XU Sixiang, CHEN Fuqiang, GAO Peiqing, et al. System for Removing Slab Burrs: CN102935547B[P]. 2014-10-15.
[3] XU Sixiang, DONG Chenchen, ZHOU Shuhua, et al. Binocular Measurement Method for the Continuous Casting Slab Model Based on the Improved BRISK Algorithm[J]. Applied Optics, 2022, 61(11): 3019-3025.
ZHOU Shuhua, XU Sixiang, DONG Chenchen, et al. Algorithm for Binocular Vision Measurements Based on Local Information Entropy and Gradient Drift[J]. Laser & Optoelectronics Progress, 2023, 60(12): 333-341.
SONG Xiang, XU Sixiang, YANG Lifa, et al. Binocular Vision Measurement Method Based on Nonlinear Diffusion and High-dimensional M-SURF Descriptor[J]. Journal of Optoelectronics·Laser, 2024, 35(4): 405-413.
XIE Yang, DAI Yiqun, ZHANG Chaoyong, et al. A Method for Identifying and Predicting Energy Consumption of Machine Tools by Combining Integrated Models and Deep Learning[J]. China Mechanical Engineering, 2023, 34(24): 2963-2974.
[10] BORRAS R, PREMALATHA B, DIVYA G, et al. Deep Hashing with Multilayer CNN-based Biometric Authentication for Identifying Individuals in Transportation Security[J]. Journal of Transportation Security, 2024, 17(1): 4.
[11] MASUDA Y, ISHIKAWA R, TANAKA T, et al. CNN-based Fully Automatic Mitral Valve Extraction Using CT Images and Existence Probability Maps[J]. Physics in Medicine & Biology, 2024, 69(3): 035001.
[12] SHELHAMER E, LONG J, DARRELL T. Fully Convolutional Networks for Semantic Segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651.
[13] RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional Networks for Biomedical Image Segmentation[M]∥Medical Image Computing and Computer-assisted Intervention—MICCAI 2015. Cham: Springer International Publishing, 2015: 234-241.
[14] SUN Ke, XIAO Bin, LIU Dong, et al. Deep High-resolution Representation Learning for Human Pose Estimation[C]∥2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, 2019: 5686-5696.
[15] WANG Jian, LONG Xiang, CHEN Guowei, et al. U-HRNet: Delving into Improving Semantic Representation of High Resolution Network for Dense Prediction[J]. arXiv:2210.07140.
[16] HOWARD A G, ZHU Menglong, CHEN Bo, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. arXiv:1704.04861.
[17] ZHANG Xiangyu, ZHOU Xinyu, LIN Mengxiao, et al. ShuffleNet: an Extremely Efficient Convolutional Neural Network for Mobile Devices[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 6848-6856.
ZHENG Yunfei, WANG Xiaobing, ZHANG Xiongwei, et al. The Self-distillation HRNet Object Segmentation Based on the Pyramid Knowledge[J]. Acta Electronica Sinica, 2023, 51(3): 746-756.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All You Need[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, 2017: 6000-6010.
[21] CHEN Jieneng, LU Yongyi, YU Qihang, et al. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation[J]. arXiv:
[22] ZHANG Zhengyou. A Flexible New Technique for Camera Calibration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(11): 1330-1334.
[23] LI Jiafeng, WEN Ying, HE Lianghua. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy[C]∥2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, 2023: 6153-6162.
LI Tongpu, XU Sixiang, SHI Yuxiang, et al. Continuous Casting Slab Model Positioning and Measurement Based on Binocular Vision and Transformer[J]. Journal of Central South University (Science and Technology), 2024, 55(4): 1312-1322.