Objective Pedestrian detection is a crucial task in computer vision, particularly in applications such as autonomous driving, robot navigation, and intelligent surveillance. However, pedestrian occlusion in real-world scenarios remains a significant challenge: occlusion sharply reduces the visible extent of targets and causes a substantial loss of pedestrian features, making it difficult for detectors to distinguish pedestrians from the background. Existing methods, including post-processing optimization, improvements to specific models, and body-part-based approaches, have limitations such as misjudging heavily occluded positive samples, high computational complexity, and susceptibility to background noise. A more effective method for detecting occluded pedestrians is therefore essential to enhance detector performance. Methods The proposed global feature focusing and information enhancement network (GFFIE-Net) employed HRNet-W32 as the backbone network to generate multi-scale feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution. These feature maps captured both high-level semantic information and low-level spatial details, which were essential for detecting pedestrians in complex scenes. The convolutional block attention module (CBAM) was embedded after the feature maps to enhance the feature representation and reduce background-noise interference. CBAM adjusted the importance of each channel and spatial location in the feature maps through global average pooling, max pooling, and small fully connected networks in the channel and spatial attention dimensions. This strengthened the feature information in key areas and suppressed background noise, enabling the network to focus on the target area.
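The channel-then-spatial gating described above can be sketched in a few lines. This is a minimal NumPy illustration of CBAM-style attention, not the paper's implementation: the shared-MLP weights `w1` and `w2` are placeholders for learned parameters, and a fixed equal-weight combination stands in for the 7×7 convolution that a real CBAM applies in its spatial branch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    # feat: (C, H, W). A shared two-layer MLP (w1, w2) scores the
    # average-pooled and max-pooled channel descriptors.
    avg = feat.mean(axis=(1, 2))                       # (C,)
    mx = feat.max(axis=(1, 2))                         # (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                   + w2 @ np.maximum(w1 @ mx, 0.0))    # (C,) in (0, 1)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    # Pool along the channel axis, then gate each spatial location.
    avg = feat.mean(axis=0, keepdims=True)             # (1, H, W)
    mx = feat.max(axis=0, keepdims=True)               # (1, H, W)
    # A real CBAM applies a learned 7x7 conv to [avg; max]; this fixed
    # equal-weight combination is an illustrative stand-in.
    gate = sigmoid(0.5 * (avg + mx))
    return feat * gate

def cbam(feat, w1, w2):
    # Channel attention first, then spatial attention, as in CBAM.
    return spatial_attention(channel_attention(feat, w1, w2))
```

Because both gates lie in (0, 1), the module can only reweight responses, amplifying nothing; the network learns `w1`/`w2` so that background locations receive small gates.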
Then, considering the limitations of CNN-based methods in extracting global information, the Mamba module was cascaded after the CBAM. The Mamba module first flattened the feature maps into one-dimensional patch vectors and then used linear layers for feature extraction and transformation. Through forward and backward scans with the state space model (SSM), it captured global contextual information and long-range dependencies between feature vectors. This helped extract contextual information around occluded pedestrians and infer complete pedestrian features from the visible parts. Finally, a hierarchical feature fusion mechanism was designed. It first used bilinear interpolation to bring the feature maps of different scales to a consistent spatial resolution. It then concatenated the three high-dimensional, low-resolution feature maps rich in semantic information along the channel dimension to enhance the deep semantic representation. After that, it combined this preliminarily fused map with the low-dimensional, high-resolution feature map containing more positional detail, again along the channel dimension. This achieved a comprehensive fusion of high-level semantic and positional detail information, enabling the algorithm to capture multi-level semantic features. The final feature map was processed by a detection head, which generated center heatmaps, scale maps, and offset maps to predict pedestrian bounding boxes. Results and Discussions Ablation experiments were designed from four aspects to comprehensively verify the effectiveness of the proposed GFFIE-Net improvements. First, the effects of different global information extraction methods on the experimental results were investigated. Second, the contributions of the individual modules to network performance were analyzed.
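The two-step fusion can be illustrated with a small NumPy sketch. The channel counts and map sizes below are toy values (not the paper's configuration), and the hand-rolled bilinear resize stands in for a framework's interpolation routine; it assumes feature maps are ordered from deepest (lowest resolution) to shallowest (highest resolution).

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    # x: (C, H, W) -> (C, out_h, out_w) via separable bilinear interpolation.
    c, h, w = x.shape
    ys = np.linspace(0.0, h - 1, out_h)
    xs = np.linspace(0.0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]      # vertical interpolation weights
    wx = (xs - x0)[None, None, :]      # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def hierarchical_fuse(feats):
    # feats: list of (C_i, H_i, W_i), deepest (low-res, semantic) first,
    # shallowest (high-res, detailed) last.
    deep, shallow = feats[:3], feats[3]
    th, tw = shallow.shape[1:]
    # Step 1: upsample the three deep maps to the shallow map's resolution
    # and concatenate them along the channel dimension.
    fused_deep = np.concatenate(
        [bilinear_resize(f, th, tw) for f in deep], axis=0)
    # Step 2: concatenate the fused semantic map with the high-resolution
    # map that carries positional detail.
    return np.concatenate([fused_deep, shallow], axis=0)
```

Concatenation (rather than summation) keeps semantic and positional channels separate, leaving it to the detection head to learn how to weight them.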
Third, the impact of different scales on network performance, the sequential cascade structure, and the rationale of the hierarchical feature fusion were explored. Fourth, the robustness of the designed enhancement modules was verified by testing them on different backbone networks. Extensive experiments were conducted on three challenging pedestrian datasets: CityPersons, Caltech, and CrowdHuman. The MR⁻² metric (log-average miss rate; lower is better) reached 43.7% on the heavily occluded subset of CityPersons, an improvement of 4.4 percentage points over the baseline; 33.6% on the heavily occluded subset of Caltech; and 43.2% on CrowdHuman, outperforming several mainstream methods. Finally, a visualization analysis of the detection boxes and center heatmaps was conducted. Seven representative practical scene images were selected from the three datasets, covering traffic, intersection video surveillance, nighttime, high-density traffic, strong light, small-target, and crowded pedestrian scenes. Compared with the baseline network, GFFIE-Net produced stronger center responses and more accurate detection-box localization for occluded pedestrians. In the high-density traffic scene, for example, when multiple pedestrians occluded one another, the baseline network missed many of them and its center heatmap responded weakly to occluded individuals, whereas GFFIE-Net accurately identified and localized them. This indicated that GFFIE-Net handled occluded pedestrians effectively across scenarios, demonstrating strong adaptability and high detection performance. Conclusions The proposed GFFIE-Net, integrating the CBAM module, the Mamba module, and a hierarchical feature fusion mechanism, effectively addresses the challenges of feature loss and background noise in occluded scenarios.
The experimental results on three benchmark datasets demonstrate the superiority of GFFIE-Net over existing methods, particularly in handling heavily occluded pedestrians. Future research can explore semi-supervised or self-supervised learning with limited labeled data, which would reduce dependence on large-scale labeled datasets, enhance model generalization, and improve the method's applicability and accuracy across diverse scenarios.
With the rapid development of deep learning in computer vision, pedestrian detection has shifted from traditional hand-crafted feature extraction to deep-learning-based methods. Early region-based convolutional neural network (R-CNN) methods [4-6] split detection into two stages, performing classification and regression on preset anchor boxes of fixed sizes; they achieved clearly leading accuracy at the time and laid an important foundation for pedestrian detection. However, as research deepened, the anchor hyperparameters (such as size, aspect ratio, and number) were found to strongly affect detector performance, becoming a major obstacle to further gains. Anchor-free pedestrian detection methods emerged to address this problem: they require no preset anchors and detect pedestrians directly, end to end, avoiding the limitations of hand-designed anchors, and have therefore gained increasing favor among researchers. For example, CornerNet [7] and CenterNet [8] focus on corner locations, using corner pooling layers to enhance features and localize a target's top-left and bottom-right corners more accurately; a center keypoint can further be added to each corner pair to form a triplet for detection, capturing both the boundary and the interior information of the target. Center and scale prediction (CSP) [9] reformulates detection as predicting center points and their corresponding scales, an approach that is simple and effective. The proposed global feature focusing and information enhancement network (GFFIE-Net) likewise adopts center-and-scale prediction, but with a distinctive design and advantages in its implementation.

To tackle occlusion, a key challenge, researchers in recent years have mainly explored three directions: post-processing optimization, improvements to specific models, and body-part-based feature processing. In post-processing optimization, researchers design specialized loss functions and improved non-maximum suppression (NMS) strategies. For example, RepLoss [10] penalizes inaccurately localized detection boxes through an improved loss, further raising the detection accuracy for occluded pedestrians. BIA-NMS [11] and OTP-NMS [12] optimize the NMS strategy to suppress redundant detections more flexibly, effectively preserving targets that may be occluded. Nevertheless, because NMS inherently suppresses candidate boxes, heavily occluded positive samples may still be misjudged as false positives despite continued refinement, leaving the occlusion problem hard to solve effectively. In model-specific improvements, following the great success of Transformers [13-14] in computer vision, researchers have explored methods based on the detection Transformer (DETR) [15]. DETR uses a one-to-one label matching strategy for prediction, effectively avoiding some drawbacks of NMS. In particular, Deformable DETR [16] introduces a deformable attention mechanism that adaptively adjusts the input features, markedly increasing flexibility in handling occlusion and better capturing relations between targets. However, the self-attention introduced by Transformers carries quadratic time complexity and is hard to deploy on high-resolution feature maps. In addition, DETR requires hand-designed query numbers and other hyperparameters, which limits its flexibility across scenarios: for datasets of different crowd densities, such as CityPersons [17] and CrowdHuman [18], the query number for the latter must be set to twice that of the former, otherwise performance falls below convolutional neural network (CNN) baselines. In body-part-based feature processing, given that parts of the body remain visible under occlusion, researchers have further proposed part-model-based methods. Following a divide-and-conquer strategy, these methods design separate detectors for different body parts according to their characteristics, assisting whole-body pedestrian detection. For example, Bi-Center [19] and OAF-Net [20] design multiple center-prediction branches in post-processing, each targeting pedestrians with a different degree of occlusion. However, these three categories of methods share common problems that remain unresolved: they do not fully account for the difficulties of the feature-extraction stage, so sufficient global information is hard to obtain on high-resolution feature maps, and the network is highly susceptible to background noise. These issues limit the ability of existing methods to handle pedestrian occlusion.
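The center-and-scale formulation that CSP introduced, and that GFFIE-Net adopts, can be made concrete with a short decoding sketch. This is a minimal NumPy routine, assuming a stride-4 heatmap, a log-height scale map, and the fixed 0.41 width-to-height ratio that CSP uses for pedestrians; the threshold and shapes are illustrative, and in practice the output would still pass through NMS.

```python
import numpy as np

def decode_centers(center, scale, offset, stride=4, thresh=0.5):
    """Turn CSP-style head outputs into pedestrian boxes.

    center: (H, W) center-point confidence heatmap
    scale:  (H, W) predicted log-height at each location
    offset: (2, H, W) sub-stride (dy, dx) center refinement
    Width is taken as 0.41 * height, CSP's fixed pedestrian aspect ratio.
    Returns a list of (x1, y1, x2, y2, confidence) tuples.
    """
    boxes = []
    ys, xs = np.where(center > thresh)        # candidate center cells
    for y, x in zip(ys, xs):
        h = np.exp(scale[y, x]) * stride      # height in input-image pixels
        w = 0.41 * h
        cy = (y + offset[0, y, x] + 0.5) * stride   # refined center (pixels)
        cx = (x + offset[1, y, x] + 0.5) * stride
        boxes.append((cx - w / 2, cy - h / 2,
                      cx + w / 2, cy + h / 2, center[y, x]))
    return boxes
```

Predicting only a center confidence and a height per location is what lets anchor-free detectors of this family dispense with anchor hyperparameters entirely.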
Guo Yongcun, Yang Tun, Wang Shuang. Multi-object real-time detection of mine electric locomotive based on improved YOLOv4-tiny[J]. Advanced Engineering Sciences, 2023, 55(5): 232-241.
Ren Shaoqing, He Kaiming, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi:10.1109/tpami.2016.2577031
He Kaiming, Gkioxari G, Dollár P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397. doi:10.1109/tpami.2018.2844175
Cai Zhaowei, Vasconcelos N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6154-6162. doi:10.1109/cvpr.2018.00644
Law H, Deng Jia. CornerNet: detecting objects as paired keypoints[C]//Computer Vision - ECCV 2018. Cham: Springer, 2018: 765-781. doi:10.1007/978-3-030-01264-9_45
Duan Kaiwen, Bai Song, Xie Lingxi, et al. CenterNet: keypoint triplets for object detection[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2019: 6568-6577. doi:10.1109/iccv.2019.00667
Liu Wei, Liao Shengcai, Ren Weiqiang, et al. High-level semantic feature detection: a new perspective for pedestrian detection[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 5182-5191. doi:10.1109/cvpr.2019.00533
Wang Xinlong, Xiao Tete, Jiang Yuning, et al. Repulsion loss: detecting pedestrians in a crowd[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7774-7783. doi:10.1109/cvpr.2018.00811
Abdelmutalab A, Wang Chunyan. Pedestrian detection using MB-CSP model and boosted identity aware non-maximum suppression[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(12): 24454-24463. doi:10.1109/tits.2022.3196854
Tang Yi, Liu Min, Li Baopu, et al. OTP-NMS: toward optimal threshold prediction of NMS for crowded pedestrian detection[J]. IEEE Transactions on Image Processing, 2023, 32: 3176-3187. doi:10.1109/tip.2023.3273853
Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03)[2024-10-20].
Zhang Shanshan, Benenson R, Schiele B. CityPersons: a diverse dataset for pedestrian detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2017: 4457-4465. doi:10.1109/cvpr.2017.474
Shao Shuai, Zhao Zijian, Li Boxun, et al. CrowdHuman: a benchmark for detecting human in a crowd[EB/OL]. (2018-04-30)[2024-10-20].
Li Qiming, Bi Yuquan, Cai Rongsheng, et al. Occluded pedestrian detection through Bi-Center prediction in anchor-free network[J]. Neurocomputing, 2022, 507: 199-207. doi:10.1016/j.neucom.2022.08.026
Li Qiming, Su Yijing, Gao Yin, et al. OAF-Net: an occlusion-aware anchor-free network for pedestrian detection in a crowd[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(11): 21291-21300. doi:10.1109/tits.2022.3171250
Li Ruihong, Fu Zhitao, Zhang Shaochen, et al. Nighttime object detection in infrared and visible images based on multi-attention mechanism[J]. Infrared Technology, 2024, 46(12): 1371-1379.
Zang Ying, Cao Runlong, Li Hui, et al. MAPD: multi-receptive field and attention mechanism for multispectral pedestrian detection[J]. The Visual Computer, 2024, 40(4): 2819-2831. doi:10.1007/s00371-023-02988-7
Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces[EB/OL]. (2023-12-01)[2024-10-20].
Zhu Lianghui, Liao Bencheng, Zhang Qian, et al. Vision Mamba: efficient visual representation learning with bidirectional state space model[EB/OL]. (2024-11-14)[2024-10-20].
Ding Zhengze, Nie Rencan, Li Jintao, et al. MTFuse: an infrared and visible image fusion network based on Mamba and Transformer[J]. Computer Science, 2025, 52(8): 188-194.
Shi Yangyu, Xie Chengjie, Zheng Diwen, et al. Multi-scale anomaly behavior detection based on Mamba-CNN[J/OL]. Journal of Beijing University of Aeronautics and Astronautics, [2025-01-08].
Sun Ke, Xiao Bin, Liu Dong, et al. Deep high-resolution representation learning for human pose estimation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 5686-5696. doi:10.1109/cvpr.2019.00584
Dollar P, Wojek C, Schiele B, et al. Pedestrian detection: an evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(4): 743-761. doi:10.1109/tpami.2011.155
Quan Yu, Zhang Dong, Zhang Liyan, et al. Centralized feature pyramid for object detection[J]. IEEE Transactions on Image Processing, 2023, 32: 4341-4354. doi:10.1109/tip.2023.3297408
He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778. doi:10.1109/cvpr.2016.90
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04)[2024-10-20].
Li Jiachen, Hassani A, Walton S, et al. ConvMLP: hierarchical convolutional MLPs for vision[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway: IEEE, 2023: 6307-6316. doi:10.1109/cvprw59228.2023.00671
Song Tao, Sun Leiyu, Xie Di, et al. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 554-569. doi:10.1007/978-3-030-01234-2_33
Cao Jiale, Pang Yanwei, Han Jungong, et al. Taking a look at small-scale pedestrians and occluded pedestrians[J]. IEEE Transactions on Image Processing, 2020, 29: 3143-3152. doi:10.1109/tip.2019.2957927
Zhou Chunluan, Yuan Junsong. Bi-box regression for pedestrian detection and occlusion estimation[C]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 138-154. doi:10.1007/978-3-030-01246-5_9
Huang Xin, Ge Zheng, Jie Zequn, et al. NMS by representative region: towards crowded pedestrian detection by proposal pairing[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10747-10756. doi:10.1109/cvpr42600.2020.01076
Zhang Shifeng, Wen Longyin, Bian Xiao, et al. Occlusion-aware R-CNN: detecting pedestrians in a crowd[C]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 657-674. doi:10.1007/978-3-030-01219-9_39
Liu Mengyin, Zhu Chao, Wang Jun, et al. Adaptive pattern-parameter matching for robust pedestrian detection[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2154-2162. doi:10.1609/aaai.v35i3.16313
Lin Zebin, Pei Wenjie, Chen Fanglin, et al. Pedestrian detection by exemplar-guided contrastive learning[J]. IEEE Transactions on Image Processing, 2022, 32: 2003-2016. doi:10.1109/tip.2022.3189803
Song Xiaolin, Chen Binghui, Li Pengyu, et al. PRNet++: learning towards generalized occluded pedestrian detection via progressive refinement network[J]. Neurocomputing, 2022, 482: 98-115. doi:10.1016/j.neucom.2022.01.056
Yuan Jing, Stathaki T, Ren Guangyu. Mean height aided post-processing for pedestrian detection[EB/OL]. (2024-08-24)[2025-01-08].
Jiang Hangzhi, Liao Shengcai, Li Jinpeng, et al. Urban scene based semantical modulation for pedestrian detection[J]. Neurocomputing, 2022, 474: 1-12. doi:10.1016/j.neucom.2021.11.091
Lin Xinchen, Tang Yang, Zhao Chaoqiang, et al. Visible attention mechanism-based anchor-free model for pedestrian detection[J]. Control Engineering of China, 2024, 31(3): 535-544.
Tang Shuyuan, Zhou Yiqing, Li Jintao, et al. Dual attention pedestrian detector for occlusion scenario based on feature calibration[J]. Journal of Xidian University, 2024, 51(6): 25-39.
Liu Mengyin, Jiang Jie, Zhu Chao, et al. VLPD: context-aware pedestrian detection via vision-language semantic self-supervision[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 6662-6671. doi:10.1109/cvpr52729.2023.00644
Zhang Tao, Cao Yahui, Zhang Le, et al. Efficient feature fusion network based on center and scale prediction for pedestrian detection[J]. The Visual Computer, 2023, 39(9): 3865-3872. doi:10.1007/s00371-022-02528-9
Liu Songtao, Huang Di, Wang Yunhong. Receptive field block net for accurate and fast object detection[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 404-419. doi:10.1007/978-3-030-01252-6_24
Hosang J, Benenson R, Schiele B. Learning non-maximum suppression[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2017: 6469-6477. doi:10.1109/cvpr.2017.685
Hu Han, Gu Jiayuan, Zhang Zheng, et al. Relation networks for object detection[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 3588-3597. doi:10.1109/cvpr.2018.00378
Zhang Shanshan, Chen Di, Yang Jian, et al. Guided attention in CNNs for occluded pedestrian detection and re-identification[J]. International Journal of Computer Vision, 2021, 129(6): 1875-1892. doi:10.1007/s11263-021-01461-z
Zhang Yuang, He Huanyu, Li Jianguo, et al. Variational pedestrian detection[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 11617-11626. doi:10.1109/cvpr46437.2021.01145
Wang Jianfeng, Song Lin, Li Zeming, et al. End-to-end object detection with fully convolutional network[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 15844-15853. doi:10.1109/cvpr46437.2021.01559
Li Jun, Bi Yuquan, Wang Sumei, et al. CFRLA-Net: a context-aware feature representation learning anchor-free network for pedestrian detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 4948-4961. doi:10.1109/tcsvt.2023.3245613
He Haoyang, Li Zhishan, Tian Guanzhong, et al. Towards accurate dense pedestrian detection via occlusion-prediction aware label assignment and hierarchical-NMS[J]. Pattern Recognition Letters, 2023, 174: 78-84. doi:10.1016/j.patrec.2023.08.019
Zhang Yi, Luo Chen. A dynamic label assignment strategy for one-stage detectors[J]. Neurocomputing, 2024, 577: 127383. doi:10.1016/j.neucom.2024.127383