Objective  Visual detection is a critical technology for roadside perception in vehicle-road cooperative systems. In practical applications, however, limited computing resources make it challenging to achieve high detection accuracy and computational efficiency simultaneously. This study proposes a method based on an improved YOLOv5 combined with CombineSORT for image recognition and target tracking, which, as demonstrated by experimental results, achieves strong detection performance at a low computational time cost.

Methods  First, a Multi-scale Feature Enhancement (MFE) module, composed mainly of Scale Fusion, CombineFPN, and Pixel-Region Attention, was applied to the FPN of YOLOv5 to extract shallow target details. A super-efficient IoU (SEIOU) loss function and network pruning were then applied to improve convergence and reduce model complexity: the loss was calculated from the differences in length, width, and diagonal between the detection boxes and the ground-truth boxes, while batch normalization (BN) layer sparsification was used to filter convolutional channels. Second, by combining DeepSORT, StrongSORT, and Bot-SORT, a new multi-target tracking method named CombineSORT was presented. The basic framework of DeepSORT was adopted, and BotNet with a ResNet50 backbone was used to extract appearance features; Kalman filtering was replaced by polynomial fitting to improve trajectory smoothness, while the joint similarity matrix from StrongSORT was used to match targets with trajectories. Based on the operational procedure of the proposed algorithm, a series of experiments was designed to validate its effectiveness. Using images from real intersections, ablation tests verified the effectiveness of each improved module and its contribution to model size.
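The paper's exact SEIOU formulation is not reproduced here. As an illustrative sketch only, the following function assumes an EIoU-style decomposition into width, height, and center-distance penalties normalized by the enclosing box (consistent with the length/width/diagonal description above, but with hypothetical, unweighted terms), not the authors' definitive implementation:

```python
def seiou_like_loss(pred, gt):
    """Illustrative IoU-style loss penalising length, width, and diagonal
    differences between axis-aligned boxes (x1, y1, x2, y2).

    A sketch of the general idea described in the text, NOT the paper's
    exact SEIOU definition.
    """
    # Plain IoU of the two boxes
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # Smallest enclosing box; its squared diagonal normalises the penalties
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2

    # Width and height difference penalties, normalised by the enclosing box
    w_pen = ((pred[2] - pred[0]) - (gt[2] - gt[0])) ** 2 / (cx2 - cx1) ** 2
    h_pen = ((pred[3] - pred[1]) - (gt[3] - gt[1])) ** 2 / (cy2 - cy1) ** 2

    # Centre-distance penalty over the enclosing-box diagonal
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    d_pen = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) / diag2

    return 1.0 - iou + w_pen + h_pen + d_pen

print(seiou_like_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes -> 0.0
```

As with EIoU-family losses, a perfectly matched box yields zero loss, and any mismatch in size or position adds a positive penalty beyond the plain 1 − IoU term.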
The algorithm was then compared with classical methods on intersection video streams with varying traffic volumes, all executed on a mobile edge computer (MEC) with limited computing resources.

Results and Discussions  In the ablation tests, the original YOLOv5 achieved an mAP@90 of 0.894 with 21.2 M parameters. Scale Fusion, CombineFPN, and Pixel-Region Attention increased the mAP@90 of the original model to 0.91, 0.923, and 0.916, respectively, while increasing the parameter quantity to 24.4 M, 25.3 M, and 24.1 M. The YOLOv5 model integrating all three modules achieved an mAP@90 of 0.939 with 31.0 M parameters; network pruning then reduced the parameter quantity to 6.6 M while maintaining an mAP@90 of 0.937. In three groups of real intersection experiments, the average recall rates for Groups 1 to 3 were 97.68%, 95.83%, and 96.76%, and the multiple object tracking accuracy (MOTA) values were 0.944, 0.890, and 0.910, respectively. Among all target categories, pedestrians and non-motorized vehicles exhibited relatively poor detection performance. In Group 2 in particular, the recall rate and MOTA for pedestrians were 89.98% and 0.75, while those for non-motorized vehicles were as low as 84.5% and 0.675. This occurred because these two target types are relatively small and do not strictly follow traffic rules, which caused frequent occlusion and increased trajectory prediction difficulty. In addition, the recall rates of buses and trucks were nearly 3 percentage points lower than those of cars; in that group they were only 94.81% and 94.92%, respectively. This occurred because box trucks and buses have similar appearance features, which increases the likelihood of misidentification from rear perspectives.
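The MOTA values quoted throughout follow the standard CLEAR MOT definition (Bernardin and Stiefelhagen): MOTA = 1 − (FN + FP + IDSW) / GT, where FN counts missed targets, FP false positives, IDSW identity switches, and GT the total ground-truth objects. A minimal sketch with purely hypothetical counts (not the paper's data):

```python
def mota(fn, fp, idsw, gt):
    """CLEAR MOT accuracy: 1 - (misses + false positives + ID switches) / GT objects."""
    return 1.0 - (fn + fp + idsw) / gt

# Hypothetical per-sequence counts, chosen only to illustrate the formula
print(round(mota(fn=40, fp=15, idsw=5, gt=1000), 3))  # -> 0.94
```

Note that MOTA can be negative when total errors exceed the number of ground-truth objects, which is why dense, heavily occluded scenes (such as the Group 2 pedestrian case above) pull the score down sharply.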
When the overall processing performance of different algorithms was compared at low-volume intersections, the worst result was a recall rate of 96.54% with a MOTA of 0.938, and the best was a recall rate of 97.69% with a MOTA of 0.946. These results indicate that most algorithms perform well under sparse target conditions and that lightweight models are advantageous when computational resources are constrained. At high-volume intersections, however, although the lightweight algorithm based on EfficientNet and ByteTrack exhibited the shortest computation delay, its recall rate and MOTA were only 91.75% and 0.817. In contrast, algorithms based on YOLOv5, YOLOX, YOLOv7, and the improved YOLOv5 proposed in this study achieved recall rates from 95.26% to 96.28%, and algorithms combined with DeepSORT, StrongSORT, Bot-SORT, and CombineSORT achieved MOTA values from 0.887 to 0.901. Most of these methods, however, exhibited computation times exceeding 80 ms, which prevented real-time operation. Among the algorithms with computation times below 80 ms, the proposed method based on the improved YOLOv5 and CombineSORT achieved the best overall performance, with a recall rate of 96.27% and a MOTA of 0.900, confirming its ability to balance detection accuracy and computational efficiency.

Conclusions  This study focuses on traffic target perception from a fixed roadside perspective, and the results demonstrate the effectiveness and accuracy of the proposed algorithm. Compared with other commonly used algorithms, the proposed approach achieves both higher detection accuracy and lower time cost at high-volume intersections, indicating strong application potential in vehicle-road cooperation scenarios.
To better support engineering practice, further research can be conducted to enhance recognition and tracking performance on continuous image sequences under adverse weather conditions.
For the p-th order polynomial fitting problem, let the vectors x = (x_1, x_2, …, x_n)^T and y = (y_1, y_2, …, y_n)^T denote the independent- and dependent-variable values of n observed samples, respectively. The samples can be fitted by the polynomial

\hat{y} = a_0 + a_1 x + a_2 x^2 + \cdots + a_p x^p, \quad n \ge p,

where (a_0, a_1, …, a_p) is the coefficient vector. The sum of squared errors R^2 between each observed y_i and its fitted value \hat{y}_i can then be expressed as

R^2 = \sum_{i=1}^{n} \Big( y_i - \sum_{j=0}^{p} a_j x_i^{j} \Big)^2. \quad (6)

According to the least-squares principle, setting the partial derivatives of Eq. (6) with respect to (a_0, a_1, …, a_p) to zero yields the normal equations

\boldsymbol{V}^{\mathrm{T}} \boldsymbol{V} \boldsymbol{a} = \boldsymbol{V}^{\mathrm{T}} \boldsymbol{y}, \quad (7)

where a = (a_0, a_1, …, a_p)^T and V is the p-th order Vandermonde matrix of x. Solving Eq. (7) gives the polynomial fitting coefficients.
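The least-squares solution above maps directly onto a Vandermonde formulation in code. The sketch below (function name illustrative) builds V with NumPy and solves the system with a least-squares solver rather than forming (V^T V)^{-1} explicitly, which is numerically safer for ill-conditioned Vandermonde matrices:

```python
import numpy as np

def polyfit_lsq(x, y, p):
    """Least-squares polynomial fit via the Vandermonde system.

    Solves V^T V a = V^T y for the coefficient vector a = (a0, ..., ap)^T,
    where column j of V holds x**j (j = 0, ..., p).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    V = np.vander(x, N=p + 1, increasing=True)  # Vandermonde matrix of x
    # lstsq solves min ||V a - y||^2 without explicitly inverting V^T V
    a, *_ = np.linalg.lstsq(V, y, rcond=None)
    return a

# Noise-free quadratic samples: y = 1 + 2x + 3x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2
a = polyfit_lsq(x, y, p=2)
print(np.round(a, 6))  # recovers coefficients [1, 2, 3]
```

For trajectory smoothing as used in CombineSORT, x would be frame timestamps and y a coordinate sequence of a tracked target, with a low order p keeping the fitted trajectory smooth.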
Bao Xuyan, Yu Bingyan, Wang Jing. Research on development status and test method of roadside sensing system in Internet of vehicles[J]. Mobile Communications, 2021, 45(6): 43–47.
Long Xuejun, Tan Zhiguo, Gao Feng. Analysis of application status of multi-sensor fusion roadside sensing technology[J]. China ITS Journal, 2021(10): 137–140.
Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 2999–3007. doi:10.1109/iccv.2017.324
Tan Mingxing, Pang Ruoming, Le Q V. EfficientDet: Scalable and efficient object detection[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020: 10778–10787. doi:10.1109/cvpr42600.2020.01079
Bochkovskiy A, Wang C Y, Liao H M. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. (2020-04-23)[2024-06-14].
Kim J H, Kim N, Park Y W, et al. Object detection and classification based on YOLO-v5 with improved maritime dataset[J]. Journal of Marine Science and Engineering, 2022, 10(3): 377. doi:10.3390/jmse10030377
Wang C Y, Bochkovskiy A, Liao H M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE, 2023: 7464–7475. doi:10.1109/cvpr52729.2023.00721
Ge Zheng, Liu Songtao, Wang Feng, et al. YOLOX: Exceeding YOLO series in 2021[EB/OL]. (2021-08-06)[2024-06-14]. doi:10.48550/arXiv.2107.08430
Zhang Mingjiang, Wang Chengyuan, Yang Jungang, et al. Research on engineering vehicle target detection in aerial photography environment based on YOLOX[C]//Proceedings of the 2021 14th International Symposium on Computational Intelligence and Design (ISCID). Hangzhou: IEEE, 2022: 254–256. doi:10.1109/iscid52796.2021.00066
Bewley A, Ge Zongyuan, Ott L, et al. Simple online and realtime tracking[C]//Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP). Phoenix: IEEE, 2016: 3464–3468. doi:10.1109/icip.2016.7533003
Zhang Yifu, Sun Peize, Jiang Yi, et al. ByteTrack: Multi-object tracking by associating every detection box[M]//Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, 2022: 1–21. doi:10.1007/978-3-031-20047-2_1
Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric[C]//Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP). Beijing: IEEE, 2018: 3645–3649. doi:10.1109/icip.2017.8296962
Du Yunhao, Zhao Zhicheng, Song Yang, et al. StrongSORT: Make DeepSORT great again[J]. IEEE Transactions on Multimedia, 2023, 25: 8725–8737. doi:10.1109/tmm.2023.3240881
Wu Wentong, Liu Han, Li Lingling, et al. Application of local fully convolutional neural network combined with YOLO v5 algorithm in small target detection of remote sensing image[J]. PLoS One, 2021, 16(10): e0259283. doi:10.1371/journal.pone.0259283
Zhang Yu, Guo Zhongyin, Wu Jianqing, et al. Real-time vehicle detection based on improved YOLO v5[J]. Sustainability, 2022, 14(19): 12274. doi:10.3390/su141912274
Wu Tianhao, Wang Tongwen, Liu Yaqi. Real-time vehicle and distance detection based on improved YOLO v5 network[C]//Proceedings of the 2021 3rd World Symposium on Artificial Intelligence (WSAI). Guangzhou: IEEE, 2021: 24–28. doi:10.1109/wsai51899.2021.9486316
Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE, 2017: 936–944. doi:10.1109/cvpr.2017.106
Zhang Zhenhua, Qin Xueying, Zhong Fan. MFE: Multi-scale feature enhancement for object detection[C]//Proceedings of the 32nd British Machine Vision Conference. [S.l.]: BMVA, 2021: 1–11. doi:10.5244/c.35.156
Cai Guanhong, Li Guoping, Wang Guozhong, et al. Lightweight traffic-light detection algorithm based on improved YOLOv5s[J]. Journal of Shanghai University (Natural Science Edition), 2024, 30(1): 94–105. doi:10.12066/j.issn.1007-2861.2411
Han Song, Pool J, Tran J, et al. Learning both weights and connections for efficient neural network[C]//Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015). Montreal: NIPS, 2015: 708.
Luo Wenhan, Xing Junliang, Milan A, et al. Multiple object tracking: A literature review[J]. Artificial Intelligence, 2021, 293: 103448. doi:10.1016/j.artint.2020.103448
Del Rosario J R B, Bandala A A, Dadios E P. Multi-view multi-object tracking in an intelligent transportation system: A literature review[C]//Proceedings of the 2017 IEEE 9th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM). Manila: IEEE, 2017: 1–4. doi:10.1109/hnicem.2017.8269524
Luo Hao, Jiang Wei, Gu Youzhi, et al. A strong baseline and batch normalization neck for deep person re-identification[J]. IEEE Transactions on Multimedia, 2020, 22(10): 2597–2609. doi:10.1109/tmm.2019.2958756
Xu Wei, Du Xiaodong, Li Ruochen, et al. Attention-enhanced StrongSORT for robust vehicle tracking in complex environments[J]. Scientific Reports, 2025, 15: 17472. doi:10.1038/s41598-025-99524-5
Li Tingting, Li Zhanbo, Mu Yuhong, et al. Pedestrian multi-object tracking based on YOLOv7 and Bot-SORT[C]//Proceedings of the SPIE Third International Conference on Computer Vision and Pattern Analysis (ICCPA 2023). Hangzhou: SPIE, 2023: 369–374. doi:10.1117/12.2684256
Bernardin K, Stiefelhagen R. Evaluating multiple object tracking performance: The CLEAR MOT metrics[J]. EURASIP Journal on Image and Video Processing, 2008, 2008(1): 246309. doi:10.1155/2008/246309