The high degree of freedom of human limbs often produces complex poses in which key points are prone to occlusion, and locating occluded key points is one of the main difficulties in human pose estimation. To this end, this paper proposes a method that combines a guided graph structure with enhanced key point location information. The method adds a location information enhancement module to HRNet, which improves the representation of the spatial location of visible key points. A visual graph neural module is integrated into the backbone network to extract features relevant to the key points and to exploit the local and global topological connectivity between key points in pixel coordinate space, so that the locations of occluded key points can be inferred. Finally, a heatmap aggregation unit and a semantic graph convolutional network update the affinity weights between key points in the semantic space; these weights represent the topological dependencies between key points under the constraints of the skeleton structure and further refine the estimates of occluded key points. The proposed model achieves an average accuracy of 78.1% on the COCO2017 test set and can accurately estimate key points that are prone to occlusion in complex poses.
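Percentage of correct keypoints (PCK) is the criterion used on MPII to evaluate how accurately a model estimates key points; it measures the proportion of key points that are detected correctly. The pose estimation field currently normalizes by 50% of the head size, i.e., PCKh@0.5, and computes the final mean PCK by setting different thresholds, which gives the proportion of correctly predicted key points out of the total. The mean PCK is computed as: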
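As a minimal sketch of how a metric of this kind is commonly computed (not necessarily the exact formula intended here), the Python snippet below counts a predicted key point as correct when its Euclidean distance to the ground truth, divided by the head size, does not exceed a threshold. The function names `pckh` and `mean_pckh`, the input layout, the visibility mask, and the choice to average over a list of thresholds are all illustrative assumptions.

```python
import math


def pckh(pred, gt, head_sizes, visible, threshold=0.5):
    """Fraction of annotated key points with normalized error <= threshold.

    pred, gt:    per-person lists of (x, y) key point coordinates
    head_sizes:  per-person head size used as the normalization factor
    visible:     per-person lists of booleans marking annotated key points
    (This input layout is an assumption made for illustration.)
    """
    correct = 0
    total = 0
    for p_kpts, g_kpts, head, vis in zip(pred, gt, head_sizes, visible):
        for (px, py), (gx, gy), v in zip(p_kpts, g_kpts, vis):
            if not v:
                continue  # skip key points that have no ground truth
            total += 1
            # Distance between prediction and ground truth, scaled by head size
            normalized_error = math.hypot(px - gx, py - gy) / head
            if normalized_error <= threshold:
                correct += 1
    return correct / total if total else 0.0


def mean_pckh(pred, gt, head_sizes, visible, thresholds=(0.1, 0.5)):
    """Average of pckh over several thresholds (assumed averaging scheme)."""
    scores = [pckh(pred, gt, head_sizes, visible, t) for t in thresholds]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # One person, three key points; the third has no annotation.
    pred = [[(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]]
    gt = [[(10.0, 12.0), (20.0, 30.0), (0.0, 0.0)]]
    head_sizes = [10.0]
    visible = [[True, True, False]]
    print(pckh(pred, gt, head_sizes, visible))       # PCKh@0.5
    print(mean_pckh(pred, gt, head_sizes, visible))  # mean over thresholds
```

In this toy input the first key point has a normalized error of 0.2 and the second of 1.0, so only the first counts as correct at a threshold of 0.5. Unannotated key points are excluded from both the numerator and the denominator.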