South-Central Minzu University, a. College of Computer Science; b. Hubei Provincial Engineering Research Center for Intelligent Management of Manufacturing Enterprise; c. Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, Wuhan 430074, China
To address the problem that existing skeleton-based human action recognition algorithms do not fully exploit the spatiotemporal features of motion, an improved graph convolutional network model that fuses spatial and temporal attention is proposed. The model combines a spatial attention mechanism with a temporal attention mechanism to extract global action features along both the spatial and temporal dimensions. Both mechanisms are integrated into a unified spatial temporal graph convolutional network (ST-GCN) framework, which enables end-to-end training. Comparative experiments were conducted on two public datasets, Kinetics and NTU RGB+D. On NTU RGB+D, the improved model achieves a Top-1 accuracy of 82.37% under the cross-subject (CS) protocol and 89.84% under the cross-view (CV) protocol, improvements of 0.87 and 1.54 percentage points, respectively, over the original ST-GCN algorithm. On Kinetics, the improved model achieves an accuracy of 31.78%, which is 1.08 percentage points higher than ST-GCN. These results validate the effectiveness of the improved method.
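To make the idea of combining a spatial attention mechanism with a temporal attention mechanism concrete, the following is a minimal sketch in plain Python. It is not the model proposed here: the abstract does not specify how the attention scores are computed, so the scoring rule (mean squared activation per joint and per frame), the function names, and the toy input are all assumptions chosen for illustration. A trained model would normally derive the scores from learned layers inside the ST-GCN blocks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(x):
    """One weight per joint. Assumed scoring rule: mean squared
    activation of the joint over all frames and channels."""
    T, V, C = len(x), len(x[0]), len(x[0][0])
    scores = [
        sum(x[t][v][c] ** 2 for t in range(T) for c in range(C)) / (T * C)
        for v in range(V)
    ]
    return softmax(scores)

def temporal_attention(x):
    """One weight per frame. Assumed scoring rule: mean squared
    activation of the frame over all joints and channels."""
    T, V, C = len(x), len(x[0]), len(x[0][0])
    scores = [
        sum(x[t][v][c] ** 2 for v in range(V) for c in range(C)) / (V * C)
        for t in range(T)
    ]
    return softmax(scores)

def apply_spatiotemporal_attention(x):
    """Reweight features x[t][v][c] by joint weight alpha[v] and
    frame weight beta[t]; returns (reweighted, alpha, beta)."""
    alpha = spatial_attention(x)
    beta = temporal_attention(x)
    y = [
        [[value * alpha[v] * beta[t] for value in joint]
         for v, joint in enumerate(frame)]
        for t, frame in enumerate(x)
    ]
    return y, alpha, beta

# Hypothetical toy input: 2 frames, 3 joints, 2 channels per joint.
x = [
    [[0.1, 0.0], [1.0, 0.5], [0.2, 0.1]],
    [[0.0, 0.1], [2.0, 1.0], [0.1, 0.3]],
]
y, alpha, beta = apply_spatiotemporal_attention(x)
```

In this toy input the second joint and the second frame carry the most energy, so they receive the largest spatial and temporal weights. Multiplying the two weights is only one possible fusion; a residual form that adds the reweighted features back to the input is a common alternative.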