Human motion recognition provides basic services for many applications, such as healthcare, safety, and entertainment, and has gradually become a hotspot in related research. To address the high computational complexity and large parameter count of the Vision Transformer (ViT), a MAPFormer framework is proposed, exploiting the facts that pooling has complexity linear in the sequence length and requires no learnable parameters. The model replaces ViT's multi-head attention module with parallel pooling modules and uses depthwise separable convolutions to enhance local features, further reducing the parameter count. Applying the method to human action recognition tasks improves recognition accuracy. Experiments achieve accuracies of 88.3% on the MiniImageNet dataset and 89.1% on the MS COCO dataset, improvements of 4.3% and 2.1% over ViT, with parameter counts reduced by 65.2 M and 58.3 M, respectively.
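As a concrete illustration of this design, below is a minimal PyTorch sketch of one such block: a parameter-free parallel-pooling token mixer standing in for multi-head attention, followed by a depthwise separable convolution for local feature enhancement. All names (ParallelPooling, DepthwiseSeparableConv, MAPFormerBlock) and details such as the pool size, normalization, and residual placement are assumptions for illustration, not the paper's actual implementation.

# Sketch of a MAPFormer-style block, based only on the description above:
# the ViT multi-head attention module is replaced by parallel pooling
# branches (a parameter-free token mixer with complexity linear in the
# sequence length), and a depthwise separable convolution enhances local
# features. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ParallelPooling(nn.Module):
    """Parameter-free token mixer: parallel average- and max-pooling branches."""

    def __init__(self, pool_size: int = 3):
        super().__init__()
        pad = pool_size // 2
        self.avg = nn.AvgPool2d(pool_size, stride=1, padding=pad,
                                count_include_pad=False)
        self.max = nn.MaxPool2d(pool_size, stride=1, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtracting the input keeps only the mixed neighborhood signal,
        # as done in pooling-based mixers such as PoolFormer.
        return (self.avg(x) - x) + (self.max(x) - x)


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by pointwise 1x1 conv for local features."""

    def __init__(self, dim: int):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class MAPFormerBlock(nn.Module):
    """One block: pooling mixer plus depthwise separable conv, each with a
    residual connection, operating on (B, C, H, W) feature maps."""

    def __init__(self, dim: int, pool_size: int = 3):
        super().__init__()
        # GroupNorm with one group: a common LayerNorm substitute for
        # channel-first (B, C, H, W) tensors.
        self.norm1 = nn.GroupNorm(1, dim)
        self.mixer = ParallelPooling(pool_size)
        self.norm2 = nn.GroupNorm(1, dim)
        self.local = DepthwiseSeparableConv(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        x = x + self.local(self.norm2(x))
        return x


if __name__ == "__main__":
    block = MAPFormerBlock(dim=64)
    feats = torch.randn(2, 64, 56, 56)  # batch of patch-embedded frames
    print(block(feats).shape)  # torch.Size([2, 64, 56, 56])

Note that, unlike attention, the pooling mixer adds no learnable weights, so the block's parameters come only from the normalization layers and the depthwise separable convolution, which is consistent with the parameter reduction the abstract reports.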