In this paper, the Vision Transformer (ViT), a representative attention-based model for image classification, is applied to tree species classification and recognition, with the aim of finding a more accurate and efficient tree species recognition model. Three sets of comparative experiments are designed: (1) ViT and ResNet50 are trained, validated, and tested on the experimental-environment dataset; (2) ViT is trained at different depths; and (3) ViT and ResNet50 are trained, validated, and tested on the real-environment dataset. The results show that the classification performance of ViT is comparable to that of ResNet50 on both the experimental-environment and the real-environment datasets, while the time efficiency of ViT is significantly better than that of ResNet50. In addition, class activation heat maps are presented for the classification of real-environment images. They show that ViT attends mainly to the leaves themselves, especially the leaf edges, and ignores the complex background. Although the two models are comparable in classification accuracy, ViT converges significantly faster and shows stronger feature learning and generalization ability. Reducing the network depth further improves the time efficiency of ViT. This study is a useful attempt to apply ViT to the specific task of tree species classification and identification. It also lays a foundation for subsequent research on tree species identification with higher efficiency, smaller data requirements, and real-world Plateau Forestry datasets by integrating the advantages of ViT and CNNs.