Existing siamese networks use only spatial information and therefore suffer reduced tracking accuracy under challenges such as object occlusion, disappearance, and severe appearance deformation. To address this problem, a temporal salient attention siamese tracking network is proposed. Through an information-exchange "bridge", the network on the one hand applies salient attention to the current frame, guiding it to focus on learning the object's characteristics; on the other hand, it screens the historical object features stored in a memory network and uses them as additional templates that supply the object's appearance information, while also learning how the object's appearance and spatial position change over time to guide the subsequent detection and classification process. To further strengthen the temporal attention, a multi-scale feature extraction unit is proposed to compensate for the insufficient feature extraction of the backbone network. Tested on the GOT-10k dataset, the model improves the score by 2.4% over the object tracking algorithm STMTrack. Visualization results show that the network achieves higher accuracy under the challenges of object occlusion and disappearance.
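The memory read described above, where historical object features serve as additional templates for the current frame, can be sketched as a query-key attention over stored frames. This is a minimal NumPy sketch in the spirit of space-time memory networks such as STMTrack; the `memory_read` helper and all shapes are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def memory_read(query, mem_keys, mem_values):
    """Generic space-time memory read (illustrative, not the paper's code):
    the current-frame query attends over keys of stored historical frames
    and returns a weighted mix of their value features.
    Shapes: query (C, HW), mem_keys (T, C, HW), mem_values (T, C, HW)."""
    T, C, HW = mem_keys.shape
    # Flatten memory across time so every stored location is a candidate
    keys = mem_keys.transpose(1, 0, 2).reshape(C, T * HW)      # (C, T*HW)
    values = mem_values.transpose(1, 0, 2).reshape(C, T * HW)  # (C, T*HW)
    # Affinity between every query location and every memory location
    affinity = query.T @ keys / np.sqrt(C)                     # (HW, T*HW)
    affinity -= affinity.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)              # softmax over memory
    # Each query location reads a convex mix of historical features,
    # which can then be fused with the current-frame features
    return values @ weights.T                                  # (C, HW)
```

Because the softmax weights are a convex combination, the read-out features stay within the range of the stored historical features, which is what lets the memory act as a set of soft additional templates.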
[1] Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C]∥ European Conference on Computer Vision, Berlin, Germany, 2016: 850-865.
[2] Li B, Yan J J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]∥ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 8971-8980.
[3] Fan H, Ling H B. Siamese cascaded region proposal networks for real-time visual tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 7952-7961.
[4] Li B, Wu W, Wang Q, et al. SiamRPN++: evolution of siamese visual tracking with very deep networks[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4282-4291.
[5] Xu Y D, Wang Z Y, Li Z X, et al. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines[C]∥ Proceedings of the AAAI Conference on Artificial Intelligence, New York, USA, 2020: 12549-12556.
[6] Gupta D K, Arya D, Gavves E. Rotation equivariant siamese networks for tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 12362-12371.
[7] Yang T Y, Chan A B. Learning dynamic memory networks for object tracking[C]∥ Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 152-167.
[8] Yan B, Peng H W, Fu J L, et al. Learning spatio-temporal transformer for visual tracking[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 10448-10457.
[9] Fu Z H, Liu Q J, Fu Z H, et al. STMTrack: template-free visual tracking with space-time memory networks[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13774-13783.
[10] Zhang Z P, Peng H W, Fu J L, et al. Ocean: object-aware anchor-free tracking[C]∥ European Conference on Computer Vision, Berlin, Germany, 2020: 771-787.
[11] Voigtlaender P, Luiten J, Torr P H, et al. Siam R-CNN: visual tracking by re-detection[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6578-6588.
[12] Eom C, Lee G, Lee J, et al. Video-based person re-identification with spatial and temporal memory networks[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 12036-12045.
[13] Oh S W, Lee J Y, Xu N, et al. Video object segmentation using space-time memory networks[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019: 9226-9235.
[14] Xie H Z, Yao H X, Zhou S C, et al. Efficient regional memory network for video object segmentation[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 1286-1295.
[15] Paul M, Danelljan M, Van Gool L, et al. Local memory attention for fast video semantic segmentation[C]∥ 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 2021: 1102-1109.
[16] Wang H, Wang W N, Liu J. Temporal memory attention for video semantic segmentation[C]∥ 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, USA, 2021: 2254-2258.
[17] Yu F, Wang D Q, Shelhamer E, et al. Deep layer aggregation[C]∥ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 2403-2412.
[18] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]∥ Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 2818-2826.
[19] Tian Z, Shen C H, Chen H, et al. Fully convolutional one-stage object detection[C]∥ 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 2019: 9626-9635.
[20] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]∥ Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2980-2988.
[21] Huang L H, Zhao X, Huang K Q. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(5): 1562-1577.
[22] Cui Y T, Jiang C, Wang L M, et al. MixFormer: end-to-end tracking with iterative mixed attention[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 13608-13618.
[23] Xie F, Wang C Y, Wang G T, et al. Correlation-aware deep tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8751-8760.
[24] Wang N, Zhou W G, Wang J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 1571-1580.
[25] Zhang Z P, Liu Y H, Wang X, et al. Learn to match: automatic matching network design for visual tracking[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 13339-13348.
[26] Cui Y T, Jiang C, Wang L M, et al. Fully convolutional online tracking[J]. Computer Vision and Image Understanding, 2022, 224: 103547.
[27] Lukezic A, Matas J, Kristan M. D3S: a discriminative single shot segmentation tracker[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 7133-7142.
[28] Mayer C, Danelljan M, Bhat G, et al. Transforming model prediction for tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 8731-8740.
[29] Zhou Z K, Pei W J, Li X, et al. Saliency-associated object tracking[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9866-9875.
[30] Bhat G, Danelljan M, Van Gool L, et al. Know your surroundings: exploiting scene information for object tracking[C]∥ European Conference on Computer Vision, Berlin, Germany, 2020: 205-221.
[31] Yu Y C, Xiong Y L, Huang W L, et al. Deformable siamese attention networks for visual object tracking[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 6728-6737.
[32] Bhat G, Johnander J, Danelljan M, et al. Unveiling the power of deep tracking[C]∥ Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 483-498.
[33] Chen Z D, Zhong B E, Li G R, et al. SiamBAN: target-aware tracking with siamese box adaptive network[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 5158-5173.
[34] Bhat G, Danelljan M, Van Gool L, et al. Learning discriminative model prediction for tracking[C]∥ Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 2019: 6182-6191.