A common problem in video captioning is that generated captions describe objects too generically, mainly because the model does not fully learn the object information in the video. Videos also contain many kinds of feature information, such as object, motion, and contextual information, which makes it difficult for the model to focus on the key information when generating captions. To address these problems, this paper proposes a method based on enhanced object learning and an attention network. First, a new enhanced object learning module is designed to fully learn the object information in videos, so that video content can be captioned accurately. Second, an attention network is constructed to dynamically adjust the weights of the different types of information, strengthening the model’s ability to learn key information when generating captions. In experiments on the MSVD and MSR-VTT datasets, the captions generated by the proposed method are more specific and accurate, and the method outperforms current advanced methods on various evaluation metrics, which verifies its feasibility.
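As a rough illustration of what dynamically adjusting the weights of different information types can look like, the following minimal sketch scores each feature type against a query vector with scaled dot-product attention and fuses the features with the resulting softmax weights. It is not the attention network proposed in this paper: the function names, the use of a single query vector as a stand-in for the decoder state, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_features(query, features):
    """Fuse feature vectors using scaled dot-product attention weights.

    query    -- hypothetical stand-in for the decoder state (list of floats)
    features -- dict mapping a feature-type name to a vector as long as query
    Returns (weights keyed by feature-type name, fused vector).
    """
    names = list(features)
    scale = math.sqrt(len(query))
    scores = [dot(query, features[name]) / scale for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy values invented for illustration; they are not taken from the paper.
query = [1.0, 0.0, 0.5, 0.0]
features = {
    "object": [0.9, 0.1, 0.4, 0.0],
    "motion": [0.1, 0.8, 0.0, 0.2],
    "context": [0.3, 0.3, 0.3, 0.3],
}
weights, fused = fuse_features(query, features)
```

With these toy values, the object features align most closely with the query and therefore receive the largest weight. In a trained model the query and the features would be learned representations, and the weights would change at every decoding step.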
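As Figure 4(a)-(d) shows, the EOLM-AN model has several advantages in generating video captions. First, in Figure 4(a) and (b), the HMN model identifies the liquid being poured into the pan and the item in the girl’s hand as “water” and “ball”, whereas the EOLM-AN model correctly identifies them as “oil” and “egg”. This indicates that the EOLM-AN model learns the object information in videos more effectively and generates more specific and accurate captions. Second, in Figure 4(c) and (d), the HMN model generates “a monkey is fighting” and “two girls are playing with toys”, whereas the EOLM-AN model generates “a monkey is doing karate” and “a girl is painting eggs”. This indicates that the EOLM-AN model understands the video content more accurately and generates captions that better match its semantics.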
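Figure 4(e) and (f) show that, when processing videos that span a long duration, the EOLM-AN model captures some features of each action. However, because it lacks sufficient contextual information, the model cannot accurately understand the relationships between these actions. It therefore generates captions of local actions only, such as “a man is slicing a lemon” and “a person is folding a piece of paper”. This indicates that the EOLM-AN model does not achieve a holistic understanding of the entire video.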