Multimodal aspect-based sentiment analysis (MABSA) is a critical research direction in the field of affective computing, aiming to integrate multimodal information, such as text, images, and audio to achieve fine-grained analysis of sentiment toward specific aspects. Current research in MABSA faces challenges such as image noise interference and excessive reliance on local features, which compromise the accuracy and comprehensiveness of the analysis. To address these limitations, this paper proposes an innovative global-local interactive emotion analysis model (GLIEAM). On the one hand, the model employs a tandem architecture of vision transformer (Vision Transformer, ViT) and generative pre-trained transformer (GPT) to generate image descriptions, which are then concatenated with original text features, significantly enhancing information fusion and enabling a more comprehensive capture of emotional cues in multimodal data. On the other hand, to mitigate image noise, a hybrid approach combining wavelet transform and non-local means is applied for image denoising. Additionally, convolutional neural networks (CNN) and tokens-to-token vision transformer (T2T-ViT) are utilized to extract local and global image features, respectively, avoiding over-reliance on local features and achieving balanced and holistic image feature extraction. Experimental results on benchmark datasets demonstrate that the proposed method outperforms existing approaches, the accuracy reached 78.46% on the Twitter-15 dataset and 75.21% on the Twitter-17 dataset, particularly exhibiting superior performance in low-resource scenarios.
在自然语言处理领域,文本特征提取技术的发展是一个从简单到复杂、从静态到动态的进阶过程。早期,词袋模型(Bag of Words, BoW)和词频-逆文档频率方法[1]通过统计词频来对文本进行表征。这两种方法实现起来简单便捷,然而,它们存在明显的局限性,无法有效捕捉文本中的词序和语义信息。这就好比拼图时,只关注每一块拼图的数量,却忽略了它们之间的拼接顺序和整体图案的含义。为了突破这一局限,词嵌入技术横空出世,像Word2Vec、GloVe和FastText[9],它们将单词映射到低维向量空间,显著提升了语义表示能力。这就像是给每个单词赋予了一个独特的“语义指纹”,让模型能更好地理解单词的含义。不过,这些方法生成的是静态词向量,在面对上下文变化时,表现得有些力不从心。为了进一步捕捉上下文信息,上下文词嵌入方法应运而生,其中代表性的有ELMo、双向编码器表示Transformer(Bidirectional Encoder Representations from Transformers, BERT)和GPT[10]。这些模型利用预训练语言模型,生成能够根据上下文动态变化的词向量。这一突破,极大地提升了模型对复杂语言现象的理解能力,就像为模型戴上了一副“上下文理解眼镜”,让它能更准确地把握文本的内涵。此外,主题模型(如隐含狄利克雷分布(Latent Dirichlet Allocation,LDA))、句法特征、字符级特征以及图嵌入等方法,也在特定的自然语言处理任务中展现出独特的优势,为解决不同类型的文本分析问题提供了多样化的思路。近年来,预训练语言模型(如BERT、GPT和T5)[11]成为了自然语言处理领域的主流。这些模型通过在大规模语料上进行预训练,然后针对不同的自然语言处理(Natural Language Processing, NLP)任务进行微调,在多种任务上都取得了突破性的进展,引领了自然语言处理技术的新变革。在句法解析领域,传统规则方法逐渐被神经架构替代。Wang等[12]在BERT中引入图传播层,显式建模依存树结构,在PTB(Penn Tree Bank)数据集无标记依存正确率(Unlabeled Attachment Score,UAS)中高达96.20%;Xu等[13]提出跨语言依存投影框架,利用多语言BERT对齐语法结构,在50种语言上平均带标签依存关系得分(Labeled Attachment Score,LAS)达78.30%。Zhang等[14]通过异构图网络实现依存关系与语义角色的联合推理。
1.2 图像特征提取
图像特征提取技术经历了从传统手工设计到深度学习驱动下多模态融合的深刻变革,实现了显著的技术飞跃。早期,图像特征提取依赖于手工设计的特征,如尺度不变特征变换(Scale-invariant Feature Transform,SIFT)、方向梯度直方图(Histogram of Oriented Gradient,HOG)和颜色直方图[15],这些方法在尺度变换和光照变化下展现了一定的鲁棒性,但难以捕捉图像的深层语义信息,限制了其在复杂场景中的应用。随着深度学习的兴起,以AlexNet卷积神经网络、视觉几何组(Visual Geometry Group,VGG)和残差网络(Residual Network,ResNet)[16]为代表的预训练卷积神经网络通过端到端学习,自动提取具有丰富语义的图像特征,使得图像在视觉表达上具备更强的辨识度与区分度,进而显著提升了图像在各类视觉分析任务(如目标识别、图像分类、语义分割等)中的性能,为后续任务奠定了坚实基础。在多模态任务不断涌现的背景下,研究者开始探索图像与文本的联合建模方法[17]。注意力机制(如交叉注意力)和联合嵌入空间(如对比语言-图像预训练(Contrastive Language-Image Pre-training,CLIP))的引入,实现了跨模态信息的有效交互和深度融合。基于Transformer的多模态架构(如从Transformer中学习跨模态编码器表示(Learning Cross-Modality Encoder Representations from Transformers,LXMERT))进一步推动了跨模态特征的动态对齐与协同工作,显著提升了多模态任务的性能。
近年来,ViT和自监督预训练模型(如简单的对比学习表示(Simple Contrastive Learning of Representations,SimCLR)、掩码自编码器(Masked Autoencoders,MAE))取得了突破性进展[18]。ViT通过引入Transformer架构,解决了图像特征的长距离依赖建模问题,而自监督学习减少了对标注数据的依赖,降低了数据成本。图神经网络(Graph Neural Network, GNN)[19]的引入则为图像中复杂物体关系的建模提供了新维度。图像去噪技术在计算机视觉领域取得了显著进展,小波变换和非局部均值方法作为两种经典方法,各自经历了重要的技术演进。在小波变换方面,研究者们主要从三个方向进行了改进。首先,针对传统固定小波基函数的局限性,Zhang等[20]提出了基于卷积神经网络的动态小波基学习方法,显著提升了基函数的自适应能力。其次,为了克服离散小波变换可能导致的伪吉布斯效应,Chen等[21]开发了融合小波域与空间域特征的创新方法。此外,Wang等[22]将小波变换与视觉Transformer相结合,通过引入小波引导的注意力机制,实现了更有效的特征提取。在非局部均值方法方面,研究进展主要体现在两个方面。Liu等[23]采用深度学习方法改进了传统的相似性度量方式,使用CNN提取的高层语义特征替代了原有的像素块匹配方法。
TALIBR, KASHIFM, AYESHAS, et al. Text Mining: Techniques, Applications and Issues[J]. Int J Adv Comput Sci Appl, 2016, 7(11): 414-418. DOI: 10.14569/ijacsa.2016.071153 .
[2]
XUP, ZHUX T, CLIFTOND A. Multimodal Learning with Transformers: A Survey[J]. IEEE Trans Pattern Anal Mach Intell, 2023, 45(10): 12113-12132.
[3]
LINJ, WANGY, XUY, et al. Semi-IIN: Semi-Supervised Intra-Inter Modal Interaction Learning Network for Multimodal Sentiment Analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia: AAAI Press, 2025: 1411-1419. DOI: 10.1609/aaai.v39i2.32131 .
[4]
GUOX T, YUW, WANGX D. An Overview on Fine-grained Text Sentiment Analysis: Survey and Challenges[J]. J Phys: Conf Ser, 2021, 1757(1): 012038. DOI: 10.1088/1742-6596/1757/1/012038 .
LIUJ, SONGH, CHEND P, et al. A Multimodal Sentiment Analysis Model Enhanced with Non-verbal Information and Contrastive Learning[J]. J Electron Inf Technol, 2024, 46(8): 3372-3381. DOI: 10.11999/JEIT231274 .
WANGN, WUF Y, ZHAOY X, et al. Image-text Sentiment Analysis Method Based on Ensemble Learning and Multimodal Large Language Model[J/OL]. Comput Eng Appl, 2025: 1-11. (2025-06-05).
[13]
MIKOLOVT, CHENK, CORRADOG, et al. Efficient Estimation of Word Representations in Vector Space[C]//Proceedings of the 1st International Conference on Learning Representations (ICLR 2013). Scottsdale, AZ, USA: ICLR, 2013.
[14]
DEVLINJ, CHANGM W, LEEK, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: Association for Computational Linguistics, 2019: 4171-4186. DOI: 10.18653/v1/N19-1423 .
[15]
RAFFELC, SHAZEERN, ROBERTSA, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[J]. J Mach Learn Res, 2020, 21(140): 1-67.
[16]
WANGY, LIUJ, CHENZ, et al. Syntax-aware BERT with Graph Propagation for Dependency Parsing[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: ACL, 2023: 123-135.
[17]
XUH, KOEHNP. Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding Transformation[C]//Proceedings of the Second Workshop on Domain Adaptation for NLP. Kyiv: Association for Computational Linguistics, 2021: 204-213. DOI: 10.18653/v1/2021.adaptnlp-1.21 .
[18]
ZHANGR, WANGL, SUNH, et al. Dependency-driven Semantic Role Labeling with Heterogeneous Graph Networks[C]//Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI Press, 2023: 4567-4575. DOI: 10.1609/aaai.v37i11.26520 .
[19]
PLESTEDJ, GEDEONT. Deep Transfer Learning for Image Classification: A Survey[EB/OL]. (2022-05-20) [2025-08-27].
[20]
LOWED G. SIFT: A Retrospective on the Scale-Invariant Feature Transform[J]. J Comput Vis Image Process, 2022, 15(3): 45-60.
[21]
YANGD, LIX H, LIZ, et al. Prompt Fusion Interaction Transformer for Aspect-based Multimodal Sentiment Analysis[C]//2024 IEEE International Conference on Multimedia and Expo (ICME). New York: IEEE, 2024: 1-6. DOI: 10.1109/ICME57554.2024.10687885 .
[22]
CALDERON-RAMIREZS, YANGS X, ELIZONDOD. Semisupervised Deep Learning for Image Classification with Distribution Mismatch: a Survey[J]. IEEE Trans Artif Intell, 2022, 3(6): 1015-1029. DOI: 10.1109/TAI.2022.3196326 .
[23]
JIE Z, GANQUC, SHENGDINGH, et al. Graph Neural Networks: A Review of Methods and Applications[J].AI Open, 2020, 157-81. DOI: 10.1016/J.AIOPEN.2021.01.01 .
[24]
ZHANGY, LIX, WANGQ. Dynamic Wavelet Learning for Image Denoising Based on Convolutional Neural Networks[J]. IEEE Trans Image Process, 2022, 31: 4567-4580. DOI: 10.1109/TIP.2022.3185597 .
[25]
CHENH, LIUZ, SUNT. Wavelet-Spatial Domain Feature Fusion for Image Restoration Using Residual Learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 12345-12354. DOI: 10.1109/CVPR52729.2023.01191 .
[26]
WANGL, ZHOUY, ZHANGR. Wavelet-Guided Attention Mechanism in Vision Transformers for Image Enhancement[J]. IEEE Trans Multimedia, 2023, 25(3): 102-115. DOI: 10.1109/TMM.2022.3205024 .
[27]
LIUZ, WANGQ, ZHANGY. Deep Feature Similarity for Non-Local Means Image Denoising[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12345-12354.
[28]
GEETHAR, THILAGAMT, PADMAVATHYT. Retraction Note: Effective Offline Handwritten Text Recognition Model Based on a Sequence-to-sequence Approach with CNN-RNN Networks[J]. Neural Comput Appl, 2024, 36(24): 15227. DOI: 10.1007/s00521-024-10104-6 .
[29]
BALTRUSAITIST, AHUJAC, MORENCYL P. Multimodal Machine Learning: A Survey and Taxonomy[J]. IEEE Trans Pattern Anal Mach Intell, 2019, 41(2): 423-443. DOI: 10.1109/tpami.2018.2798607 .
[30]
CAIY J, LIX G, ZHANGY Y, et al. Multimodal Sentiment Analysis Based on Multi-layer Feature Fusion and Multi-task Learning[J]. Sci Rep, 2025, 15(1): 2126. DOI: 10.1038/s41598-025-85859-6 .
[31]
XUM Q, MAQ T, ZHANGH J, et al. MEF-UNet: an End-to-end Ultrasound Image Segmentation Algorithm Based on Multi-scale Feature Extraction and Fusion[J]. Comput Med Imag Graph, 2024, 114: 102370. DOI: 10.1016/j.compmedimag.2024.102370 .
[32]
WANGA, CHENH, LINZ J, et al. Rep ViT: Revisiting Mobile CNN from ViT Perspective[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2024: 15909-15920. DOI: 10.1109/CVPR52733.2024.01506 .
[33]
MIAHM S U, KABIRM M, SARWART B, et al. A Multimodal Approach to Cross-lingual Sentiment Analysis with Ensemble of Transformer and LLM[J]. Sci Rep, 2024, 14(1): 9603. DOI: 10.1038/s41598-024-60210-7 .
[34]
YUJ F, JIANGJ. Adapting Bert for Target-oriented Multimodal Sentiment Classification[C]//Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019. Macao, China: Ijcai. 2019: 5408-5414.
[35]
BOLYAD, FUC Y, DAIX L, et al. Hydra Attention: Efficient Attention with Many Heads[M]//Computer Vision-ECCV 2022 Workshops. Cham: Springer Nature Switzerland, 2023: 35-49. DOI: 10.1007/978-3-031-25082-8_3 .
[36]
ZHANGQ, FUJ, LIUX, et al. Adaptive Co-Attention Network for Named Entity Recognition in Tweets[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI Press, 2018: 5674-5681. DOI: 10.1609/aaai.v32i1.11962 .
[37]
WANGB, LUW. Learning Latent Opinions for Aspect-Level Sentiment Classification[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI Press, 2018: 5537-5544.