Current multimodal sentiment analysis relies mainly on complex techniques for fusing multimodal features, yet the significant distribution differences among the features of different modalities mean that direct fusion yields poor results. To address this issue, this paper proposes an interactive learning network model that integrates a multi-subspace framework with channel attention. First, a hybrid neural network extracts features from each modality, and a stacked bidirectional long short-term memory network represents the utterance sequence at the linguistic level. The fixed-size utterance vectors are then mapped into two representations, modality-invariant and modality-specific, and the modality-specific representations undergo bimodal interaction through a temporal convolutional network. Channel attention is then leveraged to extract more meaningful information, and a cross-modal interactive bidirectional gated recurrent network together with a bimodal interactive attention mechanism is proposed for deeper interaction among the modality-invariant representation vectors; the model is optimized with the corresponding loss functions. Finally, a Transformer-based multi-head attention mechanism produces a joint vector, and a fully connected layer predicts the final result. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that the method effectively reduces the distribution differences among modalities and achieves effective multimodal fusion.
Early sentiment analysis research was mainly text-based. In today's era of information explosion, however, people express emotions through various digital platforms in increasingly diverse ways, extending from traditional plain text to multimodal forms that include images, video, and audio. This multimodal expression not only enriches the layers and depth of emotional communication but also brings unprecedented challenges and opportunities to sentiment analysis. Sentiment analysis, an important research direction at the intersection of natural language processing (NLP) and multimedia processing, aims to automatically identify and quantify the sentiment expressed in media such as text, images, and sound, and has broad application value in understanding user emotions, predicting social trends, and optimizing user experience.
For the speech in the videos, this paper uses COVAREP (collaborative voice analysis repository for speech technologies)[14], an acoustic analysis framework whose extracted features include 12 Mel-frequency cepstral coefficients, glottal source parameters, and others; the resulting feature vectors have a dimension of 74 for both the MOSI and MOSEI datasets.
2.2 Feature extraction and problem definition
Convolutional models such as CNNs are among the most commonly used techniques for visual feature extraction. Traditional CNNs rely mainly on deeper network structures and larger convolution kernels, which can not only lead to vanishing or exploding gradients but also increase the computational burden. To address these problems, this paper introduces the VGG16 (visual geometry group 16) model, which replaces a single 7×7 convolution kernel with three equivalent 3×3 kernels and a single 5×5 kernel with two equivalent 3×3 kernels. This increases the network depth without changing the receptive field, and stacking more convolutional layers strengthens the model's non-linear mapping ability, thereby improving overall network performance. Acoustic features are extracted by a TCN network, and textual features are extracted by a Bi-GRU network.
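As a minimal sketch (assuming PyTorch and illustrative channel sizes rather than the exact VGG16 configuration), the following block shows how stacked 3×3 convolutions cover the same receptive field as a single 5×5 or 7×7 kernel while adding extra non-linear layers:

```python
import torch
import torch.nn as nn

# VGG16-style block: num_convs stacked 3x3 convolutions followed by max pooling.
class StackedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_convs):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                 kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# Two 3x3 convs cover a 5x5 receptive field; three 3x3 convs cover a 7x7 one.
visual_encoder = nn.Sequential(
    StackedConvBlock(3, 64, num_convs=2),    # 5x5-equivalent receptive field
    StackedConvBlock(64, 128, num_convs=3),  # 7x7-equivalent receptive field
)
frames = torch.randn(8, 3, 224, 224)         # a batch of video frames
print(visual_encoder(frames).shape)          # torch.Size([8, 128, 56, 56])
```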
In constructing the model, this paper uses the channel attention module (CAM) from CBAM. The module first applies spatial global max pooling and global average pooling to the input features; the pooled features are then fed into a shared multi-layer perceptron (MLP) with a ReLU activation; finally, the two MLP outputs are summed and passed through a Sigmoid activation to produce the final channel attention features. The structure of the CAM is shown in Fig. 2, and it is computed as

$$M_c(F)=\sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\bigr),$$

where $F$ is the input feature map and $\sigma$ denotes the Sigmoid function.
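The following PyTorch sketch illustrates the channel attention computation described above; the reduction ratio r=16 of the shared MLP follows the CBAM paper[17] and is an assumption here rather than a setting taken from this model:

```python
import torch
import torch.nn as nn

# Minimal sketch of the CBAM channel attention module (CAM).
class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):  # reduction=16 is an assumption
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # spatial global max pooling
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # spatial global average pooling
        self.mlp = nn.Sequential(                 # shared MLP with ReLU
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_feat = self.mlp(self.max_pool(x).view(b, c))
        avg_feat = self.mlp(self.avg_pool(x).view(b, c))
        attn = torch.sigmoid(max_feat + avg_feat)  # M_c(F)
        return x * attn.view(b, c, 1, 1)           # channel-wise reweighting
```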
To reduce the differences among the data of different modalities, the MSF-CA model uses the same encoder to learn the features of all modalities and employs a similarity loss to measure the similarity between two feature vectors, with the goal of minimizing this loss during training. The central moment discrepancy (CMD) is applied here: let $X$ and $Y$ be bounded random samples with probability distributions $p$ and $q$ on the interval $[a,b]^N$; the CMD regularizer $\mathrm{CMD}_K$ is then defined as an empirical estimate of the CMD metric. CMD is an advanced distance metric that measures the discrepancy between two distributions by matching their order-wise moment differences. It is computed as

$$\mathrm{CMD}_K(X,Y)=\frac{1}{|b-a|}\bigl\|\mathbf{E}(X)-\mathbf{E}(Y)\bigr\|_2+\sum_{k=2}^{K}\frac{1}{|b-a|^{k}}\bigl\|C_k(X)-C_k(Y)\bigr\|_2,$$

where $\mathbf{E}(X)$ is the empirical expectation vector of sample $X$ and $C_k(X)=\mathbf{E}\bigl((X-\mathbf{E}(X))^{k}\bigr)$ is the vector of the $k$-th order sample central moments of the coordinates of $X$.
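A minimal PyTorch sketch of a CMD-style similarity loss is given below; the number of moments K=5 and the omission of the interval-width scaling factors (reasonable when the representations are bounded by an activation function) are assumptions for illustration, not necessarily the paper's exact setting:

```python
import torch

def cmd_loss(x, y, k_moments=5):
    """CMD between two sets of modality representations x, y of shape (N, d).

    The 1/|b-a|^k scaling factors are dropped here, assuming the
    representations are already bounded (an illustrative simplification).
    """
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)                 # first-order (mean) term
    cx, cy = x - mx, y - my
    for k in range(2, k_moments + 1):
        # k-th order central moments of each representation
        loss = loss + torch.norm(cx.pow(k).mean(dim=0) - cy.pow(k).mean(dim=0), p=2)
    return loss
```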
First, this paper chooses CMD rather than the maximum mean discrepancy (MMD) or the Kullback-Leibler (KL) divergence because CMD is not only a popular metric[16] but also performs explicit matching of higher-order moments without complex distance or kernel matrix computations. Second, although an adversarial loss offers an alternative similarity-training scheme, its discriminator and shared encoder engage in a minimax game, which adds extra parameters and complexity; the computationally simpler CMD is therefore chosen.
[1] PORIA S, CAMBRIA E, HAZARIKA D, et al. Context-dependent sentiment analysis in user-generated videos[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: ACL, 2017: 873-883.
[2] CHAUHAN D S, AKHTAR M S, EKBAL A, et al. Context-aware interactive attention for multi-modal sentiment and emotion analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: ACL, 2019: 5647-5657.
LI W X, GAN C Q. Hierarchical interactive fusion multimodal sentiment analysis based on attention mechanism[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2023, 35(1): 176-184.
[7] YANG J, YU Y, NIU D, et al. ConFEDE: contrastive feature decomposition for multimodal sentiment analysis[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: ACL, 2023: 7617-7630.
[8] WANG Y, SHEN Y, LIU Z, et al. Words can shift: dynamically adjusting word representations using nonverbal behaviors[C]//The AAAI Conference on Artificial Intelligence. Hawaii, USA: AAAI, 2019, 33(1): 7216-7223.
[9] GHOSAL D, AKHTAR M S, CHAUHAN D, et al. Contextual inter-modal attention for multi-modal sentiment analysis[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: EMNLP, 2018: 3454-3466.
XI C, LU G, YAN J. Multimodal sentiment analysis based on multi-head attention mechanism[C]//Proceedings of the 4th International Conference on Machine Learning and Soft Computing. Hai Phong, Vietnam: ICMLSC, 2020: 34-39.
[12] KUMAR A, VEPA J. Gated mechanism for attention based multi modal sentiment analysis[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New York, USA: IEEE, 2020: 4477-4481.
[13] DELBROUCK J B, TITS N, BROUSMICHE M, et al. A transformer-based joint-encoding for emotion recognition and sentiment analysis[DB/OL]. (2020-06-29) [2025-10-21].
WIBOWO H, FIRDAUSI F, SUHARSO W, et al. Facial expression recognition of 3D image using facial action coding system (FACS)[J/OL]. TELKOMNIKA (Telecommunication Computing Electronics and Control), 2019, 17(2): 628-636.
[16] VERDE L, MARULLI F, DE FAZIO R, et al. HEAR set: a lightweight acoustic parameters set to assess mental health from voice analysis[J/OL]. Computers in Biology and Medicine, 2024, 182 [2025-10-21].
[17] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany: ECCV, 2018: 3-19.
[18] MENG Z, CAO W, SUN D, et al. Research on fault diagnosis method of MS-CNN rolling bearing based on local central moment discrepancy[J/OL]. Advanced Engineering Informatics, 2022, 54 [2025-10-21].
[19] SUN Z, SARMA P, SETHARES W, et al. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020, 34(5): 8992-8999.
[20] HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis[C]//Proceedings of the 2021 International Conference on Multimodal Interaction. New York, USA: ACM, 2021: 6-15.
[21] SUN H, WANG H, LIU J, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation[C]//Proceedings of the 30th ACM International Conference on Multimedia. Los Angeles, USA: ACM, 2022: 3722-3729.
[22] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, USA: ACM, 2020: 1122-1131.
HU X R, CHEN Z H, LIU J P, et al. Sentiment analysis framework based on multimodal representation learning[J]. Computer Science, 2022, 49(S2): 631-636. (Ch).
[25] MAI S, ZENG Y, ZHENG S, et al. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2022, 14(3): 2276-2289.
[26] ZHANG Q, SHI L, LIU P, et al. RETRACTED ARTICLE: ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis[J]. Applied Intelligence, 2023, 53(12): 16332-16345.
[27] LIN H, ZHANG P, LING J, et al. PS-Mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J/OL]. Information Processing & Management, 2023, 60 [2025-10-21].