College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
Article History
Received
Accepted
Published: 2024-07-10
Issue Date: 2026-05-13
Abstract (Chinese)
To address the problems of cumbersome feature extraction, relatively limited data-augmentation methods, and uncontrollable prediction bias in regression found in existing speech-based depression detection research, this paper makes improvements on three fronts: feature construction, data augmentation, and network architecture. It proposes a depression detection model that uses a dual-path deep convolutional generative adversarial network (dual path deep convolution generative adversarial network, DP‒DCGAN) for data augmentation together with a classification-regression network. The feature-construction stage builds two-dimensional feature maps with time-frequency, linear, and nonlinear characteristics to fully characterize depression. The data-augmentation stage proposes the DP‒DCGAN to augment the data, increasing feature-map diversity so as to improve model robustness and generalization; evaluation indices based on spatial and frequency-domain characteristics are proposed to screen the generated feature maps and retain only high-quality ones. The network-architecture stage proposes a classification-regression network that reduces prediction bias by narrowing the prediction confidence interval; multi-scale convolution is introduced into the residual network of the classification branch to strengthen information interaction among features, so that the residual network fully perceives the multi-level information contained in the feature maps. Accuracy, root mean square error (RMSE), and mean absolute error (MAE) are used as the evaluation metrics of the classification-regression network. On the AVEC2014 dataset, the proposed model achieves a four-class accuracy of 94.73%, with RMSE and MAE of 4.55 and 1.11, respectively, a clear advantage over existing depression detection models.
Abstract
Objective Accurate evaluation of depression scores in patients with depression provides effective support for clinical auxiliary diagnosis and enables personalized diagnosis and treatment plans, improving the overall accuracy of clinical diagnosis and intervention and contributing significantly to patient health outcomes. Existing research on voice-based depression detection has several limitations, including complex feature extraction, a single mode of data augmentation, and uncontrollable prediction bias in regression-based estimation. This study proposes a dual-path DCGAN (DP‒DCGAN) for data generation and introduces a classification-regression network for depression score prediction, enabling effective auxiliary diagnosis of depression severity. Methods First, based on the audio characteristics of depressed patients, six types of emotional features were selected from existing speech features, and a corresponding two-dimensional feature map was constructed for each audio sample. For the MFCC features, the Teager energy operator (TEO) was fused with MFCC to form MFCC‒TEO features, which further highlight differences in energy distribution. The proposed DP‒DCGAN was then used to augment the two-dimensional feature maps of each depression level, expanding the dataset, increasing feature-map diversity, and improving model robustness and generalization. Simultaneously, an evaluation index based on spatial and frequency-domain characteristics was proposed to screen the generated feature maps and retain high-quality samples. Finally, a classification-regression network was introduced into the prediction framework to reduce prediction bias by narrowing the prediction confidence interval.
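The Teager energy operator named above has a standard discrete form, ψ[x(n)] = x²(n) − x(n−1)·x(n+1); a minimal sketch of applying it to a frame of speech samples follows (the fusion with the MFCC pipeline is the paper's own method and is not reproduced here):

```python
def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1), defined for 1 <= n <= len(x)-2."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# A constant signal carries no Teager energy; an oscillating one does.
print(teager_energy([1.0, 1.0, 1.0, 1.0]))   # [0.0, 0.0]
print(teager_energy([0.0, 1.0, 0.0, -1.0]))  # [1.0, 1.0]
```

Because the operator reacts to both amplitude and instantaneous frequency, it accentuates the energy-distribution differences that the plain MFCC energies can miss.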
For the residual network within the classification framework, multi-scale convolution was introduced to enhance information interaction among features, enabling the residual network to fully perceive the multi-level information contained in the feature maps. Results and Discussion Feature-validity tests were conducted for the six selected emotional features: MFCC, MFCC‒TEO, LPCC, and Jitter features were added sequentially on top of short-term energy, zero-crossing rate, and sound intensity, and accuracy (Acc), root mean square error (RMSE), and mean absolute error (MAE) were calculated under each input configuration. With MFCC added, Acc, RMSE, and MAE were 89.76%, 6.17, and 2.08, respectively; with MFCC‒TEO added, they reached 92.07%, 5.49, and 1.58; with MFCC‒TEO and LPCC, 93.41%, 5.09, and 1.39; and with MFCC‒TEO, LPCC, and Jitter, 94.73%, 4.55, and 1.11. Compared with MFCC alone, using MFCC‒TEO as an input feature increased Acc by 2.31 percentage points and decreased RMSE and MAE by 0.68 and 0.50, respectively, indicating that combining MFCC with TEO enhances the representation of energy-distribution differences; the MFCC‒TEO coefficients characterize depression more strongly than the MFCC coefficients. Subsequent incorporation of the LPCC and Jitter features further improved prediction accuracy. In the data-augmentation experiments, predicting depression scores from the original dataset gave Acc, RMSE, and MAE of 80.51%, 8.47, and 3.94, respectively; after augmentation with the DP‒DCGAN, these improved to 94.73%, 4.55, and 1.11.
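The RMSE and MAE figures above follow the usual definitions over paired true and predicted depression scores; a minimal sketch:

```python
import math

def rmse(y_true, y_pred):
    # root mean square error: sqrt of the mean squared residual
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # mean absolute error: mean of the absolute residuals
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae([10, 20], [13, 24]))   # (3 + 4) / 2 = 3.5
print(rmse([10, 20], [13, 24]))  # sqrt((9 + 16) / 2) ~= 3.536
```

RMSE penalizes large single-sample deviations more heavily than MAE, which is why the paper tracks both: a drop in RMSE without a matching drop in MAE would signal that only the worst outliers improved.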
Compared with the original dataset, prediction accuracy improved significantly: Acc increased by 14.22 percentage points, and RMSE and MAE decreased by 3.92 and 2.83, respectively, demonstrating that DP‒DCGAN-based data augmentation effectively expanded the dataset. In the prediction network, the classification accuracy of the original ResNet was 93.28%, while MSC‒ResNet achieved 94.73%, an improvement of 1.45 percentage points. These results confirm that the multi-scale convolution strategy extracts richer global and contextual information, after which the residual network captures detailed information, enabling the network to fully perceive the multi-scale characteristics of the input feature maps and improving overall model performance. Conclusions This study proposes a depression diagnosis model based on a deep generative network and a classification-regression framework. The MFCC‒TEO feature is obtained by introducing TEO into MFCC, and six features, including MFCC‒TEO, are extracted to construct a two-dimensional feature map incorporating time-frequency, linear, and nonlinear properties. By constructing the DP‒DCGAN, the feature maps corresponding to each depression score in the original dataset are augmented to increase feature diversity, and evaluation indicators are proposed to screen high-quality feature maps from both spatial and frequency-domain perspectives. High-quality, diversified feature maps significantly improve the overall performance of the model. Finally, the proposed MRVN classification-regression network is applied to predict depression scores: a multi-scale convolution module is added to the ResNet classification network to address the limitation of single-scale receptive fields in feature extraction, exploiting the characteristics of the feature maps constructed in this study.
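The multi-scale idea behind MSC‒ResNet (parallel branches with different kernel sizes whose outputs are combined) can be sketched framework-free in one dimension; the kernel sizes 1/3/5 and the concatenation below are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def conv1d_same(x, kernel):
    # zero-padded 'same' cross-correlation (odd kernel lengths only)
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel) for i in range(len(x))])

def multi_scale_block(x, kernels):
    # parallel branches with different receptive fields, outputs concatenated
    return np.concatenate([conv1d_same(x, k) for k in kernels])

x = np.array([1.0, 2.0, 3.0, 4.0])
branches = [np.ones(1), np.ones(3) / 3, np.ones(5) / 5]  # receptive fields 1, 3, 5
out = multi_scale_block(x, branches)
print(out.shape)  # (12,)
```

The size-1 branch passes fine detail through unchanged while the wider branches average over growing neighborhoods, so the downstream residual layers see both local and contextual views of the same feature map.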
In addition, combining the classification and regression strategies allows the input data to be predicted on a more uniform scale, reducing the large prediction deviations commonly observed in regression tasks.
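One common way to realize such a classify-then-regress scheme is to let the classifier pick a severity band and clamp the regressed score to that band, which bounds the worst-case deviation. The sketch below assumes standard BDI-II severity cut-offs; the actual MRVN design may partition the scale differently:

```python
# Assumed BDI-II severity bands (minimal/mild/moderate/severe); illustrative only.
BANDS = {0: (0, 13), 1: (14, 19), 2: (20, 28), 3: (29, 63)}

def predict_score(band_id, raw_regression_score):
    """Clamp the regressor's raw output to the classifier's band,
    narrowing the prediction interval from the full 0-63 range."""
    lo, hi = BANDS[band_id]
    return min(max(raw_regression_score, lo), hi)

print(predict_score(1, 32.0))  # regressor overshoots the 'mild' band: clamped to 19
print(predict_score(2, 22.5))  # already inside the 'moderate' band: 22.5
```

With this structure, a regression outlier can be wrong by at most the width of the predicted band rather than the full score range, which is the interval-narrowing effect the abstract describes.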