To address the poor generalization of end-to-end speaker diarization systems caused by insufficient labeled data, a cross-domain speaker diarization method based on domain-adversarial neural network adaptation is proposed. First, a data-domain discrimination model containing a temporal pooling layer is added to the speaker diarization model. Second, a gradient reversal layer is used to train the speaker diarization task and the data-domain discrimination task adversarially. Finally, adaptation to the target data domain is completed. Experiments comparing different models in real-world scenarios show that the proposed method outperforms the other models overall. Compared with the baseline model, when the data domains do not match, the relative improvement is 4.91% for two-speaker scenarios and 5.41% for three-speaker scenarios; when the data domains match, the relative improvements are 3.81% and 5.14%, respectively. The results indicate that the proposed method effectively enhances cross-domain generalization by reducing the sensitivity of features to domain information.
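The gradient reversal layer described above can be sketched minimally as follows. This is an illustrative assumption about the mechanism only, not the paper's implementation: the layer is the identity in the forward pass, and in the backward pass it negates and scales the gradient, so the shared feature extractor is trained to *worsen* the domain discriminator and thereby learn domain-invariant features.

```python
# Minimal sketch of a gradient reversal layer (GRL), the core operation of
# domain-adversarial training. The function names and the scalar "autograd"
# below are illustrative assumptions, not the authors' code.

def grl_forward(x):
    # Forward pass: identity -- features flow through unchanged, so the
    # domain discriminator sees the same features as the diarization head.
    return x

def grl_backward(upstream_grad, lam):
    # Backward pass: flip the sign of the incoming gradient and scale it
    # by lambda. The feature extractor therefore receives a gradient that
    # *increases* the domain-discrimination loss, pushing it toward
    # features that carry no domain information.
    return -lam * upstream_grad
```

For example, a discriminator gradient of 0.5 with lambda = 1.0 reaches the feature extractor as -0.5; setting lambda = 0 disables the adversarial signal entirely, recovering plain multi-task training.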