To address the limited generalization of grapheme-to-phoneme (G2P) models for Mongolian, a low-resource language, this study introduces the Transformer architecture and pre-trained grapheme models (GBERT) to Mongolian G2P conversion. A Mongolian grapheme-phoneme alignment corpus is constructed, and the effects of the number of encoder layers and the feed-forward network dimension on the performance of the two models are analyzed. Experimental results show that on Mongolian G2P the Transformer model reduces the Word Error Rate (WER) from 16.3%, achieved by the traditional n-gram baseline, to 13.64%, and the GBERT attention model further lowers it to 12.84%. The contributions of this study are threefold: (1) the Transformer and pre-trained attention-based models are applied to the Mongolian G2P task for the first time; (2) a Mongolian grapheme-phoneme alignment corpus is constructed, providing data support for low-resource Mongolian language research; (3) the effects of model hyperparameters and regularization strategies on performance are quantified, establishing reproducible experimental benchmarks. The findings provide both methodological and engineering references for G2P conversion in Mongolian and other morphologically complex languages.
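The WER quoted above is the exact-match metric commonly used for G2P evaluation: a word counts as an error if its predicted phoneme sequence differs from the reference pronunciation in any position. The Python sketch below illustrates that computation, together with the phoneme error rate (PER) often reported alongside it. It is an illustrative reconstruction under assumed names and data format, not the evaluation code used in this study.

```python
# Illustrative sketch (not the authors' evaluation code): WER as the fraction
# of words whose predicted phoneme sequence does not exactly match the
# reference, plus a Levenshtein-based phoneme error rate (PER).

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # delete pa
                                     dp[j - 1] + 1,         # insert pb
                                     prev + (pa != pb))     # substitute/match
    return dp[-1]

def evaluate(pairs):
    """pairs: list of (predicted, reference) phoneme-sequence tuples."""
    word_errors = sum(pred != ref for pred, ref in pairs)
    phone_errors = sum(edit_distance(pred, ref) for pred, ref in pairs)
    ref_phones = sum(len(ref) for _, ref in pairs)
    wer = 100.0 * word_errors / len(pairs)    # word error rate (%)
    per = 100.0 * phone_errors / ref_phones   # phoneme error rate (%)
    return wer, per

if __name__ == "__main__":
    # Toy entries with hypothetical romanized phoneme sequences.
    predictions = [(("s", "a", "i", "n"), ("s", "ae", "i", "n")),
                   (("b", "a", "j", "n"), ("b", "a", "j", "n"))]
    wer, per = evaluate(predictions)
    print(f"WER = {wer:.2f}%, PER = {per:.2f}%")
```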