To address the limited generalization of grapheme-to-phoneme (G2P) models for Mongolian, a low-resource language, this study introduces the Transformer architecture and pre-trained grapheme models (GBERT) to Mongolian G2P conversion. A Mongolian grapheme-phoneme alignment corpus is constructed, and the effects of the number of encoder layers and the feed-forward network dimension on the performance of the two models are analyzed. Experimental results show that on Mongolian G2P the Transformer model reduces the Word Error Rate (WER) from 16.3%, achieved by the traditional n-gram baseline, to 13.64%, and the GBERT attention model further lowers it to 12.84%. The contributions of this study are threefold: (1) the Transformer and pre-trained attention-based models are applied to the Mongolian G2P task for the first time; (2) a Mongolian grapheme-phoneme alignment corpus is constructed, providing data support for low-resource Mongolian language research; (3) the effects of model hyperparameters and regularization strategies on performance are quantified, establishing reproducible experimental benchmarks. The findings provide both methodological and engineering references for G2P conversion in Mongolian and other morphologically complex languages.
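The WER quoted above is the exact-match metric commonly used for G2P evaluation: a word counts as an error if its predicted phoneme sequence differs from the reference pronunciation in any position. The Python sketch below illustrates that computation, together with the phoneme error rate (PER) often reported alongside it. It is an illustrative reconstruction under assumed names and data format, not the evaluation code used in this study.

```python
# Illustrative sketch (not the authors' evaluation code): WER as the fraction
# of words whose predicted phoneme sequence does not exactly match the
# reference, plus a Levenshtein-based phoneme error rate (PER).

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # delete pa
                                     dp[j - 1] + 1,         # insert pb
                                     prev + (pa != pb))     # substitute/match
    return dp[-1]

def evaluate(pairs):
    """pairs: list of (predicted, reference) phoneme-sequence tuples."""
    word_errors = sum(pred != ref for pred, ref in pairs)
    phone_errors = sum(edit_distance(pred, ref) for pred, ref in pairs)
    ref_phones = sum(len(ref) for _, ref in pairs)
    wer = 100.0 * word_errors / len(pairs)    # word error rate (%)
    per = 100.0 * phone_errors / ref_phones   # phoneme error rate (%)
    return wer, per

if __name__ == "__main__":
    # Toy entries with hypothetical romanized phoneme sequences.
    predictions = [(("s", "a", "i", "n"), ("s", "ae", "i", "n")),
                   (("b", "a", "j", "n"), ("b", "a", "j", "n"))]
    wer, per = evaluate(predictions)
    print(f"WER = {wer:.2f}%, PER = {per:.2f}%")
```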