The semantics of Chinese vocabulary are ambiguous to a degree: the same word can carry different meanings in different contexts, and different words and phrases contribute differently to named entity recognition. Chinese text also contains features with low relevance to named entity recognition. Without weighting or masking, these features interfere with the model's recognition accuracy. To address this, a Chinese named entity recognition (CNER) algorithm with soft attention mask embedding is studied. A multi-level CNER model is established. In the word vector representation layer, the jieba tokenizer segments the Chinese text passed from the input layer, and Word2Vec maps each word to a word vector, forming a word vector sequence. In the BiLSTM layer, bidirectional long short-term memory processing is applied to this sequence to obtain, for each word vector, a feature vector that fuses contextual information. A soft attention mask module is embedded after the BiLSTM layer; its soft attention mechanism weights and masks the feature vectors output by the BiLSTM layer, focusing on features that contribute significantly to entity recognition and suppressing or removing unimportant ones, which improves recognition accuracy. In the CRF layer, the feature vectors processed by the soft attention mask module are labeled and decoded to obtain the optimal entity label sequence, which is the CNER result. Experiments show that the algorithm recognizes Chinese named entities accurately and performs well in entity label annotation coverage and F1 score.
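Chinese named entity recognition (CNER) is one of the core technologies of natural language processing (NLP). It can accurately identify key entities such as person names, place names, and organization names in Chinese text [1,2]. It is important for information extraction and knowledge relation mining, and it underlies applications such as intelligent dialogue systems and search engine optimization [3-5], which makes it necessary for advancing the development and application of NLP technology. However, the semantics of Chinese vocabulary are ambiguous, and the same word takes on different meanings in different contexts. Recognizing Chinese named entities therefore depends on contextual information, which makes the task more complex. An effective CNER algorithm is thus needed.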
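Named entity recognition has been studied extensively. Li et al. [6] used a Transformer model to extract features from Chinese text sequences and passed the resulting feature sequence to a hidden Markov model for label prediction, thereby recognizing Chinese named entities automatically. On long texts, however, the Transformer model is constrained by computational efficiency and storage space, so the hidden Markov model loses information when processing the features passed to it, which degrades the final recognition result. Jeon et al. [7] combined prior knowledge from a building structure lexicon with the TrAdaBoost algorithm to fine-tune a pre-built named entity recognition (NER) model so that it adapts to noisy text and accurately recognizes named entities related to building structures. Although TrAdaBoost can improve the performance of the NER model, it relies too heavily on the prior knowledge in the building structure lexicon, so the model generalizes poorly to new types of named entities and its recognition accuracy drops. Fang et al. [8] used a dictionary of proper nouns together with syntactic dependency trees, which capture the syntactic dependencies between words, to build a graph representation of the text, and passed this graph to a graph neural network model to perform named entity recognition. The proper noun dictionary provides some clues for recognition, but its coverage and update speed cannot meet the needs of practical applications, which biases the recognition results. Liao et al. [9] used the bidirectional encoder representations from transformers (BERT) model to obtain character embeddings of Chinese text, used a Transformer decoder to let character and label vectors learn interactively and thereby strengthen the character features, and introduced a multi-task learning scheme to optimize the training of a recurrent neural network (RNN) model, which was then used for CNER. Although the RNN model's training was optimized, RNNs are inherently prone to vanishing gradients on long text sequences, which affects recognition accuracy.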
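As shown in Figure 1, the CNER model designed in this paper has multiple layers. The word vector representation layer segments the Chinese text passed from the input layer and obtains a word vector for each word, forming a word vector sequence. The bidirectional long short-term memory (BiLSTM) layer processes this word vector sequence in both directions to obtain the hidden state at each time step, that is, for each word vector in the sequence, a feature vector that fuses the preceding and following context. A soft attention mask module is embedded after the BiLSTM layer. Its soft attention mechanism weights the BiLSTM output to emphasize features that contribute significantly to named entity recognition, and its mask operation suppresses unimportant features. The module outputs the sequence of Chinese word feature vectors after soft attention weighting and masking [10,11]. Finally, the conditional random field (CRF) layer performs labeling and decoding: among all candidate entity label sequences, it finds the one that best matches the input sequence, which gives the final named entity recognition result.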
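The last two stages described above, soft attention weighting with masking followed by CRF decoding, can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the dot-product scoring, the `query` vector, the `threshold` cutoff, and the function names `soft_attention_mask` and `viterbi_decode` are all assumptions made for illustration, and the CRF step is shown only as standard Viterbi decoding over given emission and transition scores.

```python
import math

def soft_attention_mask(features, query, threshold=0.1):
    """Weight and mask a sequence of feature vectors (illustrative only).

    features:  T vectors, standing in for the BiLSTM outputs.
    query:     hypothetical learned scoring vector (an assumption).
    threshold: hypothetical cutoff; positions whose attention weight
               falls below it are zeroed out.
    """
    # Relevance score for each position: dot product with the query.
    scores = [sum(x * q for x, q in zip(vec, query)) for vec in features]
    # Numerically stable softmax over the sequence positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Scale kept positions by their weight; suppress the rest.
    masked = [
        [w * x if w >= threshold else 0.0 for x in vec]
        for vec, w in zip(features, weights)
    ]
    return weights, masked

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path.

    emissions:   T x K per-position label scores.
    transitions: K x K scores, transitions[i][j] for moving from label i to j.
    """
    num_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for ending at label j in this step.
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy inputs: three positions, two-dimensional features, two labels.
features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, masked = soft_attention_mask(features, query=[1.0, 0.0])
emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
transitions = [[0.0, -10.0], [0.0, 0.0]]  # strongly penalize label 0 -> 1
path = viterbi_decode(emissions, transitions)
```

In this toy run the second position receives the smallest attention weight and is masked to zeros, while the transition penalty makes the decoder prefer a path that stays in label 1 over switching labels mid-sequence. In a trained model the scoring parameters and transition scores would be learned rather than fixed by hand.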
Wang Ying-jie, Zhang Cheng-ye, Bai Feng-bo, et al. Review of Chinese named entity recognition research[J]. Journal of Frontiers of Computer Science & Technology, 2023, 17(2): 324-341.
Zhao Ji-gui, Qian Yu-rong, Wang Kui, et al. Survey of Chinese named entity recognition research[J]. Computer Engineering and Applications, 2024, 60(1): 15-27.
Kang Yi-lin, Sun Lu-bing, Zhu Rong-bo, et al. Survey on Chinese named entity recognition with deep learning[J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2022, 50(11): 44-53.
Zhang Yun, Huang Cheng, Zhang Yu-yao, et al. Chinese named entity recognition with few labeled data[J]. Journal of Chinese Information Processing, 2023, 37(3): 101-111.
Li Jian, Xiong Qi, Hu Ya-ting, et al. Chinese named entity recognition method based on Transformer and hidden Markov model[J]. Journal of Jilin University (Engineering and Technology Edition), 2023, 53(5): 1427-1434.
Jeon K, Lee G, Yang S, et al. Named entity recognition of building construction defect information from text with linguistic noise[J]. Automation in Construction, 2022, 143: No.104543.
Fang Hong, Su Ming, Feng Yi-bo, et al. Chinese named entity recognition combined with gazetteers and syntactic dependency tree[J]. Computer Engineering and Applications, 2022, 58(18): 227-232.
Liao Meng, Jia Zhen, Li Tian-rui. Chinese named entity recognition based on label information fusion and multi-task learning[J]. Computer Science, 2024, 51(3): 198-204.
Chen Wei-da, Wang Lin-fei, Tao Da-peng. SAME-net: scene text recognition method based on soft attention mask embedding[J]. Journal of Image and Graphics, 2024, 29(5): 1381-1391.
Zhan Wen-tao, Wu Xiao-ling, Ling Jie. Chinese named entity recognition based on multi-window attention mechanism[J]. Journal of Chinese Computer Systems, 2024, 45(6): 1325-1330.
Zhao Dan-dan, Huang De-gen, Meng Jia-na, et al. Chinese named entity recognition by integrating multi-heads attention mechanism and character and words fusion[J]. Computer Engineering and Applications, 2022, 58(7): 142-149.
Li Jun-huai, Chen Miao-miao, Wang Huai-jun, et al. Chinese named entity recognition method based on ALBERT-BGRU-CRF[J]. Computer Engineering, 2022, 48(6): 89-94, 106.