Tangut is an extinct script characterized by complex strokes, and convolutional neural networks have recently become the mainstream approach to its recognition. With only 667 annotated characters in the existing dataset, this study sought to improve model performance by addressing overfitting caused by the small sample size and the long-tail problem caused by class imbalance, using data augmentation and transfer learning. Baseline models were compared with models incorporating these improvement strategies; the model combining both strategies achieved an average accuracy improvement of 5.65%. In addition, an improved model named YOLOv8-VNeXt is proposed, which can serve as a reference for future research transitioning from single-character recognition based on image classification to multi-character recognition based on object detection.
Looking back at earlier text-recognition models, single-character recognition can no longer meet the growing demands placed on text recognition. Taking research on other scripts as an example, in 2018 Ding Mingyu et al. [1] combined deep-learning detection algorithms with traditional OCR to recognize product parameters in images. They argued that both object detection and text recognition can be implemented as end-to-end pipelines with convolutional neural networks (CNNs), eliminating the tedious image-segmentation step required by single-character recognition. In 2020, Santoso et al. [2] used the YOLO (you only look once) model to recognize Kawi characters engraved on copper plates. In recent years, research in China on ancient-script recognition has likewise widely adopted various improved object-detection models [3-5].
Among classic object-detection algorithms, CNNs are widely used; common examples include Faster R-CNN [11], YOLO [12], and SSD (single shot multibox detector) [13]. Depending on whether candidate object regions are first generated in a separate stage, these algorithms fall into two categories: one-stage and two-stage. Faster R-CNN is a two-stage algorithm, while YOLO and SSD are one-stage. The output of object detection typically consists of object bounding boxes and class labels. Commonly used evaluation metrics include precision, recall, the F1 score, IoU (intersection over union), and mAP (mean average precision), defined respectively as

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad \text{mAP} = \frac{1}{|C|} \sum_{i \in C} AP_i,$$

where TP denotes true positives, FP false positives, and FN false negatives; A is the predicted bounding box and B the ground-truth bounding box; $AP_i$ is the area enclosed under the precision-recall (PR) curve of class $i$; and C is the set of all classes. Precision and recall are usually negatively correlated, so the F1 score, their harmonic mean, is used to assess overall model performance. IoU measures how well bounding boxes match: the closer its value is to 1, the greater the overlap between the predicted and ground-truth boxes. A higher mAP indicates better combined precision and recall across all classes and thus better overall model performance.
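To make these definitions concrete, the following minimal Python sketch computes IoU for a pair of axis-aligned boxes and precision, recall, and F1 from raw counts. The box coordinates and TP/FP/FN counts are invented for illustration, not taken from this study's experiments.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty intersections clamp to zero area).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative values only.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
print(precision_recall_f1(tp=80, fp=20, fn=10))  # (0.8, ~0.889, ~0.842)
```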
The application of object detection to text recognition typically involves two steps: first detecting text regions, then recognizing the characters within those regions. This approach has played an important role in tasks such as license-plate recognition, scene-text recognition, and document scanning. To advance the technology, researchers at home and abroad have released many high-quality public datasets, such as Chinese Text in the Wild (CTW) [14] from Tsinghua University, COCO-Text [15] from Cornell University, and the ICPR MTWI 2018 challenge dataset [16]. Research on Tangut script in this area is still in its infancy. To fill this gap, this paper builds on the latest YOLOv8 model and adapts it to the characteristics of Tangut script; the resulting improvements are shown in Figs. 2 and 3.
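As an illustration of such a detection-based pipeline, the sketch below uses the public ultralytics YOLOv8 API to detect character regions on a page image, crop them, and read off the predicted class of each region. The weights file tangut_yolov8.pt and the image file name are hypothetical placeholders, not artifacts released with this paper.

```python
# Minimal detect-then-crop sketch with the ultralytics YOLOv8 API.
from ultralytics import YOLO
from PIL import Image

model = YOLO("tangut_yolov8.pt")        # hypothetical fine-tuned weights
results = model("manuscript_page.jpg")  # run detection on a page image

page = Image.open("manuscript_page.jpg")
for result in results:
    for box, cls_id, conf in zip(result.boxes.xyxy,
                                 result.boxes.cls,
                                 result.boxes.conf):
        x1, y1, x2, y2 = map(int, box.tolist())
        char_crop = page.crop((x1, y1, x2, y2))  # one candidate character
        label = result.names[int(cls_id)]        # predicted class name
        # In a detection-based recognizer, each class corresponds to one
        # character, so detection and recognition happen in a single pass.
        print(label, float(conf))
```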
The basic building block of YOLOv8, C2f, is composed of stacked Bottleneck modules and realizes a residual structure through channel concatenation. In this work, some of the original C2f modules are replaced with the basic block of ConvNeXt. This modification reduces the number of stacked modules while introducing layer normalization (LN) and the GELU (Gaussian error linear unit) activation function, two mechanisms first applied in the Transformer models of natural language processing. Following the success of the Swin Transformer [17] in computer vision, the authors of ConvNeXt [18] systematically analyzed the ConvNeXt architecture, concluded that layer normalization and GELU can significantly improve model performance, and therefore adopted them in the basic ConvNeXt block.
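For reference, the following is a minimal PyTorch sketch of the ConvNeXt basic block described above (depthwise 7×7 convolution, layer normalization, 1×1 expansion with GELU, 1×1 projection, residual connection), following the public ConvNeXt design [18]. Layer scale and stochastic depth from the original paper are omitted for brevity, and how the block is wired into YOLOv8's backbone in place of C2f is not shown here.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Depthwise 7x7 convolution: one filter per channel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # normalizes over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as linear (channels-last)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C) for LayerNorm
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return shortcut + x        # residual connection

# Sanity check: the block preserves the feature-map shape.
y = ConvNeXtBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```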
[2] SANTOSO R, SUPRAPTO Y K, YUNIARNO E M. Kawi character recognition on copper inscription using YOLO object detection[C]//International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM). Surabaya, Indonesia: IEEE, 2020: 343-348.
[6] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2023-09-06].
[7] SZEGEDY C, LIU Wei, JIA Yangqing, et al. Going deeper with convolutions[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015: 1-9.
[8] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 770-778.
[9] HUANG Gao, LIU Zhuang, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 2261-2269.
[10] TAN Mingxing, LE Q V. EfficientNet: Rethinking model scaling for convolutional neural networks[EB/OL]. (2020-11-11)[2023-09-06].
[11] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137-1149.
[12] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 779-788.
[13] LIU Wei, ANGUELOV D, ERHAN D, et al. SSD: Single shot multibox detector[M]//Computer Vision-ECCV 2016. Cham: Springer International Publishing, 2016: 21-37.
[14] YUAN Tailing, ZHU Zhe, XU Kun, et al. A large Chinese text dataset in the wild[J]. Journal of Computer Science and Technology, 2019, 34(3): 509-521.
[15] VEIT A, MATERA T, NEUMANN L, et al. COCO-Text: Dataset and benchmark for text detection and recognition in natural images[EB/OL]. (2020-11-11)[2023-09-06].
[16] HE Mengchao, LIU Yuliang, YANG Zhibo, et al. ICPR2018 contest on robust reading for multi-type web images[C]//2018 24th International Conference on Pattern Recognition (ICPR). Beijing, China: IEEE, 2018: 7-12.
[17] LIU Ze, LIN Yutong, CAO Yue, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 9992-10002.
[18] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, 2022: 11966-11976.