AViT-UNet: An efficient medical image segmentation method based on multiple attention mechanisms

Haobin XU; Zhendong LI; Yiqiang WU; Hao LIU; Shuai LI

doi:10.12178/1001-0548.2025107

Journal of University of Electronic Science and Technology of China, 2026, Vol. 55, Issue 3: 473-480. DOI: 10.12178/1001-0548.2025107

Computer Engineering and Applications

AViT-UNet: An efficient medical image segmentation method based on multiple attention mechanisms

Haobin XU ¹
Zhendong LI ¹,²,*
Yiqiang WU ¹,²
Hao LIU ¹,²
Shuai LI ¹,²

Abstract

Existing medical image semantic segmentation methods tend to have high computational complexity, large parameter counts, and limited accuracy, which makes them difficult to deploy on low-resource hardware and clinical edge devices. To address these issues, this paper proposes AViT-UNet, a lightweight U-Net-style vision transformer model that incorporates multiple attention mechanisms to reduce model size and latency while maintaining competitive segmentation performance. First, a lightweight convolutional module, the lightweight dilated bottleneck (LDB), is designed and used as the convolutional block in the encoder and decoder layers, which reduces the computational complexity of the model. Second, a self-attention module, efficient multi-head attention (EMHA), is introduced in the deep layers and the bottleneck layer to improve segmentation accuracy. Finally, channel and spatial attention mechanisms are applied to the skip connections and the feature input stage to strengthen the residual pathways and deepen the convolutional representations, yielding finer segmentation results. This design offsets both the high computational cost of transformer-based models and the limited ability of convolutional neural networks to capture global features. As a result, the network improves semantic segmentation accuracy while remaining light enough for deployment on resource-constrained medical devices and mobile platforms. The method is evaluated on three public medical image semantic segmentation benchmarks, Synapse, GlaS, and MoNuSeg, using multiple evaluation metrics, and the results support its competitiveness and feasibility. The implementation is available at https://github.com/shepherdxu/AViT-UNet.
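The trade-off described in the abstract can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not the LDB or EMHA design from the paper. It assumes a generic depthwise separable convolution as a stand-in for a lightweight convolution block, and plain self-attention as a stand-in for an attention layer. All function names, channel sizes, the 224-pixel input, and the strides are hypothetical values chosen for illustration.

```python
# Illustrative cost arithmetic only; none of these numbers come from the paper.

def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def token_count(image_size, stride):
    """Spatial positions in a square feature map at a given downsampling stride."""
    side = image_size // stride
    return side * side


def attention_madds(tokens, dim):
    """Multiply-adds for QK^T plus the weighted sum over V (projections ignored)."""
    return 2 * tokens * tokens * dim


# Assumed example: 3 x 3 convolution mapping 64 channels to 128 channels.
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(f"standard conv weights:  {standard}")
print(f"separable conv weights: {separable}")
print(f"reduction factor:       {standard / separable:.2f}x")

# Assumed example: 224 x 224 input, attention at stride 4 versus stride 32.
shallow = attention_madds(token_count(224, 4), 64)
deep = attention_madds(token_count(224, 32), 64)
print(f"attention cost at stride 4:  {shallow}")
print(f"attention cost at stride 32: {deep}")
print(f"ratio: {shallow // deep}x")
```

Because attention cost grows with the square of the token count, restricting self-attention to the deep, low-resolution stages keeps it affordable, while cheaper convolutions handle the high-resolution stages. This is consistent with the placement of EMHA in the deep layers and bottleneck described above.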

Keywords

semantic segmentation / convolutional neural networks / vision Transformer / attention mechanism

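Cite this article

XU H B, LI Z D, WU Y Q, LIU H, LI S. AViT-UNet: An efficient medical image segmentation method based on multiple attention mechanisms[J]. Journal of University of Electronic Science and Technology of China, 2026, 55(3): 473-480. DOI: 10.12178/1001-0548.2025107.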
