1. School of Biomedical Engineering, Southern Medical University / Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
2. School of Biomedical Engineering, Guangdong Medical University, Dongguan 523808, China
Objective We propose a Contrastive Regional Attention and Prior Knowledge-Infused U-Transformer (CRAKUT) model to address the challenges of imbalanced text distribution, lack of contextual clinical knowledge, and cross-modal information transformation, thereby improving the quality of generated radiology reports. Methods The CRAKUT model comprises 3 key components: an image encoder that exploits common normal images from the dataset to extract enhanced visual features, an external knowledge infuser that incorporates clinical prior knowledge, and a U-Transformer that performs cross-modal information conversion from vision to language. The contrastive regional attention in the image encoder enhances the features of abnormal regions by emphasizing the differences between normal and abnormal semantic features. The clinical prior knowledge infuser within the text encoder integrates clinical history and knowledge graphs generated by ChatGPT. Finally, the U-Transformer connects the multi-modal encoder and the report decoder in a U-connection schema, and the multiple types of information are fused to produce the final report. Results We evaluated the proposed CRAKUT model on two publicly available chest X-ray (CXR) datasets, IU-Xray and MIMIC-CXR. The experimental results showed that the CRAKUT model achieved state-of-the-art performance on report generation, with a BLEU-4 score of 0.159, a ROUGE-L score of 0.353, and a CIDEr score of 0.500 on the MIMIC-CXR dataset, and a METEOR score of 0.258 on the IU-Xray dataset, outperforming all the comparison models. Conclusion The proposed method has great potential for application in clinical disease diagnosis and report generation.
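To make the contrastive regional attention idea above concrete, the following PyTorch-style sketch (our own simplification, not the authors' released implementation; the module name ContrastiveRegionalAttention and the input normal_pool are hypothetical) weights each regional visual feature by how much it deviates from features pooled over normal reference images:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveRegionalAttention(nn.Module):
    # Simplified sketch: weight regional features by their dissimilarity to
    # pooled normal-image features so that abnormal regions are emphasized.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, regions, normal_pool):
        # regions:     (B, N, D) patch features of the input image
        # normal_pool: (M, D)    features aggregated from normal reference images
        q = self.query(regions)                                    # (B, N, D)
        k = self.key(normal_pool)                                  # (M, D)
        sim = F.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)     # (B, N, M)
        normal_like = sim @ normal_pool                            # (B, N, D) closest "normal" reconstruction
        contrast = regions - normal_like                           # residual highlights abnormal content
        gate = torch.sigmoid(contrast.norm(dim=-1, keepdim=True))  # (B, N, 1) per-region weight
        return regions + gate * contrast                           # enhanced regional features

In this sketch the residual between a region and its closest "normal" reconstruction serves as the contrast signal, so regions resembling the normal pool are left nearly unchanged while abnormal regions are amplified.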
The data in this study were drawn from the publicly available chest radiograph datasets IU-Xray and MIMIC-CXR. The IU-Xray[6] dataset, provided by Indiana University, is a widely used benchmark for evaluating report generation models. It contains 7470 chest X-ray images, including frontal and lateral views, together with 3955 corresponding radiology reports; each report consists of sections such as impression, findings, and indication. The MIMIC-CXR[5] dataset, provided by the Beth Israel Deaconess Medical Center, is currently the largest publicly available chest X-ray dataset, comprising 377 110 chest X-ray images and 227 835 reports from 64 588 patients. Unlike IU-Xray, the image views in MIMIC-CXR are more diverse, and the differences between patients are substantial. To compare our results with other studies more fairly, we adopted the official dataset splits and divided each dataset into training, validation, and test sets at a ratio of 7:1:2.
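For reference, a minimal sketch of the 7:1:2 split described above, assuming a simple random shuffle over (image, report) pairs (the function name and seed are illustrative only and do not reproduce the official MIMIC-CXR split):

import random

def split_dataset(samples, ratios=(0.7, 0.1, 0.2), seed=42):
    """Split a list of (image, report) pairs into train/val/test at 7:1:2."""
    rng = random.Random(seed)
    shuffled = list(samples)                   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # remaining ~20%
    return train, val, test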
[1] Raoof S, Feigin D, Sung A, et al. Interpretation of plain chest roentgenogram[J]. Chest, 2012, 141(2): 545-58. doi:10.1378/chest.10-1302
[2] Jing BY, Xie PT, Xing E. On the automatic generation of medical imaging reports[EB/OL]. 2017. doi:10.18653/v1/p18-1240
[3] Vinyals O, Toshev A, Bengio S, et al. Show and tell: a neural image caption generator[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3156-64. doi:10.1109/cvpr.2015.7298935
[4] Liu FL, Yin CC, Wu X, et al. Contrastive attention for automatic chest X-ray report generation[EB/OL]. 2021. doi:10.18653/v1/2021.findings-acl.23
[5] Johnson AEW, Pollard TJ, Greenbaum NR, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs[EB/OL]. 2019. doi:10.1038/s41597-019-0322-0
[6] Demner-Fushman D, Kohli MD, Rosenman MB, et al. Preparing a collection of radiology examinations for distribution and retrieval[J]. J Am Med Inform Assoc, 2016, 23(2): 304-10. doi:10.1093/jamia/ocv080
[7] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Adv Neural Information Processing Systems, 2020, 33: 1877-901.
[8] Huang ZZ, Zhang XF, Zhang ST. KiUT: knowledge-injected U-transformer for radiology report generation[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 19809-18. doi:10.1109/cvpr52729.2023.01897
[9] Nguyen HTN, Nie D, Badamdorj T, et al. Automated generation of accurate & fluent medical X-ray reports[EB/OL]. 2021. doi:10.18653/v1/2021.emnlp-main.288
[10] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 4700-8. doi:10.1109/cvpr.2017.243
[11] Huang L, Wang WM, Chen J, et al. Attention on attention for image captioning[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 4634-43. doi:10.1109/iccv.2019.00473
[12] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Adv Neural Information Processing Systems, 2017, 30: 1305.
[13] Tran A, Mathews A, Xie LX. Transform and tell: entity-aware news image captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 13035-45. doi:10.1109/cvpr42600.2020.01305
[14] Pan YW, Yao T, Li YH, et al. X-linear attention networks for image captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 10971-80. doi:10.1109/cvpr42600.2020.01098
[15] Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 10578-87. doi:10.1109/cvpr42600.2020.01059
[16] Nguyen VQ, Suganuma M, Okatani T. GRIT: faster and better image captioning transformer using dual visual features[M]//Cham: Springer Nature Switzerland, 2022: 167-84. doi:10.1007/978-3-031-20059-5_10
Jing BY, Wang ZY, Xing E. Show, describe and conclude: on exploiting the structure information of chest X-ray reports[EB/OL]. 2020. doi:10.18653/v1/p19-1657
[19] Liu G, Hsu H, McDermott M, et al. Clinically accurate chest X-ray report generation[J]. PMLR, 2019, 106: 249-69.
[20] Chen ZH, Shen YL, Song Y, et al. Cross-modal memory networks for radiology report generation[EB/OL]. 2022. doi:10.18653/v1/2021.acl-long.459
[21] Chen ZH, Song Y, Chang TH, et al. Generating radiology reports via memory-driven transformer[EB/OL]. 2020. doi:10.18653/v1/2020.emnlp-main.112
[22] Li M, Liu R, Wang F, et al. Auxiliary signal-guided knowledge encoder-decoder for medical report generation[J]. World Wide Web, 2023, 26(1): 253-70. doi:10.1007/s11280-022-01013-6
Liu FL, Wu X, Ge S, et al. Exploring and distilling posterior and prior knowledge for radiology report generation[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 13753-62. doi:10.1109/cvpr46437.2021.01354
[25] Zhang Y, Wang X, Xu Z, et al. When radiology report generation meets knowledge graph[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 12910-7. doi:10.1609/aaai.v34i07.6989
[26] Yin CC, Qian BY, Wei JS, et al. Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network[C]//2019 IEEE International Conference on Data Mining (ICDM), 2019: 728-37. doi:10.1109/icdm.2019.00083
[27] You D, Liu FL, Ge S, et al. AlignTransformer: hierarchical alignment of visual regions and disease tags for medical report generation[C]//Medical Image Computing and Computer Assisted Intervention, 2021: 72-82. doi:10.1007/978-3-030-87199-4_7
[28] Wang ZY, Tang MK, Wang L, et al. A medical semantic-assisted transformer for radiographic report generation[C]//Medical Image Computing and Computer Assisted Intervention, 2022: 655-64. doi:10.1007/978-3-031-16437-8_63
[29] Yang S, Wu X, Ge S, et al. Knowledge matters: chest radiology report generation with general and specific knowledge[J]. Med Image Anal, 2022, 80: 102510. doi:10.1016/j.media.2022.102510
[30] Wang ZY, Liu LQ, Wang L, et al. METransformer: radiology report generation by transformer with multiple learnable expert tokens[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 11558-67. doi:10.1109/cvpr52729.2023.01112
[31] Li M, Lin B, Chen Z, et al. Dynamic graph enhanced contrastive learning for chest X-ray report generation[C]//2023 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 3334-43. doi:10.1109/cvpr52729.2023.00325
[32] Tanida T, Müller P, Kaissis G, et al. Interactive and explainable region-guided radiology report generation[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 7433-42. doi:10.1109/cvpr52729.2023.00718