Textbook Visual Question Answering is a multimodal task in smart education that requires a deep understanding of textbook images, text, and questions to infer the correct answers. However, existing generic Visual Question Answering methods perform poorly on this task, for two main reasons. First, they can only recognize simple object attributes, lack disciplinary knowledge, and are easily distracted by redundant information unrelated to the question. Second, they struggle to capture the key information in the text. To address these problems, a textbook visual question answering method based on image description enhancement is proposed, comprising three modules: (1) Text encoding and understanding: a large language model extracts keywords from the question and retrieves the sentences in the text related to those keywords, enhancing text understanding and eliminating interference from redundant information. (2) Image encoding and description: a question-image attention mechanism, driven by the question keywords, generates fine-grained image descriptions constrained by the question, enhancing image understanding. (3) Answer prediction: a pre-trained vision-language model fuses the textual information with the visual information to improve the model's reasoning ability. Experimental results on the relevant datasets demonstrate that the proposed method effectively improves the understanding of textbook information and thereby the accuracy of answer prediction: accuracy on the test set and the validation set improves by 1.82% and 1.72%, respectively.
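To make the three-module pipeline concrete, the following is a minimal Python sketch of module (1), keyword-guided sentence retrieval. It assumes `llm` is any callable mapping a prompt string to a text completion, and it uses simple keyword-overlap scoring as a stand-in for the method's actual retrieval criterion; both are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def extract_keywords(question: str, llm: Callable[[str], str]) -> List[str]:
    """Prompt an LLM (assumed callable: prompt -> completion) for keywords."""
    prompt = f"List the key terms of this question, comma-separated:\n{question}"
    return [w.strip().lower() for w in llm(prompt).split(",") if w.strip()]

def retrieve_sentences(keywords: List[str],
                       sentences: List[str],
                       top_k: int = 3) -> List[str]:
    """Keep the top_k textbook sentences sharing the most keywords with the
    question, discarding question-irrelevant (redundant) sentences."""
    def overlap(sentence: str) -> int:
        tokens = set(sentence.lower().split())
        return sum(kw in tokens for kw in keywords)
    ranked = sorted(sentences, key=overlap, reverse=True)
    return [s for s in ranked[:top_k] if overlap(s) > 0]
```

The retained sentences, together with the keyword-conditioned captions from module (2), would then be passed to the vision-language model in module (3).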
LXMERT (Learning Cross-Modality Encoder Representations from Transformers) [26] is a large-scale Transformer model with three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. The model is pre-trained on a large corpus of image-sentence pairs with pre-training tasks including masked language modeling, masked object prediction, cross-modality matching, and image question answering.
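As a hedged illustration of this three-encoder design, the sketch below queries the LXMERT checkpoint released through Hugging Face transformers; the random tensors are placeholders for the object-detector region features and normalized boxes the model expects (e.g., 36 Faster R-CNN regions).

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What process is shown in the diagram?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)  # placeholder region features
visual_pos = torch.rand(1, 36, 4)        # placeholder normalized boxes

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
# Cross-modality encoder outputs: per-token language states, per-region
# vision states, and a pooled vector for downstream classification heads.
print(outputs.language_output.shape,
      outputs.vision_output.shape,
      outputs.pooled_output.shape)
```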
BLIP (Bootstrapping Language-Image Pre-training) [27] is a hybrid encoder-decoder multimodal architecture that transfers flexibly to both vision-language understanding and generation tasks. By bootstrapping the captions, BLIP makes effective use of noisy web data: a captioner generates synthetic captions and a filter removes the noisy ones.
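A minimal captioning sketch with a public BLIP checkpoint via Hugging Face transformers is shown below; the image path is a placeholder assumption.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("textbook_figure.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```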
BLIP-2 [28] leverages pre-trained vision and language models to improve multimodal performance while reducing training cost. The pre-trained vision model supplies high-quality visual representations, and the pre-trained language model supplies strong language generation capability. The model consists of a pre-trained image encoder, a pre-trained large language model, and a learnable Q-Former.
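Likewise, the sketch below prompts a public BLIP-2 checkpoint for a VQA-style answer; the checkpoint name and "Question: ... Answer:" prompt format follow common usage of the OPT-based release, and the image path is again a placeholder.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Note: this checkpoint is a multi-gigabyte download.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("textbook_figure.png").convert("RGB")  # placeholder path
prompt = "Question: what process does the diagram show? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
answer_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```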
[1] KEMBHAVI A, SEO M, SCHWENK D, et al. Are You Smarter than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2017: 5376-5384. DOI: 10.1109/CVPR.2017.571.
[2] ANTOL S, AGRAWAL A, LU J S, et al. VQA: Visual Question Answering[C]//2015 IEEE International Conference on Computer Vision (ICCV). New York: IEEE, 2015: 2425-2433. DOI: 10.1109/ICCV.2015.279.
[3] CAO Q X, LIANG X D, LI B L, et al. Interpretable Visual Question Answering by Reasoning on Dependency Trees[J]. IEEE Trans Pattern Anal Mach Intell, 2021, 43(3): 887-901. DOI: 10.1109/TPAMI.2019.2943456.
LI J, SU H, ZHU J, et al. Textbook Question Answering Under Instructor Guidance with Memory Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 3655-3663.
[9] KIM D, KIM S, KWAK N. Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2019: 3568-3584. DOI: 10.18653/v1/p19-1347.
[10] KIPF T N, WELLING M. Semi-supervised Classification with Graph Convolutional Networks[C]//International Conference on Learning Representations. Toulon: ICLR, 2017.
[11] MA J, LIU J, WANG Y X, et al. Relation-aware Fine-grained Reasoning Network for Textbook Question Answering[J]. IEEE Trans Neural Netw Learn Syst, 2023, 34(1): 15-27. DOI: 10.1109/TNNLS.2021.3089140.
[12] GOMEZ-PEREZ J M, ORTEGA R. ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-up and Top-down Attention[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg: Association for Computational Linguistics, 2020: 5469-5479. DOI: 10.18653/v1/2020.emnlp-main.441.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is All You Need[C]//Advances in Neural Information Processing Systems 30. Long Beach: Curran Associates, 2017: 5998-6008. DOI: 10.48550/arXiv.1706.03762.
[14] VINYALS O, TOSHEV A, BENGIO S, et al. Show and Tell: A Neural Image Caption Generator[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2015: 3156-3164. DOI: 10.1109/CVPR.2015.7298935.
[15] MAO J, XU W, YANG Y, et al. Deep Captioning with Multimodal Recurrent Neural Networks[C]//International Conference on Learning Representations. San Diego: ICLR, 2015.
[16] SILVER D, HUANG A, MADDISON C J, et al. Mastering the Game of Go with Deep Neural Networks and Tree Search[J]. Nature, 2016, 529(7587): 484-489. DOI: 10.1038/nature16961.
[17] ZAREMBA W, SUTSKEVER I, VINYALS O. Recurrent Neural Network Regularization[C]//International Conference on Learning Representations. San Diego: ICLR, 2015.
[18] PAN Y W, YAO T, LI Y H, et al. X-linear Attention Networks for Image Captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 10968-10977. DOI: 10.1109/CVPR42600.2020.01098.
[19] GUO L T, LIU J, ZHU X X, et al. Normalized and Geometry-aware Self-attention Network for Image Captioning[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 10324-10333. DOI: 10.1109/CVPR42600.2020.01034.
[20] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
[21] WU T Y, HE S Z, LIU J P, et al. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development[J]. IEEE/CAA J Autom Sin, 2023, 10(5): 1122-1136. DOI: 10.1109/JAS.2023.123618.
TEWEL Y, SHALEV Y, SCHWARTZ I, et al. ZeroCap: Zero-shot Image-to-text Generation for Visual-semantic Arithmetic[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2022: 17897-17907. DOI: 10.1109/CVPR52688.2022.01739.
[25] SU W, ZHU X, CAO Y, et al. VL-BERT: Pre-training of Generic Visual-linguistic Representations[C]//International Conference on Learning Representations. Addis Ababa: ICLR, 2020.
[26] TAN H, BANSAL M. LXMERT: Learning Cross-modality Encoder Representations from Transformers[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 5099-5110.
[27] LI J, LI D, XIONG C, et al. BLIP: Bootstrapping Language-image Pre-training for Unified Vision-language Understanding and Generation[C]//International Conference on Machine Learning. Baltimore: PMLR, 2022: 12888-12900.
[28] LI J, LI D, SAVARESE S, et al. BLIP-2: Bootstrapping Language-image Pre-training with Frozen Image Encoders and Large Language Models[C]//International Conference on Machine Learning. Honolulu: PMLR, 2023: 19730-19742.