Leveraging the results of auxiliary tasks such as image classification as prior information can facilitate the generation of high-quality descriptions for remote sensing images. However, existing methods that rely on feature fusion often fail to capture the complex interactions between features and therefore cannot fully describe the content of remote sensing images. To address these limitations, a remote sensing image captioning method based on multimodal semantic feature fusion is proposed. First, image region features are extracted with a pre-trained ResNet50 network. Then, the semantic attributes of the image are predicted by a multilayer perceptron. Next, attribute-guided and text-guided cross-attention submodules enable the interaction and fusion of image, attribute, and text features. Finally, the fused features are fed into a decoder to generate the target image description. Experimental results demonstrate that the proposed method outperforms baseline methods on various evaluation metrics, yielding more accurate and coherent descriptions.
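To make the described pipeline concrete, the following is a minimal PyTorch sketch of the fusion stage under stated assumptions: the class names (AttributePredictor, GuidedCrossAttention, FusionEncoder), the 512-dimensional common space, the 300-label attribute vocabulary, and the use of standard multi-head attention are illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of multimodal semantic feature fusion for image captioning.
# Dimensions, attribute vocabulary size, and module names are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class AttributePredictor(nn.Module):
    """MLP mapping pooled image features to multi-label semantic-attribute scores."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_attributes=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_attributes),
        )

    def forward(self, pooled_feats):                      # (B, feat_dim)
        return torch.sigmoid(self.mlp(pooled_feats))      # (B, num_attributes)


class GuidedCrossAttention(nn.Module):
    """Cross-attention where a guide modality (attributes or text) attends to image regions."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, guide, regions):                    # guide: (B, Lg, dim), regions: (B, Lr, dim)
        fused, _ = self.attn(query=guide, key=regions, value=regions)
        return self.norm(guide + fused)                   # residual fusion


class FusionEncoder(nn.Module):
    """Extract region features with ResNet50, predict attributes, and fuse
    image, attribute, and text features via the two guided cross-attention blocks."""
    def __init__(self, dim=512, num_attributes=300, vocab_size=10000):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # pre-trained backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])            # keep the spatial grid
        self.proj = nn.Linear(2048, dim)
        self.attr_predictor = AttributePredictor(num_attributes=num_attributes)
        self.attr_embed = nn.Linear(num_attributes, dim)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.attr_guided = GuidedCrossAttention(dim)
        self.text_guided = GuidedCrossAttention(dim)

    def forward(self, images, caption_tokens):
        grid = self.cnn(images)                           # (B, 2048, H, W)
        regions = grid.flatten(2).transpose(1, 2)         # (B, H*W, 2048) region features
        attrs = self.attr_predictor(regions.mean(dim=1))  # semantic attribute scores
        regions = self.proj(regions)                      # (B, H*W, dim)
        attr_feats = self.attr_embed(attrs).unsqueeze(1)  # (B, 1, dim)
        text_feats = self.word_embed(caption_tokens)      # (B, L, dim)
        attr_ctx = self.attr_guided(attr_feats, regions)  # attribute-guided fusion
        text_ctx = self.text_guided(text_feats, regions)  # text-guided fusion
        # A caption decoder would consume these fused features to generate the description.
        return torch.cat([attr_ctx, text_ctx], dim=1)


# Usage sketch with random inputs
model = FusionEncoder()
imgs = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 12))
print(model(imgs, tokens).shape)                          # torch.Size([2, 13, 512])
```

In this sketch the fused attribute- and text-guided contexts are simply concatenated before decoding; how the decoder consumes them (and how the attribute predictor is supervised) follows the paper's design rather than this illustration.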