基于多源持续预训练与集成检索增强生成的矿产勘查大语言模型构建
张雨昂 , 谢忠 , 田苗 , 吴麒瑞 , 吴亮 , 邱芹军 , 陈建国
Earth Science, 2026, Vol. 51, Issue 3: 1025-1039.
A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation
为解决矿产勘查场景下通用大语言模型领域语料稀缺、领域术语覆盖与语体适配不足、事实性幻觉突出的问题,构建约2 500万token规模的领域语料库,在此基础上提出课程式持续预训练策略,按术语、机制、案例三阶段组织训练数据,并配合渐进式Transformer block解冻与学习率调度,对Qwen3-1.7B进行持续预训练以实现分阶段领域适配,得到面向矿产勘查场景的大语言模型Geo-MineLLM;推理阶段集成Hybrid RAG,以混合检索与证据约束生成提升事实一致性.人工评估表明,Geo-MineLLM相较基座模型与同系列更大参数规模的模型能显著提升领域问答表现;集成Hybrid RAG后,综合领域问答表现接近GPT-4.1.该训练、推理一体化方案为矿产勘查领域大模型构建与可靠问答提供了轻量化路径.
To address the challenges faced by general-purpose large language models in mineral exploration, namely the scarcity of domain corpora, insufficient coverage of domain terminology and register adaptation, and pronounced factual hallucinations, we constructed a mineral-exploration corpus of approximately 25 million tokens and, on this basis, proposed a curriculum-based continual pre-training strategy that organizes the training data into three stages: terminology, mechanisms, and cases. Coupled with gradual unfreezing of Transformer blocks and learning-rate scheduling, we continually pre-trained Qwen3-1.7B to achieve stage-wise domain adaptation, yielding a mineral-exploration-oriented LLM, Geo-MineLLM. During inference, we integrated a Hybrid RAG framework that leverages hybrid retrieval and evidence-constrained generation to enhance factual consistency. Human evaluation indicates that Geo-MineLLM substantially improves domain question-answering performance relative to both the base model and larger-parameter models in the same family. With Hybrid RAG enabled, overall domain QA performance approaches that of GPT-4.1. The proposed training-inference integrated framework provides a lightweight pathway for building mineral-exploration LLMs and enabling reliable domain-specific question answering.
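The staged adaptation described above (three curriculum stages with progressively more Transformer blocks unfrozen and a decayed learning rate per stage) can be sketched as a simple schedule function. This is a minimal illustrative sketch only: the stage names follow the abstract, but the block count, learning rates, and unfreezing fractions are hypothetical assumptions, not the paper's actual settings.

```python
# Illustrative schedule for curriculum-based continual pre-training:
# later stages unfreeze deeper into the network while the peak
# learning rate decays, echoing gradual-unfreezing practice
# (Howard & Ruder, 2018). All numeric values are assumptions.

STAGES = ["terminology", "mechanisms", "cases"]

def unfreeze_plan(stage_idx, n_blocks=28, base_lr=1e-4, decay=0.5):
    """Return (trainable_block_ids, stage_lr) for a curriculum stage.

    Stage 0 trains only the top third of blocks, stage 1 the top two
    thirds, and stage 2 all blocks; the peak LR is halved per stage
    to keep continual pre-training stable.
    """
    if not 0 <= stage_idx < len(STAGES):
        raise ValueError("stage_idx out of range")
    frac = (stage_idx + 1) / len(STAGES)           # fraction of blocks to train
    n_trainable = max(1, round(n_blocks * frac))
    trainable = list(range(n_blocks - n_trainable, n_blocks))  # topmost blocks
    stage_lr = base_lr * (decay ** stage_idx)
    return trainable, stage_lr
```

In a real training loop, the returned block ids would be used to set `requires_grad` on the corresponding parameter groups before each stage begins.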
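The hybrid retrieval step in a Hybrid RAG pipeline typically merges a lexical (e.g. BM25) ranking with a dense-embedding ranking. A standard way to do this is Reciprocal Rank Fusion (Cormack et al., 2009); the sketch below shows RRF over two illustrative ranked lists, without claiming it is the paper's exact fusion scheme.

```python
# Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
# over the ranked lists it appears in (rank starts at 1), and the
# fused ranking sorts by that score, highest first.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical doc ids: a lexical ranking and a dense ranking.
lexical = ["d3", "d1", "d2"]
dense = ["d1", "d4", "d3"]
fused = rrf_fuse([lexical, dense])
```

Here `d1` wins because it ranks highly in both lists, even though neither retriever placed it first alone; the fused list would then feed the evidence-constrained generation step.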
National Natural Science Foundation of China (42301492)
National Natural Science Foundation of China (42571487)
National Key Research and Development Program of China (2023YFC2906404)
National Key Research and Development Program of China (2023YFC2906400)