Legal question answering systems based on pre-trained language models struggle to flexibly understand users' intent and lack the integration of external knowledge, making it difficult to achieve the desired results. To address this, this paper constructs a fine-grained legal question answering dataset based on a criminal law article library (FCL-QA). Building on FCL-QA, this paper proposes a Statutory Articles Retrieval Augmented Question Answering Framework (SaRAF) based on large language models. The core idea is to locate the category of the question through multi-level classification, use the category to narrow the scope of candidate statutory articles and thereby facilitate retrieval, and finally generate the answer with a large language model. Experimental results show that SaRAF outperforms both the generation method without statutory articles and the Retrieval-Augmented Generation (RAG) method, achieving a ROUGE-L F1 score of 42.27%, a BLEU-4 score of 27.78%, and a BERTScore of 72.52% on the FCL-QA dataset.
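To make the three-stage pipeline concrete, the following Python sketch illustrates the flow from classification to retrieval to generation. Every name in it (Article, classify_topic, llm_generate, the character-overlap similarity) is a hypothetical placeholder introduced for illustration, and the keyword classifier and overlap retriever are toy stand-ins rather than the components actually used in SaRAF.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Article:
    article_id: str   # e.g. an article number in the criminal law article library
    category: str     # coarse-grained topic label from the multi-level taxonomy
    text: str

def classify_topic(question: str, categories: List[str]) -> str:
    # Step 1: multi-level classification locates the category of the question.
    # A toy keyword match stands in for the real classifier.
    for category in categories:
        if category in question:
            return category
    return categories[0]

def similarity(query: str, doc: str) -> float:
    # Toy character-overlap score standing in for a learned retriever.
    q, d = set(query), set(doc)
    return len(q & d) / max(len(q | d), 1)

def retrieve_articles(question: str, category: str,
                      library: List[Article], k: int = 3) -> List[Article]:
    # Step 2: the predicted category narrows the candidate set before ranking.
    candidates = [a for a in library if a.category == category]
    return sorted(candidates, key=lambda a: similarity(question, a.text),
                  reverse=True)[:k]

def answer(question: str, library: List[Article], categories: List[str],
           llm_generate: Callable[[str], str]) -> str:
    # Step 3: the LLM generates the answer from the question plus retrieved articles.
    category = classify_topic(question, categories)
    articles = retrieve_articles(question, category, library)
    context = "\n".join(f"{a.article_id}: {a.text}" for a in articles)
    prompt = f"Statutory articles:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_generate(prompt)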
Currently, mainstream legal question answering datasets include the multiple-choice judicial examination dataset JEC-QA (Judicial Examination of China Question Answering) [3], the extractive legal reading comprehension dataset CJRC (Chinese Judicial Reading Comprehension) [4], and judicial summarization datasets that combine retrieval with generation. Compared with the open domain, research on legal question answering lags behind. On the one hand, existing legal question answering datasets differ considerably from real user needs: in realistic scenarios the user's input is usually nothing more than a question, and what the user expects is a fluent reply in natural language. On the other hand, mainstream legal question answering datasets lack the guidance of legal knowledge. In the highly specialized legal domain, statutes and articles are a powerful tool for improving generation quality and for supervising models to produce accurate and reliable responses, yet existing datasets rarely annotate the correspondence among questions, answers, and statutory articles. In addition, because the construction of legal datasets relies on manual annotation by legal experts, high-quality Chinese legal question answering datasets remain scarce.
In the question answering field, the development of deep learning has made generative question answering possible. Tan et al. [6] proposed the S-Net (Synthesis Network) model, which adopts a paradigm combining extraction and generation and uses a sequence-to-sequence (Seq2Seq) model for answer generation. The emergence of pre-trained language models brought a leap forward for question answering systems. Karpukhin et al. [7] proposed the DPR (Dense Passage Retriever) model, which performs retrieval by fine-tuning BERT (Bidirectional Encoder Representations from Transformers) and far surpasses the BM25 (Best Matching 25) algorithm. Garg et al. [8] proposed the TANDA (Transfer and Adapt) model, which improves performance and robustness by fine-tuning the pre-trained model twice. Roberts et al. [9] fine-tuned the T5 (Text-To-Text Transfer Transformer) model in a closed-book setting, feeding in the question directly and obtaining the corresponding answer. Hsu et al. [1] proposed the GenQA (Generative Question Answering) model, which builds on T5 and generates answers by jointly exploiting the information in the question and in the candidate answers.
Large language models have demonstrated outstanding performance on generative tasks; a representative work is the LLaMA (Large Language Model Meta AI) model proposed by Touvron et al. [13]. LLaMA adopts the Transformer [14] architecture and predicts the probability of each word or token being the next word or token. Trained on 1.4×10¹² tokens, LLaMA achieves strong performance, with excellent results in commonsense reasoning, closed-book question answering, reading comprehension, mathematical reasoning, code generation, and massive multitask language understanding.
After LLaMA was released, many works built on the LLaMA framework, and a large number of open-source large language models have been published. Taori et al. [15] proposed the Alpaca model, which further fine-tunes LLaMA on instruction data and reaches a level comparable to GPT-3.5 (Generative Pre-trained Transformer 3.5). Chiang et al. [16] proposed the Vicuna model, which performs instruction fine-tuning on data collected from the ShareGPT website and approaches GPT-level capability at low cost. Bai et al. [17] proposed the Qwen model, which is pre-trained on up to 3 trillion tokens, providing the model with a reliable source of knowledge. Outside the LLaMA framework, Du et al. [18] proposed the ChatGLM (Chat Generative Language Model) model, which is specifically optimized for Chinese question answering and dialogue and generates responses that align well with human preferences.
This paper splits the data into training and validation sets at a ratio of 9:1. For the topic prediction task, the training set is obtained by grouping the training samples according to their coarse-grained category labels. The validation sets for the topic prediction and article retrieval tasks are each derived from the output of the preceding step in the pipeline. Table 1 reports the experimental results with the ChatGLM3-6B (Chat Generative Language Model 3-6 Billion Parameters) model; results vary somewhat across different models.
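As an illustration of this setup, the 9:1 split and the per-category regrouping could be written as in the sketch below; the field name "coarse_label", the random seed, and the use of scikit-learn are assumptions made for the example, not the paper's actual preprocessing code.

from collections import defaultdict
from sklearn.model_selection import train_test_split

def split_and_group(samples):
    # 9:1 split into training and validation sets (seed chosen arbitrarily here)
    train, valid = train_test_split(samples, test_size=0.1, random_state=42)
    # Topic prediction training set: group training samples by coarse category label.
    by_category = defaultdict(list)
    for sample in train:
        by_category[sample["coarse_label"]].append(sample)
    return train, valid, by_category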
[4] DUAN X Y, WANG B X, WANG Z Y, et al. CJRC: A Reliable Human-annotated Benchmark Dataset for Chinese Judicial Reading Comprehension[M]//Lecture Notes in Computer Science. Cham: Springer International Publishing, 2019: 439-451. DOI: 10.1007/978-3-030-32381-3_36.
[5] LOUIS A, VAN DIJCK G, SPANAKIS G. Interpretable Long-form Legal Question Answering with Retrieval-augmented Large Language Models[J]. Proc AAAI Conf Artif Intell, 2024, 38(20): 22266-22275. DOI: 10.1609/aaai.v38i20.30232.
[6] TAN C Q, WEI F R, YANG N, et al. S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension[J]. Proc AAAI Conf Artif Intell, 2018, 32(1). DOI: 10.1609/aaai.v32i1.12035.
[7] KARPUKHIN V, OĞUZ B, MIN S, et al. Dense Passage Retrieval for Open-domain Question Answering[EB/OL]. (2020-04-10)[2024-07-21].
[8] GARG S, VU T, MOSCHITTI A. TANDA: Transfer and Adapt Pre-trained Transformer Models for Answer Sentence Selection[J]. Proc AAAI Conf Artif Intell, 2020, 34(5): 7780-7788. DOI: 10.1609/aaai.v34i05.6282.
[9] ROBERTS A, RAFFEL C, SHAZEER N. How Much Knowledge Can You Pack into the Parameters of a Language Model?[EB/OL]. (2020-02-10)[2024-07-21].
[10] WANG Z Y, WANG B X, DUAN X Y, et al. IFlyLegal: A Chinese Legal System for Consultation, Law Searching, and Document Analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019: 97-102. DOI: 10.18653/v1/d19-3017.
[11] HOPPE C, PELKMANN D, MIGENDA N, et al. Towards Intelligent Legal Advisors for Document Retrieval and Question-answering in German Legal Documents[C]//2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). New York: IEEE, 2021: 29-32. DOI: 10.1109/AIKE52691.2021.00011.
[12] KIEN P M, NGUYEN H T, BACH N X, et al. Answering Legal Questions by Learning Neural Attentive Text Representation[C]//Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg, PA, USA: International Committee on Computational Linguistics, 2020: 988-998. DOI: 10.18653/v1/2020.coling-main.86.
[13] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and Efficient Foundation Language Models[EB/OL]. (2023-02-27)[2024-07-21].
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention Is All You Need[EB/OL]. (2017-06-12)[2024-07-21].
[15] TAORI R, GULRAJANI I, ZHANG T, et al. Alpaca: A Strong, Replicable Instruction-following Model[EB/OL]. (2023-03-13)[2024-07-21].
[16] CHIANG W L, LI Z, LIN Z, et al. Vicuna: An Open-source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality[EB/OL]. (2023-03-30)[2024-07-21].
[18] DU Z X, QIAN Y J, LIU X, et al. GLM: General Language Model Pretraining with Autoregressive Blank Infilling[EB/OL]. (2021-03-18)[2024-07-21].
[19] HU E J, SHEN Y L, WALLIS P, et al. LoRA: Low-rank Adaptation of Large Language Models[EB/OL]. (2021-06-17)[2024-07-21].
[20] CAI X X, XIAO M, NING Z Y, et al. Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation[C]//2023 IEEE International Conference on Data Mining Workshops (ICDMW). New York: IEEE, 2023: 1424-1429. DOI: 10.1109/ICDMW60847.2023.00181.
[21] REIMERS N, GUREVYCH I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-networks[EB/OL]. (2019-08-27)[2024-07-21].