Enabling Large Language Models to Generate Text with Citations
https://papers.cool/arxiv/2305.14627
Authors: Tianyu Gao; Howard Yen; Jiatong Yu; Danqi Chen
Summary: Large language models (LLMs) have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement -- for example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.
A: This paper addresses the problem of hallucination: large language models (LLMs) are prone to generating false information. The authors' goal is to enable LLMs to provide citations in their generated text, improving its factual correctness and verifiability. Existing work relies mainly on commercial search engines and human evaluation, which makes it hard to reproduce and compare different modeling approaches. To address this, the authors propose ALCE (Automatic LLMs' Citation Evaluation), the first benchmark for automatically evaluating citation generation by LLMs. ALCE collects a diverse set of questions and retrieval corpora, and requires building end-to-end systems that retrieve supporting evidence and generate answers with citations. The authors develop automatic metrics along three dimensions (fluency, correctness, and citation quality) and demonstrate their strong correlation with human judgements. Through experiments with state-of-the-art LLMs and novel prompting strategies, the authors show that current systems still have considerable room for improvement in correctness and citation quality.
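The citation-quality dimension can be sketched in code. A minimal, hedged illustration follows: the real ALCE benchmark scores citation recall and precision with a trained natural language inference (entailment) model, whereas the `entails` function below is a hypothetical toy stand-in, and `citation_precision` is a simplification of the paper's full definition (which also credits citations whose removal breaks set-level support).

```python
# Sketch of ALCE-style citation-quality scoring.
# Assumption: `entails` is a toy substring heuristic standing in for the
# NLI model ALCE actually uses; names here are illustrative, not the
# benchmark's real API.

def entails(premise: str, claim: str) -> bool:
    # Hypothetical entailment check for illustration only.
    return claim.lower() in premise.lower()

def citation_recall(statements):
    """Fraction of generated statements fully supported by the
    concatenation of their cited passages.
    `statements` is a list of (claim, [cited_passages]) pairs."""
    if not statements:
        return 0.0
    supported = sum(
        1 for claim, passages in statements
        if passages and entails(" ".join(passages), claim)
    )
    return supported / len(statements)

def citation_precision(statements):
    """Simplified precision: fraction of individual citations that
    each entail their claim on their own."""
    total = relevant = 0
    for claim, passages in statements:
        for passage in passages:
            total += 1
            relevant += entails(passage, claim)
    return relevant / total if total else 0.0

statements = [
    ("paris is the capital of france", ["Paris is the capital of France."]),
    ("the moon is made of cheese", ["The Moon orbits Earth."]),
]
print(citation_recall(statements))     # one of two claims is supported
print(citation_precision(statements))  # one of two citations entails its claim
```

Both scores come out to 0.5 on this toy input: the second claim has a citation, but the cited passage does not support it, which is exactly the failure mode the abstract reports for half of ELI5 answers.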
A: Related research mainly focuses on the following areas:
These lines of work provide the background and technical foundation for the ALCE benchmark proposed in this paper, and also illustrate the progress made on enabling LLMs to generate text with citations.