Enabling Large Language Models to Generate Text with Citations

Authors: Tianyu Gao ; Howard Yen ; Jiatong Yu ; Danqi Chen

Summary: Large language models (LLMs) have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions (fluency, correctness, and citation quality) and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement. For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.

Q: What problem does this paper try to solve?

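A: The paper addresses the tendency of large language models (LLMs) to hallucinate, that is, to generate incorrect information. The authors aim to have LLMs provide citations alongside the text they generate, improving its factual correctness and verifiability. Existing work relies mainly on commercial search engines and human evaluation, which makes it hard to reproduce and compare modeling approaches. To address this, the authors propose ALCE (Automatic LLMs' Citation Evaluation), the first benchmark for automatically evaluating citation generation by LLMs. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems that retrieve supporting evidence and generate answers with citations. The authors develop automatic metrics along three dimensions (fluency, correctness, and citation quality) and show that these metrics correlate strongly with human judgments. Experiments with state-of-the-art LLMs and novel prompting strategies show that current systems still have considerable room for improvement in correctness and citation quality.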
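To make the idea of scoring citation quality concrete, here is a minimal sketch of a statement-level support rate. It is not ALCE's actual metric, whose definition is not given in this summary. The bracketed citation format (`[1]`, `[2]`), the naive sentence splitter, and the names `split_statements`, `citation_support_rate`, and `toy_entails` are all assumptions made for illustration. In a real system the `entails` callback would wrap an entailment model rather than the word-overlap check used here.

```python
import re
from typing import Callable, List, Sequence

# Assumed citation format: bracketed 1-based passage indices such as "[1]".
CITATION = re.compile(r"\[(\d+)\]")


def split_statements(answer: str) -> List[str]:
    """Naively split an answer into statements at sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", answer)
    return [p.strip() for p in parts if p.strip()]


def citation_support_rate(
    answer: str,
    passages: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of statements whose cited passages, taken together, entail them.

    A statement with no valid citation counts as unsupported.
    """
    statements = split_statements(answer)
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        ids = [int(m) for m in CITATION.findall(statement)]
        # Ignore citation indices that do not point at a retrieved passage.
        cited = [passages[i - 1] for i in ids if 1 <= i <= len(passages)]
        claim = CITATION.sub("", statement).strip()
        if cited and entails(" ".join(cited), claim):
            supported += 1
    return supported / len(statements)


def toy_entails(premise: str, claim: str) -> bool:
    """Hypothetical stand-in for an entailment model: every claim word must occur in the premise."""
    premise_words = set(re.findall(r"[a-z]+", premise.lower()))
    claim_words = set(re.findall(r"[a-z]+", claim.lower()))
    return claim_words <= premise_words


if __name__ == "__main__":
    passages = ["Paris is the capital of France.", "The Louvre is in Paris."]
    answer = (
        "Paris is the capital of France [1]. "
        "The Louvre is in Berlin [2]. "
        "Paris is large."
    )
    # Only the first statement is both cited and supported by its passage.
    print(citation_support_rate(answer, passages, toy_entails))
```

Passing the entailment check as a callback keeps the scoring loop independent of whichever model is used to judge support.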

Q: What related work is there?

A: Related work falls mainly into the following areas:

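Text generation with citations: work on having language models cite sources as they generate text, improving the accuracy and verifiability of the information. For example, Rashkin et al. (2023) propose "Attributable to Identified Sources" (AIS) as a way to score how faithful generated text is to its sources.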
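Retrieval-augmented LMs: work on augmenting language models with externally retrieved information. For example, Guu et al. (2020) and Borgeaud et al. (2022) combine language models with retrieved passages during pre-training, while Khandelwal et al. (2020) and Zhong et al. (2022) augment the model by interpolating its output with that of a nearest-neighbor (kNN) module.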
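Interactive retrieval: work on letting the model invoke retrieval dynamically during generation. For example, Yao et al. (2023) and Schick et al. (2023) propose methods that allow the model to issue retrieval calls while it generates.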
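Long-form text generation and open-domain QA: work on producing long, coherent answers and on the open-domain QA pipelines that supply their evidence. For example, Karpukhin et al. (2020) propose dense passage retrieval for open-domain QA, and Petroni et al. (2021) introduce the KILT benchmark for knowledge-intensive language tasks.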
Improving language models: work on making language models perform better on specific tasks. For example, Izacard and Grave (2021) use retrieval-augmented generative models to improve open-domain QA.
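Evaluating language models: work on measuring how well language models perform on specific tasks. For example, Bohnet et al. (2022) apply AIS scoring to attributed question answering, and Honovich et al. (2022) study the evaluation of factual consistency.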
Information retrieval: work on efficiently retrieving relevant information from large collections. For example, Ni et al. (2022) propose large dual encoders (GTR) as generalizable retrievers.

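These studies provide the background and technical basis for the ALCE benchmark proposed in this paper, and they show the progress made so far on having LLMs generate text with citations.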
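The kNN interpolation attributed above to Khandelwal et al. (2020) is commonly written as p(y | x) = lam * p_kNN(y | x) + (1 - lam) * p_LM(y | x), where lam is a mixing weight between 0 and 1. The sketch below applies that formula to two toy next-token distributions. The function name `interpolate`, the dictionary representation, and the example numbers are assumptions for illustration only, not code from any of the cited papers.

```python
from typing import Dict


def interpolate(
    p_lm: Dict[str, float],
    p_knn: Dict[str, float],
    lam: float,
) -> Dict[str, float]:
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Tokens missing from one distribution have probability 0 there.
    vocab = set(p_lm) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_lm.get(token, 0.0)
        for token in vocab
    }


if __name__ == "__main__":
    p_lm = {"paris": 0.6, "london": 0.4}  # toy LM distribution
    p_knn = {"paris": 1.0}                # toy distribution from retrieved neighbors
    print(interpolate(p_lm, p_knn, lam=0.25))
```

Because both inputs sum to 1 and the weights sum to 1, the mixed distribution also sums to 1, so no renormalization is needed.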