RGB

Benchmarking Large Language Models in Retrieval-Augmented Generation

Authors: Jiawei Chen ; Hongyu Lin ; Xianpei Han ; Le Sun

Summary: Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. We then evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. The evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly with negative rejection, information integration, and dealing with false information. These assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

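Q: What problem does this paper try to solve?

A: The paper addresses how to evaluate and improve the performance of large language models (LLMs) under Retrieval-Augmented Generation (RAG). Specifically, it focuses on the following aspects: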

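Evaluating the impact of RAG on LLMs: Existing research lacks a rigorous evaluation of how RAG affects different LLMs, which makes it hard to identify the potential bottlenecks of RAG for each model.
Establishing an evaluation benchmark: To assess the RAG abilities of LLMs comprehensively, the authors create a new benchmark, the Retrieval-Augmented Generation Benchmark (RGB), which covers RAG evaluation in both English and Chinese.
Analyzing the fundamental abilities of LLMs: RGB tests LLMs on four fundamental abilities: Noise Robustness, Negative Rejection, Information Integration, and Counterfactual Robustness.
Diagnosing the challenges of current LLMs: By evaluating 6 representative LLMs, the authors reveal the challenges current LLMs face when applying RAG, particularly in negative rejection, information integration, and handling false information.
Suggesting directions for improvement: The evaluation shows that although LLMs exhibit a certain degree of noise robustness, they still face significant challenges in the other three abilities. This underscores that important problems remain to be solved, and that the use of RAG must be designed carefully, before it can be applied to LLMs effectively.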

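Overall, the paper aims to guide the improvement of LLMs under RAG through systematic evaluation and analysis, and to advance the development of related techniques.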
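
To make the testbed idea concrete, here is a minimal Python sketch of how a noise-robustness or negative-rejection test case might be assembled and scored. It is an illustration under stated assumptions, not the authors' implementation: the names `Instance`, `build_context`, `is_correct`, and `is_rejection` are hypothetical, and both the substring-match scoring and the rejection marker string are assumptions that this summary does not confirm. Information integration and counterfactual robustness are not covered.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """One hypothetical benchmark item (field names are assumptions)."""
    question: str
    answers: List[str]        # acceptable gold answers
    positive_docs: List[str]  # retrieved documents that contain the answer
    noise_docs: List[str]     # related documents that do not contain the answer


def build_context(inst: Instance, num_docs: int, noise_ratio: float,
                  rng: random.Random) -> List[str]:
    """Mix positive and noise documents for one test case.

    noise_ratio between 0 and 1 probes noise robustness;
    noise_ratio == 1.0 yields a negative-rejection case (no answer available).
    """
    num_noise = min(round(num_docs * noise_ratio), len(inst.noise_docs))
    num_pos = min(num_docs - num_noise, len(inst.positive_docs))
    docs = rng.sample(inst.positive_docs, num_pos) + rng.sample(inst.noise_docs, num_noise)
    rng.shuffle(docs)  # avoid leaking document type through position
    return docs


def is_correct(response: str, answers: List[str]) -> bool:
    """Assumed scoring rule: correct if the response contains any gold answer."""
    text = response.lower()
    return any(answer.lower() in text for answer in answers)


def is_rejection(response: str, marker: str = "insufficient information") -> bool:
    """Assumed rejection check: the response contains an agreed refusal phrase."""
    return marker in response.lower()
```

In this sketch, accuracy on the noise-robustness testbed would be the fraction of responses for which `is_correct` is true, and the rejection rate on the negative-rejection testbed would be the fraction for which `is_rejection` is true.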

Q: What related work is there?

A: The paper cites prior work on Retrieval-Augmented Generation (RAG) and large language models (LLMs), mainly in the following areas:

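Development of RAG methods: The paper cites Guu et al. (2020), Lewis et al. (2020), Borgeaud et al. (2022), Izacard et al. (2022), Shi et al. (2023), and Ren et al. (2023), which explore how retrieving external knowledge can improve the accuracy and reliability of LLMs.
Evaluation of LLMs: The paper discusses the importance of evaluating LLMs and mentions benchmarks such as GLUE (Wang et al. 2019b), SuperGLUE (Wang et al. 2019a), and MMLU (Hendrycks et al. 2021), which measure abilities such as natural language understanding.
General-purpose evaluation of LLMs: As LLMs have developed, a series of general evaluation benchmarks has emerged, such as AGIEval (Zhong et al. 2023), C-Eval (Huang et al. 2023), AlpacaEval (Li et al. 2023b), and the OpenLLM Leaderboard (Edward Beeching 2023), which assess the general capabilities of LLMs.
Evaluation of specific abilities: Beyond general capability, some studies evaluate particular abilities of models. For example, CValues (Xu et al. 2023a) focuses on the safety and responsibility of LLMs, M3Exam (Zhang et al. 2023) focuses on human exams, and ToolBench (Qin et al. 2023) evaluates how well LLMs use external tools.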