Benchmarking Large Language Models in Retrieval-Augmented Generation
https://papers.cool/arxiv/2309.01431
Authors: Jiawei Chen ; Hongyu Lin ; Xianpei Han ; Le Sun
Summary: Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. We then evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. The evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly with negative rejection, information integration, and dealing with false information. These assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
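To make the idea of a noise-robustness testbed more concrete, here is a minimal Python sketch. It is not the RGB implementation: the `Instance` fields, the `noise_ratio` parameter, the substring-match scoring rule, and the toy `first_doc_model` are all assumptions chosen for illustration. The sketch mixes answer-bearing and noisy documents at a chosen ratio and measures how often a model's reply still contains the gold answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One hypothetical test case (field names are assumptions, not RGB's schema)."""
    question: str
    answer: str           # gold answer string
    positive_docs: list   # documents that contain the answer
    noise_docs: list      # on-topic documents that do not contain the answer


def build_context(inst, noise_ratio, total_docs, rng):
    """Mix noisy and positive documents at roughly the requested ratio."""
    n_noise = min(round(total_docs * noise_ratio), len(inst.noise_docs))
    n_pos = min(total_docs - n_noise, len(inst.positive_docs))
    docs = rng.sample(inst.noise_docs, n_noise) + rng.sample(inst.positive_docs, n_pos)
    rng.shuffle(docs)
    return docs


def accuracy(instances, model, noise_ratio, total_docs=2, seed=0):
    """Fraction of replies containing the gold answer (assumed scoring rule)."""
    rng = random.Random(seed)
    hits = 0
    for inst in instances:
        docs = build_context(inst, noise_ratio, total_docs, rng)
        reply = model(inst.question, docs)
        hits += inst.answer.lower() in reply.lower()
    return hits / len(instances)


def first_doc_model(question, docs):
    """Toy stand-in for an LLM: it simply echoes the first retrieved document."""
    return docs[0] if docs else ""


if __name__ == "__main__":
    data = [
        Instance(
            question="Who wrote Hamlet?",
            answer="Shakespeare",
            positive_docs=["Hamlet was written by Shakespeare.",
                           "Shakespeare is the author of Hamlet."],
            noise_docs=["Macbeth is a tragedy set in Scotland.",
                        "The Globe Theatre opened in 1599."],
        )
    ]
    for ratio in (0.0, 0.5, 1.0):
        print(ratio, accuracy(data, first_doc_model, ratio))
```

Replacing `first_doc_model` with a call to a real LLM, and sweeping `noise_ratio`, would give a rough picture of how accuracy degrades as more of the retrieved context is noise. A negative-rejection test could reuse the same harness with `noise_ratio=1.0` and count replies that decline to answer instead.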
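A: The problem this paper sets out to address is how to evaluate and improve the performance of large language models (LLMs) in Retrieval-Augmented Generation (RAG). Specifically, the paper focuses on the following aspects: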
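Overall, the paper's goal is to use systematic evaluation and analysis to guide improvements to LLMs in RAG and to advance the development of related techniques.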
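A: The paper refers to a number of studies related to Retrieval-Augmented Generation (RAG) and large language models (LLMs), mainly covering the following aspects: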