RAGAS: Automated Evaluation of Retrieval Augmented Generation

Authors: Shahul Es; Jithin James; Luis Espinosa-Anke; Steven Schockaert

Summary: We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

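Q: What problem does this paper try to solve?

A: The paper introduces RAGAS (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) systems. A RAG system consists of a retrieval module and a generation module based on a large language model (LLM). It supplies the LLM with knowledge from a reference text database, so that the LLM can act as a natural language layer between the user and the textual database, which reduces the risk of hallucinations. Evaluating RAG architectures is challenging, however, because several dimensions must be considered: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit those passages faithfully, and the quality of the generation itself. RAGAS proposes a suite of metrics that evaluate these dimensions without relying on ground truth human annotations. The authors argue that this enables faster evaluation cycles for RAG architectures, which matters especially given the fast adoption of LLMs.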

Q: What related work is there?

A: The related work falls mainly into the following areas:

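Estimating faithfulness with LLMs: Detecting hallucinations in LLM-generated responses has been studied extensively. Some approaches use a few-shot prompting strategy to predict factuality. However, existing models struggle to detect hallucinations under standard prompting strategies. Other approaches link the generated response to facts in an external knowledge base, but this is not always possible. Another strategy inspects the probabilities the model assigns to individual tokens, on the expectation that the model is less confident in hallucinated answers than in factual ones.
Automated evaluation of text generation systems: LLMs have also been used to automatically evaluate other aspects of generated text, such as fluency. For example, GPTScore uses a prompt that specifies the aspect under consideration and then scores a passage by the average probability of its tokens under a given autoregressive LLM. Other approaches ask ChatGPT directly to evaluate a specific aspect of a given answer, by providing a score from 0 to 100 or a 5-star rating.
Evaluation strategy: When a RAG system is being built, there is usually no human-annotated dataset or reference answer available. The authors therefore focus on metrics that are fully self-contained and reference-free. They concentrate on three quality aspects: faithfulness, answer relevance, and context relevance. Each of these can be measured in a fully automated way by prompting an LLM.
WikiEval dataset: To evaluate the proposed framework, the authors created a new dataset, WikiEval, which contains question-context-answer triples with human judgements. The dataset is used to check how well the proposed metrics agree with human assessments of faithfulness, answer relevance, and context relevance.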

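This related work provides the background and foundation for developing the RAGAS framework, and it also reflects recent progress in the automated evaluation of RAG systems.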
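
The token-probability idea mentioned above (checking model confidence on individual tokens, and scoring a passage by its average token probability) can be illustrated with a minimal sketch. It assumes per-token probabilities are already available from some model. The function name `mean_log_prob` and the probability values are hypothetical, chosen only for illustration; this is not the implementation used by GPTScore or by the paper.

```python
import math


def mean_log_prob(token_probs):
    """Return the average log-probability of the tokens in a generated text.

    token_probs: list of probabilities in (0, 1], one per generated token.
    A higher (less negative) value means the model was more confident.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities, not real model output.
confident_answer = [0.9, 0.8, 0.95]
uncertain_answer = [0.4, 0.2, 0.3]

print("confident:", mean_log_prob(confident_answer))
print("uncertain:", mean_log_prob(uncertain_answer))
```

Under the assumption described in the related work, a hallucinated answer would tend to look like `uncertain_answer` and receive a lower score. Averaging over tokens keeps the score comparable across answers of different lengths.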

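Q: How does the paper solve this problem?

A: The paper addresses the evaluation of Retrieval Augmented Generation (RAG) systems through the following steps: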

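Proposing the RAGAS framework: RAGAS is a framework for evaluating RAG systems without reference answers. It provides a set of metrics covering different dimensions of a RAG system: the retrieval system's ability to identify relevant context, the LLM's ability to exploit that context, and the quality of the generation itself.
Defining the quality dimensions: The paper identifies three core quality dimensions: faithfulness, answer relevance, and context relevance. These dimensions help assess how a RAG system performs in different respects.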