
Dense X Retrieval: What Retrieval Granularity Should We Use?

Authors: Tong Chen ; Hongwei Wang ; Sihao Chen ; Wenhao Yu ; Kaixin Ma ; Xinran Zhao ; Hongming Zhang ; Dong Yu

Summary: Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.


Q: What problem does this paper try to solve?

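A: The paper examines how the choice of retrieval granularity affects both retrieval performance and downstream task performance when dense retrieval is used in open-domain natural language processing (NLP) tasks. Specifically, it addresses the following key questions: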

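Choice of retrieval unit: In practice, dense retrieval models are usually trained on a predetermined retrieval unit (such as documents, fixed-length passages, or sentences). The paper studies how choosing a different retrieval unit at inference time (passage, sentence, or proposition) affects retrieval performance.
Propositions as a new retrieval unit: In contrast to the traditional passage or sentence units, the paper proposes using propositions as a new retrieval unit. A proposition is defined as an atomic expression within text; each one encapsulates a distinct factoid and is presented in a concise, self-contained natural language format.
Effect of retrieval granularity on downstream tasks: The paper studies how different retrieval granularities affect downstream question answering (QA) performance, particularly under a limited input token budget.
Generalization of retrieval performance: The paper explores how well dense retrievers generalize across tasks at different retrieval granularities, especially on tasks outside the retriever's training datasets.
Information density and retrieval efficiency: The paper analyzes how retrieval units differ in the density of question-relevant information they provide, and how this difference affects retrieval efficiency and downstream task performance.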
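To make the three retrieval units concrete, here is a minimal sketch that indexes one toy passage at passage, sentence, and proposition granularity and retrieves the best unit for a query. Everything in it is an assumption for illustration: the passage, the hand-written propositions (the paper generates propositions with a model, which is not reproduced here), the naive sentence splitter, and the bag-of-words cosine scorer that stands in for a learned dense retriever.

```python
import math
import re
from collections import Counter

# Toy passage, chosen only for illustration.
PASSAGE = (
    "The Leaning Tower of Pisa was completed in 1372. "
    "It began to tilt during construction because of soft ground."
)

# Hand-written propositions: atomic, self-contained facts with pronouns resolved.
# These are NOT the output of the paper's proposition-extraction model.
PROPOSITIONS = [
    "The Leaning Tower of Pisa was completed in 1372.",
    "The Leaning Tower of Pisa began to tilt during construction.",
    "The Leaning Tower of Pisa tilts because of soft ground.",
]

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(text):
    """Stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def top_unit(query, units):
    """Return the single unit most similar to the query."""
    q = embed(query)
    return max(units, key=lambda u: cosine(q, embed(u)))

# The same text indexed at three granularities.
INDEXES = {
    "passage": [PASSAGE],
    "sentence": split_sentences(PASSAGE),
    "proposition": PROPOSITIONS,
}

QUERY = "When did the Tower of Pisa begin to tilt?"
for name, units in INDEXES.items():
    print(name, "->", top_unit(QUERY, units))
```

In this toy setup the sentence index returns the first sentence, because the second sentence refers to the tower only as "It" and shares few words with the query, while the proposition index returns the self-contained fact about tilting during construction. That outcome is a property of this small example and scorer, not a reproduction of the paper's results.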

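Through empirical comparison, the paper finds that proposition-based retrieval significantly outperforms traditional passage- or sentence-based methods on both dense retrieval and downstream QA tasks, especially in terms of generalization. Proposition retrieval also improves downstream QA performance because the retrieved text is more condensed and contains more question-relevant information, which reduces interference from irrelevant content.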
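The point about condensed text under a limited input budget can be illustrated with a small sketch. It assumes a word-count budget and a greedy packing rule that stops at the first unit that does not fit. Both are simplifications chosen here, not the paper's exact protocol, and the ranked units are made-up examples.

```python
def pack_context(ranked_units, budget_words):
    """Take ranked units in order until the next one would exceed the word budget."""
    picked, used = [], 0
    for unit in ranked_units:
        n = len(unit.split())
        if used + n > budget_words:
            break
        picked.append(unit)
        used += n
    return picked

# Hypothetical retrieval results, already sorted by relevance.
ranked_passages = [
    "The Leaning Tower of Pisa was completed in 1372. "
    "It began to tilt during construction because of soft ground.",
    "Pisa is a city in Tuscany, Italy. It is known for its cathedral square.",
]
ranked_propositions = [
    "The Leaning Tower of Pisa began to tilt during construction.",
    "The Leaning Tower of Pisa tilts because of soft ground.",
    "The Leaning Tower of Pisa was completed in 1372.",
]

BUDGET = 20  # words allowed in the reader's input (illustrative value)
print(len(pack_context(ranked_passages, BUDGET)), "passage(s) fit")
print(len(pack_context(ranked_propositions, BUDGET)), "proposition(s) fit")
```

With the same budget, finer-grained units let the reader see more separately ranked facts, whereas a passage spends part of the budget on whatever else it happens to contain.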

Q: What related work is there?

A: The paper mentions several research areas and specific works related to dense retrieval, including:

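Dual-encoder architecture: Many recent dense retrievers adopt a dual-encoder architecture, in which each query and document is encoded into a low-dimensional feature vector and their relevance is measured by a non-parametric similarity function. Related work includes Yih et al. (2011), Reimers and Gurevych (2019), Karpukhin et al. (2020), and Ni et al. (2022).
Multi-vector retrieval: Some studies improve model expressiveness and retrieval granularity by learning to encode each candidate retrieval unit into multiple vectors, for example ColBERT (Khattab and Zaharia, 2020), DensePhrase (Lee et al., 2021a,b), ME-BERT (Luan et al., 2021), and MVR (Zhang et al., 2022).
Retrieval-augmented generation: This is an emerging paradigm for open-domain question answering in which a retrieval model first retrieves text units, which are then passed together with the query as input to a generative model to produce the final answer. Related work includes Lewis et al. (2021), Jiang et al. (2023), and Asai et al. (2023).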
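The three ideas above differ mainly in how a relevance score is computed and what happens to the retrieved text afterwards. The sketch below shows one possible form of each: an inner-product score over single vectors (dual encoder), a sum-of-max score over multiple vectors in the style of ColBERT's late interaction, and a simple reader input built from retrieved units. The vectors are hand-written placeholders for encoder outputs, and the function names and prompt layout are assumptions of this sketch, not taken from any of the cited systems.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def single_vector_score(query_vec, unit_vec):
    """Dual-encoder relevance: one vector per side, compared by inner product."""
    return dot(query_vec, unit_vec)

def maxsim_score(query_vecs, unit_vecs):
    """Multi-vector relevance (ColBERT-style): for each query vector,
    take its best match among the unit's vectors, then sum the maxima."""
    return sum(max(dot(q, d) for d in unit_vecs) for q in query_vecs)

def retrieve(query_vec, index, k=1):
    """Rank (text, vector) pairs by single-vector score and return the top-k texts."""
    ranked = sorted(
        index,
        key=lambda item: single_vector_score(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_reader_input(question, retrieved_units):
    """Retrieval-augmented generation input: retrieved text followed by the question.
    The layout here is a hypothetical example format."""
    context = "\n".join(retrieved_units)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Placeholder 2-d vectors standing in for encoder outputs.
index = [("unit A", [0.9, 0.1]), ("unit B", [0.2, 0.8])]
top = retrieve([1.0, 0.0], index, k=1)
print(top)
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]))
print(build_reader_input("Which unit matches?", top))
```

Other multi-vector models aggregate their vectors differently, so `maxsim_score` should be read as one representative scoring rule rather than a description of every system listed.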