https://papers.cool/arxiv/2312.06648
Authors: Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu
Summary: Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.
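The following is a minimal sketch of the idea of choosing a retrieval unit: the same text is indexed at passage, sentence, and proposition granularity and queried with a dense bi-encoder. It assumes the sentence-transformers library with a generic all-MiniLM-L6-v2 model (not the retrievers evaluated in the paper), and the example passage and hand-written propositions are illustrative only; the paper derives propositions automatically with a trained propositionizer.

# Sketch: indexing one text at different retrieval granularities and
# querying with a dense retriever. Assumes sentence-transformers and a
# generic bi-encoder; the paper's actual retrievers and corpus differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

passage = (
    "The Leaning Tower of Pisa is a freestanding bell tower in Pisa, Italy. "
    "Construction began in 1173 and the tower leans because of soft ground."
)

# Three ways to index the same content (retrieval units).
units = {
    "passage": [passage],
    "sentence": [
        "The Leaning Tower of Pisa is a freestanding bell tower in Pisa, Italy.",
        "Construction began in 1173 and the tower leans because of soft ground.",
    ],
    # Hand-written propositions: atomic, self-contained factoids
    # (the paper generates these with a learned propositionizer).
    "proposition": [
        "The Leaning Tower of Pisa is a freestanding bell tower.",
        "The Leaning Tower of Pisa is located in Pisa, Italy.",
        "Construction of the Leaning Tower of Pisa began in 1173.",
        "The Leaning Tower of Pisa leans because it was built on soft ground.",
    ],
}

query = "Why does the Tower of Pisa lean?"
q_emb = model.encode(query, convert_to_tensor=True)

for granularity, corpus in units.items():
    emb = model.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, emb)[0]      # cosine similarity per unit
    best = int(scores.argmax())
    print(f"[{granularity}] top hit ({float(scores[best]):.3f}): {corpus[best]}")

The sketch illustrates the trade-off the paper studies: the top-ranked proposition carries only the question-relevant fact, so a downstream reader receives fewer and denser tokens than with a passage-level hit.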
A: This paper investigates how the choice of retrieval granularity affects both retrieval performance and downstream task performance when dense retrieval is used in open-domain NLP tasks. Specifically, it asks in which unit (e.g. document, passage, sentence, or proposition) a retrieval corpus should be indexed, and how this choice impacts retrieval and downstream QA.
Through empirical comparison, the paper finds that proposition-based retrieval significantly outperforms traditional passage- or sentence-based methods on both dense retrieval and downstream QA tasks, especially in terms of generalization. Moreover, retrieving by proposition improves downstream QA performance because the retrieved text is more compact and denser in question-relevant information, reducing interference from irrelevant content.
A: The paper discusses several research areas and specific works related to dense retrieval, including: