Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

https://papers.cool/arxiv/2212.10509

Authors: Harsh Trivedi; Niranjan Balasubramanian; Tushar Khot; Ashish Sabharwal

Summary: Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at https://github.com/stonybrooknlp/ircot


Q: What problem does this paper try to solve?

A: The paper aims to improve the performance of large language models (LLMs) on knowledge-intensive multi-step question answering (QA) tasks. Specifically, it addresses the following challenges:

  1. Limited knowledge access: LLMs struggle on multi-step QA tasks when the required knowledge is not available in their parameters or is out of date.
  2. Combining retrieval with reasoning: the conventional one-step retrieve-and-read approach (using only the question as the query) is insufficient for multi-step QA, because later retrievals may depend on intermediate results derived during earlier reasoning.
  3. Reducing model hallucination: without support from external knowledge, the model may generate incorrect reasoning steps, leading to factual errors.

To address these problems, the paper proposes IRCoT (Interleaved Retrieval guided by Chain-of-Thought), which interleaves retrieval steps with CoT (Chain-of-Thought) reasoning steps so that the two guide each other: the CoT guides what to retrieve next, and the retrieved results in turn improve the CoT reasoning. This substantially improves both retrieval quality and downstream QA performance across multiple datasets.
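The interleaving loop described above can be sketched in a few lines. Note that `retrieve` and `generate_next_cot_sentence` below are hypothetical stand-ins for a real retriever (e.g., BM25) and a few-shot-prompted LLM; this is a minimal illustration of the control flow, not the paper's implementation:

```python
def retrieve(query, k=2):
    # Stand-in retriever: returns up to k "paragraphs" for a query.
    # A real system would query a BM25 index or similar.
    toy_corpus = {
        "Who wrote X": ["X was written by A."],
    }
    return toy_corpus.get(query, ["(no paragraph found)"])[:k]

def generate_next_cot_sentence(question, paragraphs, cot_so_far):
    # Stand-in LLM call: in IRCoT this would be a few-shot CoT prompt
    # over the question, all paragraphs retrieved so far, and the CoT
    # so far; here we return canned sentences to show the loop shape.
    if not cot_so_far:
        return "X was written by A."
    return "So the answer is: B."

def ircot(question, max_steps=4):
    # Step 0: initial retrieval uses the question itself as the query.
    paragraphs = retrieve(question)
    cot = []
    for _ in range(max_steps):
        # Reason: generate the next CoT sentence from question + paragraphs + CoT.
        sentence = generate_next_cot_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if "answer is" in sentence:  # simple termination check
            break
        # Retrieve: use the newly generated CoT sentence as the next query.
        paragraphs += retrieve(sentence)
    return cot, paragraphs
```

The key design point is that each newly generated CoT sentence becomes the next retrieval query, so retrieval can follow the reasoning chain instead of being fixed to the original question.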


Q: What related work does the paper discuss?

A: The paper situates IRCoT within several related research areas and specific works, mainly:

  1. Prompting for Open-Domain QA: using a small number of examples as prompts to get large language models (LLMs) to perform various tasks, including open-domain question answering (ODQA). These methods typically exploit the zero-shot or few-shot learning abilities of LLMs.
  2. Multi-Step Open-Domain QA: methods for multi-step open-domain QA such as SelfAsk, DecomP, and ReAct. These typically decompose the question into sub-questions and may use external APIs (e.g., Google Search) or dedicated sub-models to solve them.
  3. Supervised Multi-Step Open-Domain QA: iterative retrieval for open-domain QA in the fully supervised setting. These methods usually rely on supervised training on large datasets and may update the query using neural query representations and the outputs of reading-comprehension models.
  4. Retrieval-Augmented Language Models: augmenting language models with retrieval from external knowledge sources to handle knowledge-intensive tasks, using retrieval algorithms such as BM25 or techniques such as Dense Passage Retrieval (DPR).
  5. Chain-of-Thought Prompting: the CoT prompting method of Wei et al. (2022), which elicits a sequence of natural-language reasoning steps to help the model solve complex problems. IRCoT uses CoT to guide its retrieval process.
  6. Few-Shot Learning with LLMs: learning new tasks from only a few examples, which relates to IRCoT because IRCoT requires no additional training and instead solves problems through retrieval and reasoning.
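For item 4 above, the BM25 retrieval algorithm can be sketched as a simple term-weighting score. This is a generic textbook Okapi BM25 formulation with conventional default parameters (k1=1.5, b=0.75), not code from any of the cited systems:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)                            # term frequency in this doc
        score = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # Saturating term-frequency component with length normalization.
            denom = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

A retriever like this ranks paragraphs by lexical overlap with the query, which is why query formulation matters so much in iterative multi-step QA: each new CoT sentence supplies fresh terms to match against.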