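Multi-document question answering (MD-QA) involves answering questions that require synthesizing information across multiple documents.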
MD-QA introduces some unique challenges compared to other QA tasks:
Multi-hop reasoning: Answering a question may require logical reasoning across multiple passages from different documents, for example by bridging entities that appear in more than one document.
Retrieval latency: Retrieving useful passages from a large collection of documents is time-consuming.
Diverse modalities: Documents may contain varied structures such as paragraphs, lists, tables, and images. Reasoning across these modalities is difficult.
The first key idea in KGP is to construct a knowledge graph (KG) to represent the relationships between content across documents. Each passage or structural element like a table or figure becomes a node in the graph. Edges are added between nodes based on:
Lexical similarity: Nodes are connected if they share rare or key tokens. This helps associate related passages.
Semantic similarity: Passage embeddings can be compared to find semantically related nodes.
Structural relationships: Relations like “Page 1 contains Passage A” are added to capture document structure.
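Passages and structures become interconnected nodes, with edges representing lexical and semantic similarity as well as document-structure relationships. The result is a knowledge graph that encodes the relationships between content within and across documents.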
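
The lexical and semantic edge rules above can be sketched roughly as follows. This is a minimal illustration, not the KGP implementation: the stopword list, the `max_df` notion of a "rare" token, the `sim_threshold` cutoff, and the hand-written two-dimensional embeddings are all assumptions made for the example. In practice the embeddings would come from a sentence encoder.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical, tiny stopword list; a real system would use a proper keyword extractor.
STOPWORDS = {"a", "an", "the", "is", "of", "in"}

def keywords(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def rare_tokens(passages, max_df=2):
    """Tokens that occur in at most max_df passages (assumed definition of 'rare')."""
    df = Counter()
    for text in passages.values():
        df.update(keywords(text))
    return {tok for tok, n in df.items() if n <= max_df}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_edges(passages, embeddings, max_df=2, sim_threshold=0.8):
    """Return a set of (node_a, node_b, edge_type) tuples."""
    rare = rare_tokens(passages, max_df)
    edges = set()
    for a, b in combinations(sorted(passages), 2):
        # Lexical edge: the two passages share at least one rare keyword.
        if keywords(passages[a]) & keywords(passages[b]) & rare:
            edges.add((a, b, "lexical"))
        # Semantic edge: the passage embeddings are close in cosine similarity.
        if cosine(embeddings[a], embeddings[b]) >= sim_threshold:
            edges.add((a, b, "semantic"))
    return edges

passages = {
    "d1_p1": "Marie Curie discovered polonium.",
    "d2_p1": "Polonium is a radioactive element.",
    "d3_p1": "The weather is a common topic.",
}
# Toy embeddings standing in for the output of a sentence encoder.
embeddings = {"d1_p1": [1.0, 0.0], "d2_p1": [0.9, 0.1], "d3_p1": [0.0, 1.0]}
edges = build_edges(passages, embeddings)
```

Here the first two passages end up connected by both a lexical edge (they share the keyword "polonium") and a semantic edge, while the third passage stays isolated.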

However, blindly traversing the graph can lead to noisy and redundant passages.
This motivates the second key idea, which is to use a fine-tuned language model to guide the traversal:
A seq2seq model like T5 is fine-tuned to take previous passages and generate the next “evidence” passage needed to answer the question.
When traversing the KG, the model’s predicted next passage is matched against candidate nodes to pick the best node to visit next.
The retrieved nodes are added to the context, and the process repeats iteratively.
Finally, the entire context is consumed by the LLM to produce the answer.
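
The traversal loop described above might look something like the sketch below. It is only an illustration under stated assumptions: `predict_next` is a hypothetical callable standing in for the fine-tuned seq2seq model, the graph is a plain adjacency dictionary, and the prediction is matched to candidate nodes by Jaccard token overlap, which is a simple substitute for whatever matching the actual system uses.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity between two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def traverse(question, graph, passages, seeds, predict_next, max_hops=2):
    """Greedy model-guided walk over the passage graph; returns visited node ids in order."""
    visited = list(seeds)
    for _ in range(max_hops):
        context = [passages[n] for n in visited]
        # The (stand-in) model proposes the text of the next evidence passage.
        predicted = predict_next(question, context)
        # Candidates are unvisited neighbours of anything visited so far.
        candidates = {nb for n in visited for nb in graph.get(n, ())} - set(visited)
        if not candidates:
            break
        # Pick the candidate whose text best matches the prediction.
        best = max(sorted(candidates), key=lambda n: jaccard(predicted, passages[n]))
        visited.append(best)
    return visited

passages = {
    "p1": "Marie Curie discovered polonium.",
    "p2": "Polonium was named after Poland.",
    "p3": "Radium glows in the dark.",
}
graph = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1"]}

def fake_model(question, context):
    # Hypothetical stub; a real system would call the fine-tuned model here.
    return "Polonium was named after a country."

path = traverse("Which country is polonium named after?", graph, passages,
                seeds=["p1"], predict_next=fake_model, max_hops=1)
```

The passages collected in `path` would then be concatenated into the prompt that the LLM uses to produce the final answer.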
The key steps are:
Split documents into passages: Each document is segmented into individual passages, like sentences or paragraphs.
Represent passages as nodes: Every passage becomes a node in the knowledge graph.
Add edges between nodes: Edges are added between passage nodes based on:
Lexical similarity: Nodes are connected if they share rare or key extracted keywords. This captures topical similarity.
Semantic similarity: Passage embeddings are compared to find nodes that lie close together in embedding space. This connects semantically related nodes.
Add structural nodes: Additional nodes are created for structures like tables and pages extracted from PDFs.
Link structures to passages: Directed edges are added from structural nodes to passage nodes to capture “belongs to” relationships.
Use markdown for tables: The textual content of tables is represented in markdown format. This helps LLMs interpret tables correctly.
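
The structural steps at the end of this list can be illustrated with a short sketch. The node and edge layout here (a `nodes` dictionary, `(source, target, relation)` edge tuples, and the helper names `table_to_markdown` and `add_structure`) is hypothetical and chosen only for the example; PDF extraction itself is assumed to have happened already.

```python
def table_to_markdown(header, rows):
    """Render a table as markdown text so an LLM can read it as plain text."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

def add_structure(nodes, edges, page_id, passage_ids, tables):
    """Add a page node and table nodes, with directed edges from the page to its contents."""
    nodes[page_id] = {"type": "page"}
    for pid in passage_ids:
        # Directed edge: structural node -> passage node.
        edges.append((page_id, pid, "contains"))
    for table_id, (header, rows) in tables.items():
        nodes[table_id] = {"type": "table", "text": table_to_markdown(header, rows)}
        edges.append((page_id, table_id, "contains"))

nodes = {"d1_p1": {"type": "passage", "text": "Marie Curie discovered polonium."}}
edges = []
add_structure(
    nodes, edges,
    page_id="d1_page1",
    passage_ids=["d1_p1"],
    tables={"d1_table1": (["Element", "Year"], [["Polonium", 1898]])},
)
```

After this call, `nodes["d1_table1"]["text"]` holds a three-line markdown table, and `edges` records that page `d1_page1` contains both the passage and the table.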