RAG Without Vectors – PageIndex: Reasoning-Based Document Indexing

12 points by vectify_AI 3 months ago

We were frustrated by vector-based RAG systems that rely on semantic similarity and often fail on long, domain-specific documents. In these contexts, domain-specific terminology tends to be semantically similar, making it hard to retrieve the exact content users need. It’s also difficult to incorporate expert knowledge or user preferences effectively. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we came up with a reasoning-based RAG system that uses tree search to guide retrieval.

We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.

Some highlights:

- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.

- Precise Referencing: Each node includes a summary and exact physical page numbers.

- Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
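
To make the three points above concrete, here is a minimal sketch of what one node of such a tree might look like. The field names (`title`, `summary`, `start_page`, `end_page`, `children`) and the sample report are assumptions for illustration, not PageIndex's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's own)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section (1-based, inclusive)
    end_page: int    # last physical page of the section (inclusive)
    children: List["Node"] = field(default_factory=list)


def render_toc(node: Node, depth: int = 0) -> str:
    """Render the tree as an indented table of contents an LLM can read."""
    indent = "  " * depth
    line = f"{indent}{node.title} (pp. {node.start_page}-{node.end_page}): {node.summary}"
    return "\n".join([line] + [render_toc(child, depth + 1) for child in node.children])


# Made-up example document.
report = Node("Annual Report", "Full-year results.", 1, 120, [
    Node("Risk Factors", "Market and credit risks.", 10, 34),
    Node("Financial Statements", "Audited statements.", 35, 120, [
        Node("Balance Sheet", "Assets and liabilities.", 36, 38),
    ]),
])

toc = render_toc(report)
print(toc)
```

Because each node carries its own page range, whatever the LLM selects from the rendered outline can be mapped straight back to the pages to read.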

We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy compared to vector-based systems.

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!

bsenftner 3 months ago

Very interesting work. What are your opinions of GraphRAG and the variations?

I'm currently evaluating systems that extend RAG as your PageIndex project does, with an eye on adaptability to new information.

A good portion of my work involves legal issues and case law. With the US going through a lot of legal transformations under the new administration, I am seeking a system that can ingest new information that imposes new rules on the handling of information, where those new rules take precedence over any similar rules already in the knowledgebase.

This ingestion of new information, and its logical resolution within the larger knowledgebase, needs to be efficient too. The initial GraphRAG is expensive to begin with, and does not appear to have any optimized handling for ingesting new, conflicting information. The GraphRAG variants that are getting a lot of attention now appear to be addressing the lack of efficiency in the original GraphRAG implementation. Where does PageIndex sit within this group of similar offerings?

mingtianzhang 3 months ago

Hi Bsenftner, thanks for your interest.
We built PageIndex to enable reasoning-based RAG. When we previously designed a RAG system for financial documents, we encountered two main challenges:
1. Traditional embedding-based RAG often returns redundant information because all the financial terms are semantically similar.
2. We want to incorporate expert experience into the RAG process—specifically, experts often have a preferred order of where to look first.
To address these, we developed PageIndex, which transforms long documents into a structured “table of contents.” This allows the LLM to selectively retrieve relevant nodes based on reasoning. With this approach, we can do few-shot learning by providing examples of expert preferences directly in the prompt, enabling the LLM to choose nodes more like a domain expert would.
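
As a rough sketch of the few-shot idea above: the table of contents and a handful of expert examples go into one prompt, and the model answers with node ids. The table of contents, the examples, the prompt wording, and the `fake_llm` stand-in are all assumptions for illustration; a real system would call an actual model there.

```python
import json
from typing import Callable, Dict, List

# Hypothetical table of contents: node id -> title with page range.
TOC: Dict[str, str] = {
    "1": "Business Overview (pp. 1-9)",
    "2": "Risk Factors (pp. 10-34)",
    "3": "Notes to Financial Statements (pp. 60-120)",
}

# Made-up few-shot examples encoding where an expert would look first.
EXPERT_EXAMPLES = [
    {"question": "What is the lease liability?", "nodes": ["3"]},
    {"question": "What could hurt revenue next year?", "nodes": ["2", "1"]},
]


def build_prompt(question: str) -> str:
    """Assemble a node-selection prompt from the TOC and expert examples."""
    toc_lines = [f"[{node_id}] {title}" for node_id, title in TOC.items()]
    shots = [f"Q: {ex['question']}\nA: {json.dumps(ex['nodes'])}" for ex in EXPERT_EXAMPLES]
    return (
        "Pick the sections to read, most promising first. "
        "Answer with a JSON list of node ids.\n\n"
        "Table of contents:\n" + "\n".join(toc_lines) + "\n\n"
        "Expert examples:\n" + "\n\n".join(shots) + "\n\n"
        f"Q: {question}\nA:"
    )


def select_nodes(question: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for node ids and keep only ids that exist in the TOC."""
    reply = json.loads(llm(build_prompt(question)))
    return [node_id for node_id in reply if node_id in TOC]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '["3", "99"]'


selected = select_nodes("How are pensions accounted for?", fake_llm)
print(selected)
```

Filtering the reply against the known node ids guards against the model inventing a section that does not exist.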
-----
In your case, it sounds like you're looking for a system that can automatically build and continuously update a knowledge base as new data arrives. You might benefit from something like:
1. Using expert knowledge to define a template knowledge graph—e.g., specifying entity types, link types, or a rough graph structure.
2. Building an agent that updates the knowledge graph when new documents are received. The agent’s tasks could include:
a. Identifying new information relevant to existing nodes or links.
b. Determining whether this new information changes the current knowledge graph.
c. Updating the graph accordingly.
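
Steps a to c could be sketched roughly as follows. Everything here is an assumption for illustration: the `Rule` and `RuleGraph` names, exact topic matching as a stand-in for the LLM reasoning step, and "later effective date wins" as the precedence policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    topic: str      # what the rule governs, e.g. "data retention"
    text: str
    effective: str  # ISO date, so plain string comparison orders correctly
    superseded_by: Optional[str] = None


class RuleGraph:
    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}

    def related(self, new: Rule) -> List[str]:
        """Step a: find active rules on the same topic (an LLM would do this)."""
        return [rid for rid, rule in self.rules.items()
                if rule.topic == new.topic and rule.superseded_by is None]

    def ingest(self, rule_id: str, new: Rule) -> List[str]:
        """Steps b and c: decide which rule wins, then update the graph."""
        replaced = []
        for rid in self.related(new):
            old = self.rules[rid]
            if old.effective < new.effective:
                old.superseded_by = rule_id  # the newer rule takes precedence
                replaced.append(rid)
            else:
                new.superseded_by = rid      # the incoming rule is already outdated
        self.rules[rule_id] = new
        return replaced

    def active(self, topic: str) -> List[str]:
        """Ids of rules on a topic that are currently in force."""
        return [rid for rid, rule in self.rules.items()
                if rule.topic == topic and rule.superseded_by is None]


graph = RuleGraph()
graph.ingest("r1", Rule("data retention", "Keep records 5 years.", "2020-01-01"))
replaced = graph.ingest("r2", Rule("data retention", "Keep records 7 years.", "2025-03-01"))
print(replaced, graph.active("data retention"))
```

Superseded rules stay in the graph with a pointer to their replacement, so the history of a rule remains queryable.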
Since your use case involves logical reasoning (not just semantic similarity), PageIndex and reasoning-based RAG could play a helpful role here. A traditional graph-based RAG might still be used at inference (question answering) time, while PageIndex and reasoning-based RAG assist during the knowledge graph update phase by identifying the parts of new documents that relate to the existing graph. Additionally, the tree structure produced by PageIndex can be used as an initialization for building your knowledge graph.
Hope this is helpful! Mingtian
- bsenftner 3 months ago
  
  Thank you for the continued interest and support. I've got PageIndex working now, and am in my exploratory R&D period. The space is deep and dynamic, plus I've got non-R&D responsibilities too. You'll be hearing from me as I start to integrate.

vectify_AI 3 months ago

GitHub repo: https://github.com/VectifyAI/PageIndex/ Open to feedback and suggestions.

chiccomagnus 3 months ago

Have you compared this solution with tools like Preprocess, Reducto, etc.? I'm curious about the performance gain you can achieve with your approach.

mingtianzhang 3 months ago

Hi, I think our approach can benefit from these OCR tools to generate a better tree for search. We will add more on this point to our GitHub repo. Thanks for raising this question!

Imanari 3 months ago

Interesting work! How do you construct the relationship between nodes if not all documents fit into context?

mingtianzhang 3 months ago

Hi Imanari! That’s essentially one of the key challenges we’re aiming to address with our PageIndex package.
We’ve designed two LLM functions:
a. LLM Function 1: init_content -> initial_structure
b. LLM Function 2: (previous_structure, current_content) -> current_structure
The idea is to split a long document into several page groups (each within the context window size). You first apply Function 1 to the first group to get the initial structure, then apply Function 2 in a for-loop over the remaining page groups, with each call extending the structure produced by the previous one.
This approach is commonly used in representation learning for time-series data. We'll be releasing a technical report on it soon as well.
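
A minimal sketch of that loop, assuming the structure is simplified to a flat list of section titles. The two stand-in functions below just collect lines that start with `# `; in the real setup each would be an LLM call, and all names here are hypothetical.

```python
from typing import Callable, List

Structure = List[str]  # simplified: a flat list of section titles


def group_pages(pages: List[str], group_size: int) -> List[List[str]]:
    """Split the document into consecutive page groups."""
    return [pages[i:i + group_size] for i in range(0, len(pages), group_size)]


def build_structure(
    pages: List[str],
    group_size: int,
    init_fn: Callable[[List[str]], Structure],
    update_fn: Callable[[Structure, List[str]], Structure],
) -> Structure:
    """Apply Function 1 to the first group, then Function 2 to each later group."""
    groups = group_pages(pages, group_size)
    if not groups:
        return []
    structure = init_fn(groups[0])
    for group in groups[1:]:
        structure = update_fn(structure, group)
    return structure


# Stand-ins for the two LLM functions (assumptions, not the real prompts).
def init_stub(group: List[str]) -> Structure:
    return [page for page in group if page.startswith("# ")]


def update_stub(previous: Structure, group: List[str]) -> Structure:
    return previous + [page for page in group if page.startswith("# ")]


pages = ["# Intro", "body text", "# Methods", "body text", "# Results"]
structure = build_structure(pages, 2, init_stub, update_stub)
print(structure)
```

Only the running structure is carried between iterations, so no single call ever needs the whole document in context.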
Mingtian
- Imanari 3 months ago
  
  Thanks! I have thought about similar approaches of iteratively building the content graph of a document base, as you described. I worry about scaling, though. IIUC both previous_structure and current_content must fit into context while previous_structure is getting bigger with each iteration, correct?
  EDIT: follow up question, how long does the structure-building take for 100 pages and how big are the chunks you are feeding in?
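
The scaling worry above can be put into rough numbers. This is a back-of-the-envelope sketch with made-up token counts: it assumes every page group has the same size, that the structure grows by a fixed amount per group, and it ignores prompt overhead and output tokens.

```python
from typing import Optional


def first_overflow(
    context_limit: int,
    group_tokens: int,
    structure_tokens_per_group: int,
    num_groups: int,
) -> Optional[int]:
    """Return the 1-based index of the first page group whose call no longer
    fits in the context window, or None if every call fits."""
    structure = 0  # tokens of previous_structure before the first call
    for i in range(1, num_groups + 1):
        # The call for group i carries the structure so far plus the group itself.
        if structure + group_tokens > context_limit:
            return i
        structure += structure_tokens_per_group
    return None


# Hypothetical numbers: 8,000-token window, 6,000-token page groups,
# and 500 tokens of structure added per group.
overflow_at = first_overflow(8000, 6000, 500, 10)
print(overflow_at)
```

Under these assumptions, smaller page groups or a more compact structure delay the overflow but do not remove it.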
