Let's be honest for a second. We've all been there. You build this beautiful RAG (Retrieval-Augmented Generation) pipeline. You feed it a dense, 50-page financial report. You ask a simple question, and the answer you get back is... almost right.
It pulls a paragraph that has all the right keywords, but it's completely out of context. The system found something that was similar, but not actually relevant. It's a subtle but infuriating problem, and it's the quiet little secret where most RAG systems start to fall apart.
The issue is that we've been relying on a clever trick: vector similarity. We chop up our documents into little chunks, turn them into a string of numbers (embeddings), and then find the chunks that are mathematically "closest" to our question. But "close" is a pretty weak substitute for "correct," especially in professional documents like legal texts, research papers, or technical manuals.
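To make "close but not correct" concrete, here's a tiny sketch of similarity-based retrieval. The three-number vectors are made up by hand for illustration (real embeddings come from a model and have hundreds or thousands of dimensions), and both chunk texts are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumed values, not produced by any real model)
chunks = {
    "Revenue grew 12% in the prior fiscal year.": [0.9, 0.4, 0.1],
    "Footnote: current-year revenue excludes the divested unit.": [0.5, 0.3, 0.8],
}
question_vec = [0.9, 0.5, 0.1]  # stands in for "How much did revenue grow this year?"

# Rank chunks by similarity to the question, highest first
ranked = sorted(
    chunks,
    key=lambda text: cosine_similarity(question_vec, chunks[text]),
    reverse=True,
)
top_chunk = ranked[0]
print(top_chunk)
```

The keyword-heavy chunk about the prior year wins the ranking, even though the footnote is the one that actually qualifies this year's number.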
The real answer often requires understanding the document's structure—knowing that the methodology section connects to the results, or that a footnote on page 3 clarifies a statement on page 27. This is where a new approach called PageIndex completely changes the game.
So, What if We Ditched Vectors Entirely?
Imagine instead of shredding a book into a pile of confetti, you just handed the AI its table of contents. That's the core idea behind PageIndex.
Instead of chunking and embedding, it builds a hierarchical tree of the document, just like a detailed table of contents. It identifies every chapter, section, and subsection, preserving the author's original structure. Then, it uses a powerful LLM (like GPT-4) to reason over this tree to figure out where the answer most likely lives.
It’s a lot like how you or I would tackle a new textbook. We don't read every single word from the start. We scan the chapter titles, look at the section headings, and think, "Ah, the part about complexity trade-offs is probably in the 'Model Architecture' section or maybe the 'Results' section."
PageIndex does exactly that. The LLM acts as a smart research assistant, navigating the document's map before it ever reads the full text. This vectorless, reasoning-first approach tends to hold up better on long, structured documents, and it's way more transparent: you can literally read the model's stated reasoning for every retrieval.
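If it helps to picture the tree, here's a hand-written miniature of what such a structure might look like for the Transformer paper. The section titles are the paper's own, but the node IDs, the summaries, and the `nodes` key for children are my assumptions for illustration, not output from PageIndex.

```python
# Hypothetical miniature tree: titles come from the paper, everything else is assumed
toy_tree = [
    {"node_id": "0001", "title": "Introduction",
     "summary": "Motivates moving away from recurrent models.", "nodes": []},
    {"node_id": "0002", "title": "Model Architecture",
     "summary": "Describes the encoder-decoder design.",
     "nodes": [
         {"node_id": "0003", "title": "Attention",
          "summary": "Defines the attention function.",
          "nodes": [
              {"node_id": "0004", "title": "Multi-Head Attention",
               "summary": "Runs several attention heads in parallel.", "nodes": []},
          ]},
     ]},
    {"node_id": "0005", "title": "Why Self-Attention",
     "summary": "Compares self-attention with recurrent and convolutional layers.",
     "nodes": []},
]

def outline(nodes, depth=0):
    """Flatten the tree into indented '[id] title' lines, depth-first."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + f"[{node['node_id']}] {node['title']}")
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines

print("\n".join(outline(toy_tree)))
```

The point is that a model reading only these titles and summaries already has enough to guess where an answer lives.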
To show you what I mean, let's get our hands dirty and walk through a real example. We're going to use PageIndex to index the most famous AI paper of the last decade: "Attention Is All You Need."
Alright, Let's Get Our Hands Dirty: The Setup
First things first, we need to get everything installed and grab our API keys. You'll need a key from both PageIndex and OpenAI for this to work.
Once you have your keys, a simple pip install will get the necessary libraries on your machine.
pip install pageindex openai requests
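Now, let's set up our clients in Python. The snippets below assume a Jupyter notebook, since later ones use top-level await; in a plain script you'd wrap those calls in an async function and run it with asyncio.run. I like using getpass so I don't accidentally hardcode my secret keys.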
from pageindex import PageIndexClient
import pageindex.utils as utils
import os
from getpass import getpass
PAGEINDEX_API_KEY = getpass('Enter PageIndex API Key: ')
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
We'll do the same for OpenAI. We also need a little helper function to make calling the LLM a bit cleaner. We'll make it asynchronous, using the AsyncOpenAI client that ships with the current OpenAI library.
import openai
OPENAI_API_KEY = getpass('Enter OpenAI API Key: ')
async def call_llm(prompt, model="gpt-4-turbo", temperature=0):
    # Send a single-turn prompt and return the model's reply as plain text
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content.strip()
(Note: I've updated the model to gpt-4-turbo from the original's gpt-5.4, which isn't publicly available as of writing this. The logic remains the same!)
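One practical note: the later snippets call this helper with a bare `await`, which works in a notebook but not at the top level of a regular script. Here's a minimal sketch of the script-friendly pattern. It uses a stand-in coroutine (`fake_call_llm` is hypothetical and just echoes its input) so it runs without any API key.

```python
import asyncio

async def fake_call_llm(prompt):
    # Stand-in for the real network call so this sketch runs offline
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def main():
    # Every awaited call lives inside one coroutine...
    first = await fake_call_llm("tree search prompt")
    second = await fake_call_llm("answer prompt")
    return first, second

# ...and the script starts the event loop exactly once
results = asyncio.run(main())
print(results)
```

Swapping `fake_call_llm` for the real helper should be the only change needed, assuming the keys are already set.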
With our tools ready, we can move on to the fun part.
Step 1: Turning a PDF into a 'Smart' Table of Contents
We're going to grab the Transformer paper straight from arXiv. The code below will download it, save it locally, and then submit it to the PageIndex API.
# ─────────────────────────────────────────────
# Step 1: Build the PageIndex Tree
# ─────────────────────────────────────────────
# 1.1 Download the Transformer paper and submit it
import os, requests
pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"
pdf_path = os.path.join("data", pdf_url.split("/")[-1])
os.makedirs("data", exist_ok=True)
print("Downloading 'Attention Is All You Need'...")
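response = requests.get(pdf_url, timeout=60)
response.raise_for_status()  # fail loudly on a bad download instead of saving an error page
with open(pdf_path, "wb") as f:
    f.write(response.content)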
print(f" Saved to {pdf_path}")
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print(f" Document submitted. doc_id: {doc_id}")
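Before submitting, it can be worth checking that what you downloaded really is a PDF, since a failed request can hand you an HTML error page instead. Here's a small sketch. `save_pdf` is a hypothetical helper of my own, and the `%PDF-` check is only a cheap sanity test on the file header, not full validation.

```python
import tempfile
from pathlib import Path

def save_pdf(data: bytes, path: str) -> Path:
    """Write bytes to disk only if they start with the PDF header marker."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("response does not look like a PDF")
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target

# Demo with fake bytes instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf(b"%PDF-1.5 fake body", str(Path(tmp) / "data" / "paper.pdf"))
    saved_ok = saved.exists()

try:
    save_pdf(b"<html>Not found</html>", "ignored.pdf")
    rejected = False
except ValueError:
    rejected = True

print(saved_ok, rejected)
```

In the real flow you'd pass `response.content` as `data`.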
Once submitted, PageIndex gets to work processing the PDF. This can take a minute or two, so we'll write a little loop that polls the API until our document tree is ready.
# 1.2 Retrieve the tree (poll until ready)
import time
print("\nWaiting for PageIndex tree to be ready", end="")
while not pi_client.is_retrieval_ready(doc_id):
    print(".", end="", flush=True)
    time.sleep(5)
tree = pi_client.get_tree(doc_id, node_summary=True)["result"]
print("\n\n Document Tree Structure:")
utils.print_tree(tree)
When this is done, you'll see a beautifully structured, hierarchical tree printed to your console. Every section, from the "Abstract" to "Conclusion" and all the nested subsections in between, is now a "node" in our tree. This is our map.
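One caveat on the polling loop above: it will spin forever if processing never finishes. Here's a hedged sketch of the same loop with a timeout. `wait_until` is a hypothetical helper (not part of PageIndex), and the readiness check is passed in as a function so the sketch can be tried without the API.

```python
import time

def wait_until(is_ready, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s seconds have passed."""
    deadline = clock() + timeout_s
    while not is_ready():
        if clock() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        sleep(interval_s)

# Demo with a fake check that succeeds on the third call, and no real sleeping
calls = {"n": 0}

def ready_on_third_check():
    calls["n"] += 1
    return calls["n"] >= 3

wait_until(ready_on_third_check, sleep=lambda seconds: None)
print(f"ready after {calls['n']} checks")
```

With the real client, the call would look like `wait_until(lambda: pi_client.is_retrieval_ready(doc_id))`.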
Step 2: The Magic Moment - Asking the LLM to Think
Now for the main event. We're going to ask a question that is intentionally tricky for traditional RAG systems—one that requires connecting ideas from different parts of the paper.
Our query: "Why did the authors choose self-attention over recurrence, and what are the complexity trade-offs they compared?"
A simple vector search might latch onto the attention subsections under "Model Architecture" and stop there. But in this paper the complexity trade-offs sit in a separate section, "Why Self-Attention," alongside Table 1, so a good retrieval has to pull from more than one place.
Here's how we handle it with PageIndex. We take our tree, but we strip out all the full text. We only give the LLM the node titles and their summaries. Then we wrap it all in a prompt.
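To make the "strip out the full text" step concrete, here's a rough stand-in for what a field-removing helper could do. This is my own sketch, not the implementation behind `utils.remove_fields`. `strip_fields` and the toy tree are hypothetical, and the `nodes` child key is an assumption.

```python
def strip_fields(node, fields):
    """Return a copy of a nested dict/list structure without the given keys."""
    if isinstance(node, dict):
        return {
            key: strip_fields(value, fields)
            for key, value in node.items()
            if key not in fields
        }
    if isinstance(node, list):
        return [strip_fields(item, fields) for item in node]
    return node  # strings, numbers, None: nothing to strip

# Hypothetical two-level tree with placeholder text
toy_tree = [
    {"node_id": "0001", "title": "Model Architecture", "text": "long body A",
     "nodes": [
         {"node_id": "0002", "title": "Attention", "text": "long body B", "nodes": []},
     ]},
]

outline_only = strip_fields(toy_tree, fields=["text"])
print(outline_only)
```

Because it builds a new structure rather than editing in place, the original tree keeps its text for the later answer step.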
# ─────────────────────────────────────────────
# Step 2: Reasoning-Based Retrieval
# ─────────────────────────────────────────────
# 2.1 Define a query that requires navigating across sections
import json
query = "Why did the authors choose self-attention over recurrence, and what are the complexity trade-offs they compared?"
# We only need the structure for the LLM to reason over
tree_without_text = utils.remove_fields(tree.copy(), fields=["text"])
search_prompt = f"""
You are given a question and a hierarchical tree structure of a research paper.
Each node has a node_id, title, and a summary of its content.
Your task: identify ALL nodes that are likely to contain information relevant to answering the question.
Think carefully -- the answer may be spread across multiple sections.
Question: {query}
Document tree: {json.dumps(tree_without_text, indent=2)}
Reply ONLY in this JSON format, no preamble:
{{
"thinking": "<step-by-step reasoning about which nodes are relevant and why>",
"node_list": ["node_id_1", "node_id_2", ...]
}}
"""
print(f' Query: "{query}"\n')
print("Running tree search with GPT-4...")
tree_search_result = await call_llm(search_prompt)
# 2.2 Inspect the retrieval reasoning and matched nodes
node_map = utils.create_node_mapping(tree)
result_json = json.loads(tree_search_result)
print("\n LLM Reasoning:")
utils.print_wrapped(result_json["thinking"])
print("\n Retrieved Nodes:")
for node_id in result_json["node_list"]:
    node = node_map[node_id]
    print(f" • [{node['node_id']}] Page {node['page_index']:>2} -- {node['title']}")
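In practice, models sometimes wrap their JSON in a Markdown code fence or name a node that isn't in the tree. A bare `json.loads` chokes on the former, and a direct dictionary lookup fails on the latter. Here's a defensive sketch, where `parse_node_selection` is a hypothetical helper of mine and the sample reply is invented.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the example stays easy to paste

def parse_node_selection(raw, known_ids):
    """Parse the model's reply and keep only node IDs that exist in the tree."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag)...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ...and everything from the closing fence onward
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    node_ids = [nid for nid in data.get("node_list", []) if nid in known_ids]
    return data.get("thinking", ""), node_ids

# Invented reply: fenced JSON that also names one node that does not exist
raw_reply = (
    FENCE
    + 'json\n{"thinking": "Section 4 covers complexity.", "node_list": ["0005", "9999"]}\n'
    + FENCE
)
thinking, node_ids = parse_node_selection(raw_reply, known_ids={"0001", "0005"})
print(thinking, node_ids)
```

In the walkthrough this would be called as `parse_node_selection(tree_search_result, node_map.keys())`.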
This is the absolute core of what makes PageIndex so different. The LLM isn't searching text; it's analyzing a structure. You can see its step-by-step reasoning, which is incredibly useful for debugging and building trust in the system. It will likely identify the introduction (for motivation), the model architecture section, and the "Why Self-Attention" section, where Table 1 compares per-layer complexity, sequential operations, and maximum path length.
Step 3: Putting It All Together for a Grounded Answer
Once the LLM has given us its list of promising node IDs, we fetch the full text for just those sections and stitch them together into a single context block for the final answer-generation prompt.
# ─────────────────────────────────────────────
# Step 3: Answer Generation
# ─────────────────────────────────────────────
# 3.1 Stitch together context from all retrieved nodes
node_list = result_json["node_list"]
relevant_content = "\n\n---\n\n".join(
    f"[Section: {node_map[nid]['title']}]\n{node_map[nid]['text']}"
    for nid in node_list
)
print(f"\n Retrieved Context Preview (first 1200 chars):\n")
utils.print_wrapped(relevant_content[:1200] + "...\n")
# 3.2 Generate a structured answer grounded in the retrieved sections
answer_prompt = f"""
You are a technical assistant. Answer the question below using ONLY the provided context.
Be specific -- reference actual design choices, numbers, and trade-offs mentioned in the text.
Question: {query}
Context: {relevant_content}
Structure your answer as:
1. The core motivation for choosing self-attention
2. The specific complexity comparisons made (include any tables or numbers)
3. Any caveats or limitations the authors acknowledged
"""
print(" Generating answer...\n")
answer = await call_llm(answer_prompt)
print("─" * 60)
print(" Final Answer:\n")
utils.print_wrapped(answer)
print("─" * 60)
The final answer is grounded, precise, and draws from multiple parts of the source document, just like a human expert would. No more out-of-context nonsense.
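One thing to watch: if the model selects many large sections, the stitched context can grow past what you want to send. Here's a minimal sketch of a capped version of the stitching step. `build_context` and the `max_chars` budget are my own assumptions, and a character count is only a rough proxy for tokens.

```python
def build_context(node_ids, node_map, max_chars=20000):
    """Join section blocks in order, stopping before the character budget is exceeded."""
    parts, used = [], 0
    for nid in node_ids:
        node = node_map.get(nid)
        if node is None or not node.get("text"):
            continue  # nothing to add for unknown IDs or empty sections
        block = f"[Section: {node['title']}]\n{node['text']}"
        if used + len(block) > max_chars:
            break  # keep earlier sections intact rather than truncating mid-section
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)

# Toy node map with placeholder text
toy_map = {
    "a": {"title": "A", "text": "x" * 10},
    "b": {"title": "B", "text": "y" * 10},
}
full = build_context(["a", "missing", "b"], toy_map)
capped = build_context(["a", "b"], toy_map, max_chars=25)
print(full)
```

Stopping at a section boundary is a design choice here: a half-included section can mislead the model more than a missing one.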
One More Time: Proving the Tree is Reusable
The best part? That tree we built is reusable. We don't have to re-index or re-process anything to ask another question. Let's try a more focused one.
query2 = "How does the multi-head attention mechanism work, and what is the role of scaling in dot-product attention?"
search_prompt2 = f"""
You are given a question and a hierarchical tree structure of a research paper. Identify all nodes likely to contain the answer.
Question: {query2}
Document tree: {json.dumps(tree_without_text, indent=2)}
Reply ONLY in this JSON format:
{{
"thinking": "<reasoning>",
"node_list": ["node_id_1", ...]
}}
"""
print(f'\n\n Second Query: "{query2}"\n')
result2_raw = await call_llm(search_prompt2)
result2 = json.loads(result2_raw)
print(" Reasoning:")
utils.print_wrapped(result2["thinking"])
relevant_content2 = "\n\n---\n\n".join(
    f"[Section: {node_map[nid]['title']}]\n{node_map[nid]['text']}"
    for nid in result2["node_list"]
)
answer_prompt2 = f"""
Answer the following question using ONLY the provided context. Explain the mechanism clearly, as if for a technical blog post.
Question: {query2}
Context: {relevant_content2}
"""
answer2 = await call_llm(answer_prompt2)
print("\n Answer:\n")
utils.print_wrapped(answer2)
The same process repeats: the LLM scans the same tree, homes in on the subsections covering "Scaled Dot-Product Attention" and "Multi-Head Attention," we pull the text, and generate an answer grounded in exactly those passages. It's efficient, intelligent, and just makes a lot more sense.
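Since both queries follow the same two-step pattern, it's natural to fold it into one function. Here's a sketch under a few assumptions: `answer_query` is a hypothetical wrapper, the LLM is passed in as any async callable (so `call_llm` would fit), and `fake_llm` is a canned stand-in that lets the example run offline.

```python
import asyncio
import json

async def answer_query(query, tree_outline, node_map, llm):
    """Tree search first, then answer from the selected sections only."""
    search_prompt = (
        "Identify all nodes likely to contain the answer.\n"
        f"Question: {query}\n"
        f"Document tree: {json.dumps(tree_outline)}\n"
        'Reply ONLY as JSON: {"thinking": "<reasoning>", "node_list": ["<node_id>"]}'
    )
    selection = json.loads(await llm(search_prompt))
    node_ids = [nid for nid in selection["node_list"] if nid in node_map]
    context = "\n\n---\n\n".join(
        f"[Section: {node_map[nid]['title']}]\n{node_map[nid]['text']}"
        for nid in node_ids
    )
    answer_prompt = (
        "Answer the question using ONLY the provided context.\n"
        f"Question: {query}\nContext: {context}"
    )
    return {"node_list": node_ids, "answer": await llm(answer_prompt)}

async def fake_llm(prompt):
    # Canned replies: one for the tree-search step, one for the answer step
    if "Document tree:" in prompt:
        return json.dumps({"thinking": "pick attention", "node_list": ["0002"]})
    return "answered from " + str(prompt.count("[Section:")) + " section(s)"

toy_outline = [{"node_id": "0002", "title": "Multi-Head Attention",
                "summary": "Runs several attention heads in parallel."}]
toy_node_map = {"0002": {"title": "Multi-Head Attention",
                         "text": "Placeholder section text."}}

result = asyncio.run(
    answer_query("How does multi-head attention work?", toy_outline, toy_node_map, fake_llm)
)
print(result)
```

Passing the LLM in as an argument also makes the pipeline easy to test, because the retrieval logic can be exercised without spending a single token.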
This shift from "similarity" to "reasoning" feels like a major step forward. It suggests that the future of AI interacting with our documents will be less about clever math tricks and more about genuine, structural understanding. We're teaching the machine not just to match words, but to actually read. And that's a future I'm pretty excited about.




