Have you ever searched a PDF for "quarterly results," only to have the search completely miss the giant, colorful bar chart that has exactly what you need? It’s frustrating, right?
Traditional text search is, well, text-based. It’s blind. It reads the words but completely ignores the layout, the tables, the diagrams, and the figures—often the most important parts of a document. It’s like trying to understand a comic book by only reading the text in the speech bubbles. You get some of the story, but you miss all the action.
What if we could teach our search system to see the document, just like we do? That's exactly what we're going to do today. We're going to build a simple but incredibly powerful visual search pipeline using a model called ColPali. Instead of treating a PDF page as a wall of text, we'll treat it as an image, preserving all that rich visual context.
Let's get our hands dirty and build a system that can find information based on what a page looks like.
First Things First: Setting Up a Clean Workspace
Before we start building anything cool, we need to prep our environment. We’ve all been there—that dreaded "dependency hell" where one library conflicts with another, and you spend hours just trying to get things to run. We're going to sidestep that mess right from the start.
This little bit of code is our insurance policy. It uninstalls a few packages that tend to cause version conflicts (Pillow, the legacy PIL package, torchaudio, and any existing colpali-engine install) and then installs what we need, pinning Pillow below version 12 and torchaudio to 2.8.0. It’s like clearing your workbench before starting a new project.
import subprocess, sys, os, json, hashlib
def pip(cmd):
    # Run pip through the current interpreter so packages land in this environment.
    subprocess.check_call([sys.executable, "-m", "pip"] + cmd)
# Remove packages that commonly conflict, then install the versions we need.
pip(["uninstall", "-y", "pillow", "PIL", "torchaudio", "colpali-engine"])
pip(["install", "-q", "--upgrade", "pip"])
pip(["install", "-q", "pillow<12", "torchaudio==2.8.0"])
pip(["install", "-q", "colpali-engine", "pypdfium2", "matplotlib", "tqdm", "requests"])
By getting this out of the way now, the rest of our journey should be smooth sailing.
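If you want to confirm what actually got installed, a quick check like the one below can help. It's a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example and isn't part of the original setup.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names taken from the install commands above.
for name, ver in installed_versions(
    ["pillow", "torchaudio", "colpali-engine", "pypdfium2"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

If anything shows up as "not installed" or with an unexpected version, rerunning the setup cell (and restarting the runtime, if you're in a notebook) is usually the first thing to try.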
Waking Up the AI: Loading ColPali
With our environment ready, it's time to bring in the brains of the operation: the ColPali model. Think of this as loading up the specialized software that knows how to look at an image of a document and understand its contents.
We'll also handle a few housekeeping tasks here, like checking if you have a GPU (which will make things way faster) and setting the model up for inference.
import torch
import requests
import pypdfium2 as pdfium
from PIL import Image
from tqdm import tqdm
import matplotlib.pyplot as plt
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColPali, ColPaliProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
MODEL_NAME = "vidore/colpali-v1.3"
use_flash_attn = device == "cuda" and is_flash_attn_2_available()
model = ColPali.from_pretrained(
    MODEL_NAME,
    torch_dtype=dtype,
    device_map=device,
    attn_implementation="flash_attention_2" if use_flash_attn else None,
).eval()
processor = ColPaliProcessor.from_pretrained(MODEL_NAME)
You can see we’re loading both a model and a processor. The model is the heavy lifter that does the understanding, and the processor is like its assistant, preparing all the data (our images and text queries) in just the right format for the model to work with.
Turning a PDF into a Picture Book
Okay, here’s where the magic really starts. We need to take a standard PDF and convert each page into an image. This is the core idea that separates our approach from traditional text search. We're not extracting text; we're capturing a snapshot of the entire page.
Let's grab a sample PDF from the web—in this case, a research paper—and render its pages. To keep things quick for this tutorial, we'll just process the first 15 pages.
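PDF_URL = "https://arxiv.org/pdf/2407.01449.pdf"
response = requests.get(PDF_URL, timeout=60)
response.raise_for_status()  # fail early if the download did not succeed
pdf_bytes = response.content
pdf = pdfium.PdfDocument(pdf_bytes)
pages = []
MAX_PAGES = 15
for i in range(min(len(pdf), MAX_PAGES)):
    page = pdf[i]
    # scale=2 renders at twice the PDF's native 72 units per inch
    img = page.render(scale=2).to_pil().convert("RGB")
    pages.append(img)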
Now, instead of a document, we have a list of images stored in memory, with each image representing a single page. We're ready to show them to our AI.
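To get a feel for what `scale=2` means in practice, here is a rough back-of-the-envelope sketch. It assumes a US Letter page (612 x 792 PDF points), which is an assumption for illustration; the sample paper's actual page size may differ. PDF points are 1/72 inch, so a scale of 2 corresponds to roughly 144 DPI. The helper `rendered_size` is hypothetical.

```python
import math

def rendered_size(width_pt, height_pt, scale):
    """Approximate pixel size and RGB memory use of a rendered page.

    Assumes 8-bit RGB (3 bytes per pixel); a real renderer may round
    the pixel dimensions slightly differently.
    """
    width_px = math.ceil(width_pt * scale)
    height_px = math.ceil(height_pt * scale)
    n_bytes = width_px * height_px * 3
    return width_px, height_px, n_bytes

# US Letter is 612 x 792 points (assumed page size for this example).
w, h, n_bytes = rendered_size(612, 792, scale=2)
print(f"{w} x {h} px, about {n_bytes / 1e6:.1f} MB per page in memory")
# prints: 1224 x 1584 px, about 5.8 MB per page in memory
```

Under that assumption, 15 pages come to well under 100 MB, which is why holding the whole list in memory is fine here. For much longer documents you might prefer to render pages lazily, one batch at a time.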
Creating a "Visual Fingerprint" for Each Page
So, how does an AI "understand" an image? It converts it into numbers called embeddings. ColPali doesn't squeeze a page into one vector; it produces a set of vectors, one for each patch of the page image. You can think of that set as a unique "visual fingerprint" that represents the content and layout of the page. Pages that look similar will have similar fingerprints.
We're going to loop through our page images, feed them to ColPali in small batches (to avoid running out of memory), and collect these fingerprints.
page_embeddings = []
batch_size = 2 if device == "cuda" else 1
for i in tqdm(range(0, len(pages), batch_size)):
    batch_imgs = pages[i:i+batch_size]
    batch = processor.process_images(batch_imgs)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        emb = model(**batch)
    page_embeddings.extend(list(emb.cpu()))
page_embeddings = torch.stack(page_embeddings)
What we’re left with is a page_embeddings tensor. This is our searchable database! It's a collection of all the visual fingerprints for our document, ready to be compared against any question we ask.
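Since embedding is the slow step, you may want to persist the result between sessions. One possible approach, sketched below, is to derive a cache file name from a hash of the PDF bytes, the model name, and the page limit, so that a changed document or model never reuses stale embeddings. The helper name `embedding_cache_path` and the `cache` directory are hypothetical; they aren't part of the original pipeline.

```python
import hashlib
import os

def embedding_cache_path(pdf_bytes, model_name, max_pages, cache_dir="cache"):
    """Build a deterministic file path for a document's cached embeddings."""
    digest = hashlib.sha256()
    digest.update(pdf_bytes)
    digest.update(b"\0")  # separator so the fields cannot run together
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\0")
    digest.update(str(max_pages).encode("utf-8"))
    return os.path.join(cache_dir, f"{digest.hexdigest()[:16]}.pt")

# Stand-in bytes for illustration; in the tutorial this would be `pdf_bytes`.
path = embedding_cache_path(b"%PDF-1.7 example", "vidore/colpali-v1.3", 15)
print(path)
```

With a path like this, calling `torch.save(page_embeddings, path)` after the embedding loop and `torch.load(path)` on later runs would let you skip the loop whenever the file already exists.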
The Moment of Truth: Asking a Question
We’ve done all the prep work. We turned the PDF into images and created a visual fingerprint for each page. Now, let's ask a question and see if it can find the right page.
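This is where "late interaction scoring" comes in. It sounds complicated, but the idea is simple. Instead of comparing one big fingerprint for the query against one big fingerprint for the page, it compares them part by part: each token of the query is matched against every patch of the page, the best match for each token is kept, and those best matches are added up into the page's score. That fine-grained matching is what lets a short question latch onto a specific table, figure, or paragraph.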
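To make the part-by-part comparison concrete, here is a toy version of the scoring rule commonly used for late interaction (often called MaxSim): for each query vector, take its best dot-product match among the page's vectors, then sum those best matches. This is a plain-Python sketch with made-up 2-dimensional vectors, not the library's implementation; the real embeddings are much larger and the scoring runs on tensors.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, page_vecs):
    """Sum, over query vectors, of the best match among the page vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

# Toy data: two query-token vectors and two pages with three patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]
page_b = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.1]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Here `page_a` wins because it contains one patch that matches the first query vector well and another that matches the second. A single averaged vector per page would blur those two patches together.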
Let's define a function to handle this retrieval process and then ask it a real question.
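def retrieve(query, top_k=3):
    # Embed the query with the same model used for the pages.
    q = processor.process_queries([query])
    q = {k: v.to(model.device) for k, v in q.items()}
    with torch.no_grad():
        q_emb = model(**q).cpu()
    # Late interaction scores: one score per page for our single query.
    scores = processor.score_multi_vector(q_emb, page_embeddings)[0]
    top_k = min(top_k, len(scores))  # topk fails if k exceeds the number of pages
    vals, idxs = torch.topk(scores, top_k)
    return [(int(i), float(v)) for i, v in zip(idxs, vals)]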
def show(img, title):
    plt.figure(figsize=(6,6))
    plt.imshow(img)
    plt.axis("off")
    plt.title(title)
    plt.show()
query = "What is ColPali and what problem does it solve?"
results = retrieve(query, top_k=3)
for rank, (idx, score) in enumerate(results, 1):
    show(pages[idx], f"Rank {rank}: page {idx+1} (score {score:.2f})")
When you run this, you'll see the top 3 pages that the model thinks best answer the query "What is ColPali and what problem does it solve?". It's not matching keywords against extracted text; it's scoring the query against each page image as a whole, so the top hits are often an introduction, an abstract, or a section with a key diagram.
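The ranking step itself is nothing exotic. As a rough illustration, the sketch below does in plain Python roughly what `torch.topk` does for us in `retrieve`: order page indices by score and keep the best few. The scores are invented for the example, and `top_pages` is a hypothetical helper.

```python
def top_pages(scores, k):
    """Return up to k (page_number, score) pairs, best score first.

    Page numbers are 1-based, matching how the pages are displayed.
    """
    k = min(k, len(scores))  # never ask for more pages than exist
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(idx + 1, score) for idx, score in ranked[:k]]

# Invented scores for a five-page document.
example_scores = [11.2, 14.8, 9.5, 13.1, 10.0]
print(top_pages(example_scores, k=3))  # [(2, 14.8), (4, 13.1), (1, 11.2)]
```

Note that the raw scores are only meaningful relative to each other for the same query; they aren't probabilities, so comparing scores across different queries can be misleading.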
To make this a bit more reusable, we can create a little search helper that just returns the results in a clean format.
def search(query, k=5):
    return [{"page": i+1, "score": s} for i, s in retrieve(query, k)]
print(json.dumps(search("late interaction retrieval"), indent=2))
And there you have it. In just a few blocks of code, you've built a search system that doesn't just read—it sees. You’ve created embeddings once, and now you can reuse them to ask as many questions as you want, getting back a ranked list of the most visually relevant pages.
This is a powerful foundation. From here, you could scale this up to handle thousands of documents, add a search index for near-instant results, or even feed the top-ranked pages to a language model to generate a direct answer. But the core, layout-aware retrieval pipeline is right here, and it's surprisingly simple to get started. Pretty cool, huh?




