Ever feel like you're flying blind with your LLM app?
You've built a cool RAG (Retrieval-Augmented Generation) system. It connects to your documents, answers questions, and it seems to work. But deep down, you've got that nagging feeling... is it actually good? Is it going to hallucinate and make things up, embarrassing you in front of users?
Manually testing these things is a nightmare. You ask a few questions, check a few answers, and hope for the best. But that’s not engineering; that’s guesswork. In traditional software, we have unit tests to give us confidence. We don't just ship code and cross our fingers.
So, why are we doing that with our AI apps?
Today, I want to show you how to stop guessing and start engineering. We’re going to walk through a way to bring the rigor of unit testing to your LLM applications using a fantastic open-source tool called DeepEval. Think of it as building a safety net that automatically catches when your RAG system is failing, and more importantly, tells you why.
First Things First: Setting Up a Real Testing Ground
Before we can test anything, we need a stable environment. This is the equivalent of setting up your workshop before starting a project. We're going to get our tools in order, and the star of the show is DeepEval.
We'll also grab a few other helpful libraries to handle data and calculations. The goal here isn't just to install packages; it's to build a reliable pipeline where we can trust the results. If you've ever had a project break because of a weird dependency issue, you know how important this step is.
We'll also need to connect to an LLM for the "judging" part later, so we'll set up our OpenAI API key. Don't worry, we'll also build a fallback so you can follow along even without it.
Creating Our "Source of Truth"
You can't test a system without an answer key. For our RAG system, that means two things:
- A Knowledge Base: This is the collection of documents our RAG app is supposed to know about. For this walkthrough, we'll use a small set of documents explaining what DeepEval itself does.
- A "Golden" Question Set: These are the test questions. For each question, we'll write down what a perfect answer would look like. This is our
expected_output.
This is a critical step. By creating this "gold dataset," we're defining what "good" looks like. We're no longer relying on a vague feeling; we have a concrete benchmark to measure our system against.
Here are a few examples of what our questions and ideal answers look like:
-
Query: "What is DeepEval used for?"
-
Expected Answer: "DeepEval is used to evaluate and unit test LLM applications using metrics like LLM-as-a-judge, G-Eval, and RAG metrics."
-
Query: "What does faithfulness measure in a RAG system?"
-
Expected Answer: "Faithfulness measures whether the generated answer is supported by the retrieved context and avoids hallucinations not grounded in that context."
See? We're setting clear expectations from the start.
Building Our Simple Librarian: The Retriever
Okay, we have our documents. Now, how does our RAG system find the right information to answer a question? That’s the job of the retriever.
Think of the retriever as a librarian. You ask a question, and the librarian runs off to the shelves to find the most relevant books (or, in our case, text chunks).
For this example, we’re going to build a simple but effective librarian using a classic technique called TF-IDF. It basically looks at the unique words in your query and finds documents that contain those same words. It's not the fanciest vector database on the block, but it's fast, easy to understand, and perfect for showing how all the pieces of our testing pipeline fit together.
Our custom retriever will take a query, scan our knowledge base, and return the top few most relevant document snippets. This is the "R" in RAG—the retrieval part.
Generating an Answer: The Smart Part and the Backup Plan
Once our librarian (the retriever) brings back a stack of relevant documents, the "G" in RAG—generation—kicks in. This is where a large language model, like GPT-4, reads the context we just gave it and formulates a human-like answer.
We'll set up a simple function that sends the user's query and the retrieved context to OpenAI and asks it to generate a concise answer.
But what if the API is down? Or what if you don't want to use an API key for a quick test? We'll build a backup plan.
This fallback is a simple "extractive" system. It just pulls the most relevant-sounding sentences directly from the retrieved text. It’s not as eloquent as an LLM, but it ensures our pipeline can run end-to-end, no matter what. This is just good, pragmatic engineering.
The Core Idea: Turning RAG Outputs into Test Cases
This is where everything comes together. We’re going to run our evaluation queries through the entire RAG pipeline we just built.
For each query, we will:
- Use our retriever to find the most relevant context.
- Use our generator (the LLM or the fallback) to create an answer.
- Bundle everything together into a special object called an
LLMTestCase.
This LLMTestCase is the heart of our whole operation. It’s a neat little package that contains everything we need for a single test:
- The original
input(the user's query). - The
actual_output(the answer our RAG system generated). - The
retrieval_context(the documents our retriever found). - The
expected_output(the perfect answer we wrote earlier).
By creating a list of these test cases, we’ve effectively turned our LLM's behavior into something we can systematically evaluate. We're ready for the judges.
Meet the Judges: DeepEval's LLM-Powered Metrics
So, how do we actually score these test cases? This is where the magic of DeepEval and the "LLM-as-a-Judge" concept comes in.
Instead of writing complex code to check for things like "faithfulness" or "relevancy," we use another, more powerful LLM (like GPT-4) to act as an impartial judge. We give it the test case and a set of clear instructions (the metric), and it gives us a score and, crucially, a reason for its score.
We’re going to set up a whole panel of these expert judges, each looking for something different:
- Answer Relevancy: Does the answer actually address the user's question? (Simple, but you'd be surprised how often this fails).
- Faithfulness: Is the answer grounded in the provided context? This is our chief hallucination detector. If the model makes a claim that isn't supported by the retrieved documents, this judge will flag it.
- Contextual Relevancy: Did our librarian bring back the right books? This metric checks if the retrieved context was actually relevant to the query.
- Contextual Precision & Recall: These two go deeper into the retrieval step. Precision checks if the most relevant documents are ranked highest, while Recall checks if we retrieved all the necessary information to answer the question.
- G-Eval (Custom Rubric): This one is incredibly powerful. It lets you write your own evaluation criteria in plain English. For example, we can create a "RAG Correctness" judge that scores the answer based on a simple rubric: "The answer must be correct, not invent facts, and be concise."
With this panel of metrics, we’re no longer just checking if the answer "looks right." We're performing a multi-point inspection on our entire RAG pipeline.
The Verdict: From Scores to Actionable Insights
Now for the moment of truth. We run the evaluation.
DeepEval takes our list of test cases and our panel of metric-judges and gets to work. It systematically goes through each test, collecting scores and explanations from each judge.
The final output is a beautiful, detailed report card for our RAG system, usually presented as a table. For every single question we asked, we can see exactly how it scored on Faithfulness, Relevancy, and all the other metrics.
This is where it gets really useful. You might see a pattern:
- Low Faithfulness scores? Your generation prompt might be too loose. You need to instruct your LLM more strictly to only use the provided context.
- Low Contextual Recall scores? Your retriever is the problem. It’s failing to find all the necessary information, so the generator is starved of context. Maybe it's time to upgrade from TF-IDF to a more powerful vector search.
- High Faithfulness but low Answer Relevancy? The model is correctly using the context, but it's still failing to answer the actual question. The issue might be in the generation prompt again.
This is the difference between guessing and engineering. We now have specific, actionable data that tells us exactly where our system is weak and how to fix it. We’ve moved from "I think the retriever is okay" to "I have data showing our contextual recall is 0.6, and we need to improve it."
This systematic approach is how you build reliable, production-ready AI applications. It's how you gain the confidence to ship your product, knowing you have a safety net to catch failures before your users do.




