Let's Build an AI Agent That Fact-Checks Itself with LlamaIndex and OpenAI

Akram Chauhan

Have you ever asked a chatbot a question and gotten an answer that was confidently, completely, and utterly wrong?

It’s a weirdly common experience. We’ve all seen AI "hallucinate"—just making things up and presenting them as fact. It’s one of the biggest roadblocks holding AI back from being truly useful for serious research or analysis. A simple question-and-answer bot is fun, but it’s not something you’d bet your business on.

But what if we could build something smarter? What if we could design an AI agent that works less like a glib know-it-all and more like a diligent researcher? An agent that doesn’t just spit out an answer, but first looks up the evidence, forms a conclusion, and then—this is the magic part—critiques its own work before showing it to you.

That's exactly what we're going to do today. We're rolling up our sleeves and building a simple but powerful agentic system using two of my favorite tools: LlamaIndex and OpenAI. This isn't just another chatbot. This is a system that can reason, use tools, and evaluate itself. Let's get started.

First Things First: Setting Up Our Workshop

Before we can build anything, we need to get our tools and materials in order. Think of this as clearing the workbench and laying out everything we need. We'll start by installing a few key Python libraries.

pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio

This command grabs LlamaIndex, the OpenAI connectors, and a handy little tool called nest_asyncio that helps us run asynchronous code smoothly in environments like Jupyter notebooks.
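If you're curious what nest_asyncio is actually working around, here's a minimal sketch using only the standard library. It assumes a plain Python process without nest_asyncio applied, and the function names are made up for illustration. A notebook already has an event loop running, and stock asyncio refuses to start a second one inside it:

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # try to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # clean up the coroutine that was never awaited
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With nest_asyncio.apply() in place, that nested call is allowed, which is why the rest of this tutorial can drive async code from a notebook cell.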

Next, we need to give our project access to the OpenAI API. The last thing you want to do is hardcode your secret API key into your script, where it can leak through version control or a shared notebook. Instead, we'll read it from the environment and prompt for it only when it's missing.

import os
import asyncio
import nest_asyncio

nest_asyncio.apply()

from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

This little snippet checks if the API key is already set. If not, getpass prompts you for it without echoing what you type, and the key is stored only in the current process's environment rather than in your source file.

Giving Our Agent a Brain (and a Library)

An agent is useless without knowledge. We need to give it a source of truth—a small, curated library it can consult to find answers. For this example, we’ll create a tiny knowledge base right in our code.

We also need to choose the AI models that will power our agent. We'll use two: a language model for reasoning (gpt-4o-mini, one of OpenAI's smaller and cheaper models) and an embedding model that turns text into vectors so passages can be compared by meaning (text-embedding-3-small).

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure our models
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Here's our mini knowledge base
texts = [
    "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
    "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
    "Tool-using agents require constrained tools, validation, and self-review loops.",
    "A robust workflow follows retrieve, answer, evaluate, and revise steps."
]

# Turn our texts into 'Documents' that LlamaIndex can understand
docs = [Document(text=t) for t in texts]

# Create a searchable index from our documents
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

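So, what did we just do? We set up our AI models and then created a few short text snippets about building reliable AI. Then, we used VectorStoreIndex to turn that text into a searchable database. You can think of this index as a super-fast librarian: when our agent needs information, the index embeds the question and returns the snippets whose embeddings sit closest to it. One caveat: with only four documents and similarity_top_k=4, every query gets the whole library back, so retrieval only becomes selective once the knowledge base grows.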
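To make the librarian metaphor concrete, here's a toy sketch of top-k similarity search in plain Python. It is not how LlamaIndex or OpenAI embeddings work internally: the embed function below is a made-up stand-in that just counts words, whereas a real embedding model produces dense vectors that capture meaning. The ranking idea is the same, though: score every document against the query and keep the k best.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, corpus: List[str], k: int) -> List[str]:
    """Return the k documents most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


corpus = [
    "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
    "Tool-using agents require constrained tools, validation, and self-review loops.",
    "A robust workflow follows retrieve, answer, evaluate, and revise steps.",
]
best = top_k("How do I evaluate faithfulness and relevancy?", corpus, k=1)
print(best[0])
```

Here the evaluation snippet wins because it shares the most words with the question. Word overlap is a crude proxy, which is exactly why the real pipeline uses a learned embedding model instead.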

The Agent's Superpowers: Retrieval and Self-Evaluation

This is where things get really interesting. An agent is defined by the tools it can use. We're going to give our agent two fundamental skills:

  1. The ability to retrieve evidence.
  2. The ability to score its own answers.

Let’s build these tools. First, the evidence retriever. This is a simple function that takes a question, uses the query engine we just built, and pulls out the most relevant text snippets from our knowledge base.

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Set up evaluators that will score the agent's answers
faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)

# Tool #1: A function to retrieve evidence
def retrieve_evidence(q: str) -> str:
    """Retrieve numbered evidence snippets from the knowledge base for a question."""
    response = query_engine.query(q)
    out = []
    for i, n in enumerate(response.source_nodes or []):
        # Truncate each snippet to 300 characters so the tool output stays short
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)

# Tool #2: A function to score an answer
def score_answer(q: str, a: str) -> str:
    """Score an answer to a question for faithfulness and relevancy."""
    response = query_engine.query(q)
    contexts = [n.node.get_content() for n in response.source_nodes or []]

    # How faithful is the answer to the evidence?
    faith = faith_eval.evaluate(query=q, response=a, contexts=contexts)

    # How relevant is the answer to the original question?
    rel = rel_eval.evaluate(query=q, response=a, contexts=contexts)

    return f"Faithfulness: {faith.score}\nRelevancy: {rel.score}"

See what's happening here? The score_answer tool is the secret sauce. It uses two pre-built LlamaIndex evaluators:

  • FaithfulnessEvaluator: This checks if the agent's answer is factually supported by the evidence it retrieved. In other words, "Did you make this up, or is it in the source material?"
  • RelevancyEvaluator: This checks if the answer actually addresses the user's original question. "Is this a good answer, or did you go off on a tangent?"

By giving our agent these tools, we're empowering it to check its own work.
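One thing worth noticing: the agent will later be told to revise "if scores are low", but score_answer leaves the meaning of "low" up to the model. Here's a hedged sketch of how the report could spell out a verdict instead. ScoreStandIn is a hypothetical stand-in for an evaluator result (only a score attribute is assumed), and the 0.5 threshold is an arbitrary choice for illustration, not something the evaluators prescribe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreStandIn:
    """Hypothetical stand-in for an evaluator result; only `score` is used."""
    score: Optional[float]


def verdict(faith: ScoreStandIn, rel: ScoreStandIn, threshold: float = 0.5) -> str:
    """Format both scores and add an explicit REVISE/OK line."""
    lines = []
    needs_revision = False
    for name, result in (("Faithfulness", faith), ("Relevancy", rel)):
        if result.score is None:
            # A missing score is treated as a reason to revise (an assumption).
            lines.append(f"{name}: unavailable")
            needs_revision = True
        else:
            lines.append(f"{name}: {result.score:.2f}")
            needs_revision = needs_revision or result.score < threshold
    lines.append("Verdict: REVISE" if needs_revision else "Verdict: OK")
    return "\n".join(lines)


report = verdict(ScoreStandIn(1.0), ScoreStandIn(0.0))
print(report)
```

An explicit verdict line gives the model less room to talk itself into accepting a weak answer.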

Bringing It All Together: Creating the Agent

Now we have our knowledge base and our tools. It's time to assemble the agent itself. We'll use a ReActAgent from LlamaIndex. "ReAct" stands for Reason and Act, which is a popular framework for getting agents to think step-by-step.
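If ReAct is new to you, this toy loop shows the thought, action, observation rhythm in plain Python. It's only a sketch: the "LLM" is a hard-coded script, the lookup tool and its notes are invented for illustration, and none of this is LlamaIndex's actual implementation. A real ReAct agent generates each thought and action from the conversation so far.

```python
from typing import Callable, Dict, List, Tuple

# A scripted stand-in for the LLM: each step is (thought, action, action_input).
SCRIPT: List[Tuple[str, str, str]] = [
    ("I need evidence first.", "lookup", "rag"),
    ("I have enough to answer.", "finish", "RAG separates retrieval from synthesis."),
]


def lookup(term: str) -> str:
    """Hypothetical tool: look a term up in a tiny in-memory table."""
    notes = {"rag": "Reliable RAG separates retrieval, synthesis, and verification."}
    return notes.get(term, "no match")


TOOLS: Dict[str, Callable[[str], str]] = {"lookup": lookup}


def react_loop(
    script: List[Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
) -> Tuple[str, List[str]]:
    """Run the scripted steps, calling tools and recording a readable trace."""
    trace: List[str] = []
    for thought, action, action_input in script:
        trace.append(f"Thought: {thought}")
        if action == "finish":
            trace.append(f"Answer: {action_input}")
            return action_input, trace
        observation = tools[action](action_input)
        trace.append(f"Action: {action}({action_input!r})")
        trace.append(f"Observation: {observation}")
    return "", trace  # the script ended without a final answer


answer, trace = react_loop(SCRIPT, TOOLS)
print("\n".join(trace))
```

The trace it prints is a miniature version of the reasoning log you'll see from the real agent later on.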

The most important part here is the system_prompt. This is where we give the agent its core instructions—its mission statement.

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context

agent = ReActAgent(
    tools=[retrieve_evidence, score_answer],
    llm=Settings.llm,
    system_prompt="""
    Always retrieve evidence first.
    Produce a structured answer.
    Evaluate the answer and revise once if scores are low.
    """,
    verbose=True
)

ctx = Context(agent)

Look at that prompt. It's only three lines, yet it spells out a clear four-step workflow:

  1. Retrieve: Find the facts.
  2. Answer: Formulate a response based on those facts.
  3. Evaluate: Use the score_answer tool to check your work.
  4. Revise: If the scores are bad, try again.

By setting verbose=True, we're telling the agent to "think out loud," so we can watch its entire reasoning process.
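Here's the four-step workflow above written out as plain control flow. It's a sketch under stated assumptions, not what ReActAgent does internally: in the real agent the LLM decides when to call each tool, whereas here the order is hard-coded. The retrieve, draft, and score callables, the 0.5 threshold, and the one-revision cap are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Tuple


def answer_with_review(
    question: str,
    retrieve: Callable[[str], str],
    draft: Callable[[str, str, str], str],
    score: Callable[[str, str, str], float],
    threshold: float = 0.5,
    max_revisions: int = 1,
) -> Tuple[str, int]:
    """Retrieve, answer, evaluate, and revise at most `max_revisions` times."""
    evidence = retrieve(question)              # 1. Retrieve
    answer = draft(question, evidence, "")     # 2. Answer
    revisions = 0
    # 3. Evaluate, then 4. Revise while the score is low and budget remains
    while score(question, answer, evidence) < threshold and revisions < max_revisions:
        answer = draft(question, evidence, answer)
        revisions += 1
    return answer, revisions


# Stub components so the loop can run without any API calls.
def fake_retrieve(q: str) -> str:
    return "A robust workflow follows retrieve, answer, evaluate, and revise steps."


def fake_draft(q: str, evidence: str, previous: str) -> str:
    # First draft ignores the evidence; the revision sticks to it.
    return evidence if previous else "Workflows are nice."


def fake_score(q: str, answer: str, evidence: str) -> float:
    return 1.0 if answer in evidence else 0.0


final, n_revisions = answer_with_review(
    "What does a robust workflow look like?", fake_retrieve, fake_draft, fake_score
)
print(n_revisions, final)
```

Capping revisions matters: without a limit, an agent that can never satisfy its own grader would loop (and bill you) forever.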

Let's See It in Action!

The stage is set. The agent is built. It's time for the main event. Let's give it a task and watch it work. We'll ask it to design a reliable RAG workflow—a topic it should know something about, thanks to our little knowledge base.

async def run_brief(topic: str):
    q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
    handler = agent.run(q, ctx=ctx)
    
    async for ev in handler.stream_events():
        print(getattr(ev, "delta", ""), end="")
        
    res = await handler
    return str(res)

topic = "RAG agent reliability and evaluation"
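# asyncio.run also works inside a notebook here because nest_asyncio.apply() was called earlier;
# asyncio.get_event_loop() is deprecated when no loop is running on newer Python versions
result = asyncio.run(run_brief(topic))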

print("\n\nFINAL OUTPUT\n")
print(result)

When you run this, you won't just get a final answer. You'll see a stream of text showing the agent's internal monologue. You’ll see it decide to use the retrieve_evidence tool, see the evidence it finds, watch it draft an answer, and then see it call the score_answer tool to grade itself.
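The getattr(ev, "delta", "") trick in run_brief is what lets one loop handle a mixed stream: events that carry a text delta get printed, and everything else contributes an empty string. Here's a self-contained sketch of that pattern. The two event classes and fake_stream are invented for illustration and are not LlamaIndex types.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class TextChunk:
    """Hypothetical event that carries a piece of streamed text."""
    delta: str


@dataclass
class ToolCalled:
    """Hypothetical event with no `delta` attribute at all."""
    tool_name: str


async def fake_stream() -> AsyncIterator[object]:
    yield TextChunk("Thought: I need evidence. ")
    yield ToolCalled("retrieve_evidence")
    yield TextChunk("Answer: done.")


async def collect_text(events: AsyncIterator[object]) -> str:
    """Concatenate text deltas, silently skipping events without one."""
    parts: List[str] = []
    async for ev in events:
        # `or ""` also guards against an event whose delta is None
        parts.append(getattr(ev, "delta", "") or "")
    return "".join(parts)


text = asyncio.run(collect_text(fake_stream()))
print(text)
```

If you'd rather log tool calls too, an isinstance check on the event type is the natural next step.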

This transparency is huge. You're no longer dealing with a black box. You can see how the AI arrived at its conclusion, which is the first step toward building systems we can actually trust.

What we've built here is a blueprint for more responsible AI. By separating the steps of retrieval, synthesis, and evaluation, we've created a loop that forces the agent to be more deliberate and grounded in facts. It's a simple concept with a real limit: the self-check is itself an LLM judgment, so it reduces unsupported claims rather than eliminating them. Still, we're moving away from chatbots that guess and toward agents that show their evidence and grade their own work. And that's a very exciting place to be.

