We’ve all been there. You ask an AI a question, and it spits back an answer that’s beautifully written, incredibly detailed, and… completely, utterly wrong.
It’s called a “hallucination,” and it’s one of the most frustrating and dangerous parts of working with large language models today. They state falsehoods with the same unblinking confidence as well-established facts.
But what if we could teach an AI to have a little humility? What if it could recognize when it’s on shaky ground and tell you, “Hey, I’m only about 60% sure about this”? And what if, after admitting that, it could automatically go find a better answer?
That’s exactly what we’re going to build today. We're moving beyond a simple Q&A bot to create a system that has a built-in reality check. It’s a three-step process that mimics how a smart human works: give an initial answer, pause to self-critique, and if needed, hit the books (or in this case, the web).
Let's get our hands dirty and build an AI that knows what it doesn’t know.
First, Let's Get Our Tools in Order
Before we can build our super-smart AI, we need to set up our workshop. This is just basic setup: importing the libraries we'll need and getting our OpenAI API key ready.
Think of it like laying out your ingredients before you start cooking. We'll need tools for talking to the OpenAI API (the openai package), searching the web (ddgs, a DuckDuckGo search client), and making the output look nice and clean in our terminal (rich). All three install with pip.
Here’s the initial setup code:
import os, json, re, textwrap, getpass, sys, warnings
from dataclasses import dataclass, field
from typing import Optional

from openai import OpenAI
from ddgs import DDGS
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich import box

warnings.filterwarnings("ignore", category=DeprecationWarning)


def _get_api_key() -> str:
    # (Code to securely get OpenAI API key)
    # ... [full code from original]
    key = getpass.getpass(" Enter your OpenAI API key: ").strip()
    if not key:
        Console().print("[bold red]No API key provided, exiting.[/bold red]")
        sys.exit(1)
    return key


OPENAI_API_KEY = _get_api_key()
MODEL = "gpt-4o-mini"   # A great, fast model for this
CONFIDENCE_LOW = 0.55   # Our threshold for triggering a web search
CONFIDENCE_MED = 0.80

client = OpenAI(api_key=OPENAI_API_KEY)
console = Console()


@dataclass
class LLMResponse:
    question: str
    answer: str
    confidence: float
    reasoning: str
    sources: list[str] = field(default_factory=list)
    researched: bool = False
    raw_json: dict = field(default_factory=dict)
We're also defining a handy little data structure called LLMResponse. This is just a neat container to hold everything together for each query: the question, the final answer, the confidence score, the AI's reasoning, and any sources it found along the way. It keeps our code clean and organized.
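To make the container concrete, here is a minimal sketch of how an LLMResponse might be created and updated. The class is repeated so the snippet runs on its own, and the question, answer, and URL values are made up for illustration rather than taken from a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class LLMResponse:
    question: str
    answer: str
    confidence: float
    reasoning: str
    sources: list[str] = field(default_factory=list)
    researched: bool = False
    raw_json: dict = field(default_factory=dict)


# Illustrative values only; nothing here comes from a real model call.
first = LLMResponse(
    question="What is the speed of light in a vacuum?",
    answer="299,792,458 metres per second.",
    confidence=0.98,
    reasoning="Defined constant in the SI system.",
)
second = LLMResponse(question="q", answer="a", confidence=0.4, reasoning="r")

# default_factory gives every instance its own list, so appending a source
# to one response never leaks into another.
second.sources.append("https://example.com/hypothetical-source")
second.researched = True

print(first.sources)   # []
print(second.sources)  # ['https://example.com/hypothetical-source']
```

The default_factory detail matters here because the pipeline mutates responses in place; a plain shared default list would mix sources across questions.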
Step 1: Teaching the AI to Be Honest About Its Confidence
The first and most important step is to get the LLM to stop just giving us an answer. We need it to give us an answer and tell us how confident it is.
We do this with a carefully crafted system prompt. A system prompt is like the AI's job description. You give it a role and a set of rules to follow. In our case, the number one rule is: BE HONEST.
Here’s the prompt we’ll use:
SYSTEM_UNCERTAINTY = """
You are an expert AI assistant that is HONEST about what it knows and doesn't know.
For every question you MUST respond with valid JSON only (no markdown, no prose outside JSON):
{
"answer": "<your best answer — thorough, factual>",
"confidence": <float 0.0-1.0>,
"reasoning": "<explain WHY you are or aren't confident; mention specific knowledge gaps>"
}
Confidence scale:
0.90-1.00 → very high: well-established fact, you are certain
0.75-0.89 → high: strong knowledge, minor uncertainty
0.55-0.74 → medium: plausible but you may be wrong, could be outdated
0.30-0.54 → low: significant uncertainty, answer is a best guess
0.00-0.29 → very low: mostly guessing, minimal reliable knowledge
Be CALIBRATED — do not always give high confidence. Genuinely reflect uncertainty about recent events (after your knowledge cutoff), niche topics, numerical claims, and anything that changes over time.
""".strip()
See what we did there? We're not just asking for an answer. We're forcing it to return a JSON object with three specific fields: the answer itself, a confidence score from 0.0 to 1.0, and the reasoning behind that score. We even give it a scale to help it calibrate its own sense of certainty.
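If you want to show the scale back to a user, one option is to map the numeric score to the same labels the prompt uses. This is only a sketch: confidence_band is a hypothetical helper that is not part of the article's code, and the cut-offs are copied from the scale in the prompt.

```python
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to the label used in the prompt's scale."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0.0 and 1.0, got {score}")
    if score >= 0.90:
        return "very high"
    if score >= 0.75:
        return "high"
    if score >= 0.55:
        return "medium"
    if score >= 0.30:
        return "low"
    return "very low"


for s in (0.97, 0.80, 0.60, 0.42, 0.10):
    print(f"{s:.2f} -> {confidence_band(s)}")
```

Raising on out-of-range values is a design choice for the sketch; clamping would work just as well if you prefer the pipeline to keep going.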
Now, let's wrap this in a function that sends the question and our special prompt to the model.
def query_llm_with_confidence(question: str) -> LLMResponse:
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=0.2,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_UNCERTAINTY},
            {"role": "user", "content": question},
        ],
    )
    raw = json.loads(completion.choices[0].message.content)
    return LLMResponse(
        question=question,
        answer=raw.get("answer", ""),
        confidence=float(raw.get("confidence", 0.5)),
        reasoning=raw.get("reasoning", ""),
        raw_json=raw,
    )
This function gives us our first response. It's the AI's best initial guess, along with its own self-assessed confidence. But we're not done yet.
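The function above trusts the model to return well-formed JSON with a numeric confidence. A more defensive version might look like the sketch below. The helper name parse_confidence_reply is hypothetical, and it keeps the same 0.5 fallback the original uses; it is one way to harden the parsing, not the article's own method.

```python
import json


def parse_confidence_reply(text: str, default_confidence: float = 0.5) -> dict:
    """Parse the model's JSON reply, tolerating malformed or partial output."""
    try:
        raw = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        raw = {}
    if not isinstance(raw, dict):
        # Valid JSON, but not the object we asked for (e.g. a list).
        raw = {}

    try:
        confidence = float(raw.get("confidence", default_confidence))
    except (TypeError, ValueError):
        # The model wrote something like "high" instead of a number.
        confidence = default_confidence

    # Clamp to the 0.0-1.0 range the prompt asks for.
    confidence = min(1.0, max(0.0, confidence))

    return {
        "answer": str(raw.get("answer", "")),
        "confidence": confidence,
        "reasoning": str(raw.get("reasoning", "")),
    }


print(parse_confidence_reply('{"answer": "x", "confidence": 1.7, "reasoning": "r"}'))
print(parse_confidence_reply("not json at all"))
```

Note that a fallback of 0.5 sits below the 0.55 research threshold, so an unparseable reply would end up triggering the web search later in the pipeline, which seems like a reasonable failure mode.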
Step 2: Adding a "Wait, Let Me Double-Check" Moment
This is where our system gets really clever. Instead of just taking the AI's first answer, we're going to make it perform a self-evaluation. It’s the AI equivalent of pausing and thinking, "Hold on, does that answer actually make sense?"
We'll create another function that takes the initial response and asks the model to critique it.
def self_evaluate(response: LLMResponse) -> LLMResponse:
    critique_prompt = f"""
Review this answer and its stated confidence. Check for:
1. Logical consistency
2. Whether the confidence matches the actual quality of the answer
3. Any factual errors you can spot
Question: {response.question}
Proposed answer: {response.answer}
Stated confidence: {response.confidence}
Stated reasoning: {response.reasoning}
Respond in JSON: {{
"revised_confidence": <float, adjusted if the self-check changes your view>,
"critique": "<brief critique of the answer quality>",
"revised_answer": "<improved answer, or repeat original if fine>"
}}
""".strip()
    completion = client.chat.completions.create(
        model=MODEL,
        temperature=0.1,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a rigorous self-critic. Respond in JSON only."},
            {"role": "user", "content": critique_prompt},
        ],
    )
    ev = json.loads(completion.choices[0].message.content)

    # Update the response with the self-evaluation results
    response.confidence = float(ev.get("revised_confidence", response.confidence))
    response.answer = ev.get("revised_answer", response.answer)
    response.reasoning += f"\n\n[Self-Eval Critique]: {ev.get('critique', '')}"
    return response
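This self_evaluate function asks the LLM to act as its own worst critic. It might lower its confidence score after realizing its reasoning was weak, or it might slightly rephrase its answer to be more accurate. Because the critic is the same model that wrote the answer, this step tends to catch inconsistencies and overconfidence rather than facts the model never learned, which is exactly why the next step brings in outside evidence.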
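One thing to watch for when merging the critique back in: if the model returns an empty revised_answer or a non-numeric revised_confidence, a direct assignment would wipe the answer or crash. The sketch below shows one possible guard. merge_self_eval is a hypothetical helper working on plain values, not something from the article's code.

```python
def merge_self_eval(answer: str, confidence: float, ev: dict) -> tuple[str, float]:
    """Return (answer, confidence) after applying a self-evaluation dict."""
    # Only accept a revised answer that is a non-empty string.
    revised_answer = ev.get("revised_answer")
    if isinstance(revised_answer, str) and revised_answer.strip():
        answer = revised_answer.strip()

    # Fall back to the previous confidence if the revised one is unusable.
    try:
        revised = float(ev.get("revised_confidence", confidence))
    except (TypeError, ValueError):
        revised = confidence

    # Keep the score inside the 0.0-1.0 range.
    return answer, min(1.0, max(0.0, revised))


print(merge_self_eval("orig", 0.8, {"revised_answer": "", "revised_confidence": 0.4}))
# ('orig', 0.4)
print(merge_self_eval("orig", 0.8, {"revised_answer": "better", "revised_confidence": "n/a"}))
# ('better', 0.8)
```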
Step 3: When in Doubt, Google It
So, what do we do if, after self-evaluation, the AI's confidence is still low? We send it to the internet, of course!
If the confidence score is below our threshold (CONFIDENCE_LOW, which we set to 0.55 earlier), our system will automatically trigger a web search to gather fresh, external information. In terms of the prompt's scale, that means anything rated low or very low. This is crucial for questions about recent events or niche topics that might not be in the model's training data.
First, a simple function to perform a web search using DuckDuckGo:
def web_search(query: str, max_results: int = 5) -> list[dict]:
    results = DDGS().text(query, max_results=max_results)
    return list(results) if results else []
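Search results can contain duplicate URLs, entries without a link, or very long bodies that bloat the prompt. As an optional extra, the sketch below tidies a result list before it is handed to the model. clean_results and its 300-character limit are assumptions for illustration; the title, body, and href keys are the ones the article's code reads from each result.

```python
def clean_results(results: list[dict], max_body_chars: int = 300) -> list[dict]:
    """Drop link-less and duplicate results, and truncate long bodies."""
    seen = set()
    cleaned = []
    for r in results:
        url = (r.get("href") or "").strip()
        if not url or url in seen:
            continue  # skip entries we cannot cite or have already kept
        seen.add(url)

        body = (r.get("body") or "").strip()
        if len(body) > max_body_chars:
            body = body[:max_body_chars].rstrip() + "..."

        cleaned.append({
            "title": (r.get("title") or "").strip(),
            "body": body,
            "href": url,
        })
    return cleaned


# Made-up sample data in the same shape as the search results.
sample = [
    {"title": "A", "body": "x" * 400, "href": "https://example.com/a"},
    {"title": "A again", "body": "dup", "href": "https://example.com/a"},
    {"title": "No link", "body": "y"},
]
cleaned = clean_results(sample)
print(len(cleaned))             # 1
print(len(cleaned[0]["body"]))  # 303
```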
Next, we create a function that takes the low-confidence answer and the search results and synthesizes them into a new, better answer. We use another system prompt to guide the AI, telling it to act as a "research synthesizer."
# SYSTEM_SYNTHESIS prompt from the original article goes here...

def research_and_synthesize(response: LLMResponse) -> LLMResponse:
    console.print(f" [yellow] Confidence {response.confidence:.0%} is low, triggering auto-research...[/yellow]")
    snippets = web_search(response.question)
    if not snippets:
        console.print(" [red]No search results found.[/red]")
        return response

    # Format the search snippets for the prompt
    formatted = "\n\n".join(
        f"[{i+1}] {s.get('title','')}\n{s.get('body','')}\nURL: {s.get('href','')}"
        for i, s in enumerate(snippets)
    )
    synthesis_prompt = f"""
Question: {response.question}
Preliminary answer (low confidence): {response.answer}
Web search snippets:
{formatted}
Synthesize an improved answer using the evidence above.
""".strip()

    # ... [call to OpenAI API with SYSTEM_SYNTHESIS prompt] ...
    # (the elided call assigns its result to `completion`)

    # Update the response with the new, synthesized answer
    syn = json.loads(completion.choices[0].message.content)
    response.answer = syn.get("answer", response.answer)
    response.confidence = float(syn.get("confidence", response.confidence))
    response.reasoning += f"\n\n[Post-Research]: {syn.get('reasoning', '')}"
    response.sources = [s.get("href", "") for s in snippets if s.get("href")]
    response.researched = True
    return response
This function is the safety net. When the AI is unsure, it doesn't just give up. It actively seeks out new information and integrates it to provide a much more reliable and well-grounded answer.
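The listing above leaves out both the SYSTEM_SYNTHESIS prompt and the API call. As a rough idea of what that gap could look like, here is a sketch that builds the request arguments in the same style as the earlier calls. Everything in it is an assumption: the prompt text is a made-up stand-in (hence the _SKETCH suffix), the temperature is a guess, and build_synthesis_request is a hypothetical helper.

```python
# HYPOTHETICAL stand-in: the article's actual SYSTEM_SYNTHESIS text is not shown.
SYSTEM_SYNTHESIS_SKETCH = (
    "You are a research synthesizer. Answer the question using the numbered "
    "web snippets as evidence, and say so when they do not settle it. "
    "Respond with valid JSON only: "
    '{"answer": "<string>", "confidence": <float 0.0-1.0>, "reasoning": "<string>"}'
)


def build_synthesis_request(model: str, synthesis_prompt: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "temperature": 0.2,  # assumed; mirrors the first query in this article
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": SYSTEM_SYNTHESIS_SKETCH},
            {"role": "user", "content": synthesis_prompt},
        ],
    }


request = build_synthesis_request(
    "gpt-4o-mini",
    "Question: ...\nWeb search snippets:\n[1] ...",
)
print(request["messages"][0]["role"])  # system
```

Inside research_and_synthesize, the result could then be obtained with completion = client.chat.completions.create(**build_synthesis_request(MODEL, synthesis_prompt)). The stand-in prompt asks for the same answer, confidence, and reasoning keys that the function reads back out.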
Putting It All Together: The Full Pipeline
Now we just need one main function to orchestrate our three-step process. This function will:
- Get the initial answer and confidence.
- Perform the self-evaluation.
- Check the confidence, and if it's low, trigger the research step.
def uncertainty_aware_query(question: str) -> LLMResponse:
    # Step 1: Initial query
    response = query_llm_with_confidence(question)

    # Step 2: Self-evaluation
    response = self_evaluate(response)

    # Step 3: Research if confidence is low
    if response.confidence < CONFIDENCE_LOW:
        response = research_and_synthesize(response)

    return response
And that’s it! With these components, we have a complete, uncertainty-aware reasoning pipeline.
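If you'd like to check the control flow without spending API credits, one approach is to pass the three steps in as plain functions and swap in fakes. The sketch below does that with dictionaries instead of LLMResponse objects. run_pipeline and the fake_* functions are hypothetical stand-ins with made-up numbers, not real model output.

```python
CONFIDENCE_LOW = 0.55  # same threshold as in the setup code


def run_pipeline(question, ask, critique, research, threshold=CONFIDENCE_LOW):
    """Same three steps as uncertainty_aware_query, with the steps injected."""
    response = ask(question)
    response = critique(response)
    if response["confidence"] < threshold:
        response = research(response)
    return response


# Fakes that stand in for the three API-backed functions.
def fake_ask(question):
    return {"question": question, "answer": "draft", "confidence": 0.7, "researched": False}


def fake_critique(response):
    # Pretend the self-check found problems and lowered the score.
    return {**response, "confidence": response["confidence"] - 0.3}


def fake_research(response):
    return {**response, "answer": "grounded", "confidence": 0.85, "researched": True}


result = run_pipeline("anything", fake_ask, fake_critique, fake_research)
print(result["researched"], result["confidence"])  # True 0.85

# With a critic that leaves the score at 0.7, research is skipped.
unchanged = run_pipeline("anything", fake_ask, lambda r: r, fake_research)
print(unchanged["researched"])  # False
```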
Let's See It in Action
Theory is great, but let's see how this system handles a few tricky questions. We'll ask it some things it should know, and some things that are designed to trip it up.
Here are our test questions:
- "What is the speed of light in a vacuum?" (A well-known fact)
- "What were the main causes of the 2008 global financial crisis?" (A complex, nuanced topic)
- "What is the latest version of Python released in 2025?" (A trick question about the future)
- "What is the current population of Tokyo as of 2025?" (Another trick question requiring live data)
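A small loop is enough to run the whole list and print one summary line per question. The sketch below uses a stand-in pipeline so it runs without an API key; in the real script you would pass a function that calls uncertainty_aware_query and returns its confidence and researched fields. The helper names and the stand-in's numbers are placeholders, not measured results.

```python
QUESTIONS = [
    "What is the speed of light in a vacuum?",
    "What were the main causes of the 2008 global financial crisis?",
    "What is the latest version of Python released in 2025?",
    "What is the current population of Tokyo as of 2025?",
]


def format_row(question: str, confidence: float, researched: bool, width: int = 48) -> str:
    """Build one fixed-width summary line for a question."""
    q = question if len(question) <= width else question[: width - 3] + "..."
    flag = "researched" if researched else "direct"
    return f"{q:<{width}}  {confidence:>4.0%}  {flag}"


def run_all(questions, pipeline):
    """pipeline(question) must return a (confidence, researched) tuple."""
    return [format_row(q, *pipeline(q)) for q in questions]


def stand_in_pipeline(question):
    # Placeholder behaviour only: pretend 2025 questions needed research.
    return (0.40, True) if "2025" in question else (0.95, False)


rows = run_all(QUESTIONS, stand_in_pipeline)
for row in rows:
    print(row)
```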
When we run these, we can see the routing behave as intended. For the speed of light, it gives a high-confidence answer right away. For the financial crisis, it might give a medium-confidence answer.
But for the questions about 2025, which fall past the model's training data, it should (correctly!) report low confidence and trigger the web search. If the search doesn't turn up anything useful, it says so rather than inventing an answer. Instead of hallucinating, it admits its own limitations. That's a huge win.
More Than Just a Cool Project
What we've built here is a foundational piece for creating more responsible and trustworthy AI. It's about shifting from AI that just gives answers to AI that can reason about the quality of its own answers.
This kind of self-awareness is critical for any real-world application where the stakes are high, whether it's in medicine, finance, or research. By giving our models the ability to say "I don't know" and the tools to find out, we're building a future where we can rely on AI to be a helpful, and honest, partner.
Go ahead, run the code and try to stump it with your own questions. You might be surprised at how much smarter an AI can be when it's programmed to be a little less certain.




