Aicosoft - AI & Technology News, Insights & Innovation

So, you’ve built an amazing app powered by a Large Language Model (LLM). It’s clever, it’s powerful, and when it works, it feels like magic.

But what about when it doesn’t work?

When a user gets a weird response, or the whole thing just grinds to a halt, where do you even start looking? Unlike the software we’re used to, LLMs are a bit of a black box. They’re probabilistic, meaning they don’t follow a clear, predictable path from A to B. You can’t just step through the code and find the bug.

This is a huge problem. We’re moving past the "cool demo" phase of AI and into a world where these systems are running critical parts of our businesses. And just like any other piece of production software, we need to be able to see what’s going on inside. We need observability.

What is AI Observability, Anyway?

In traditional software, we’ve had this figured out for years. We use logs, metrics, and tracing to understand what our applications are doing. AI observability is the same idea, but adapted for the unique weirdness of AI.

It’s about tracking things like:

Token usage: How much is this thing actually costing us to run?
Response quality: Are the answers helpful, or are they nonsense?
Latency: How long are users waiting for a response?
Model drift: Is the model slowly getting worse over time?

Without this visibility, you’re essentially flying blind. You’re hoping for the best, and that’s just not a sustainable way to run a real product.
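To make those signals concrete, here is a minimal sketch of wrapping a single model call so that token counts, latency, and an estimated cost get recorded alongside the answer. Everything in it is an assumption for illustration: `fake_llm` stands in for a real model client, `observe_call` and `CallRecord` are made-up names, and the price constant is invented, so swap in your provider's real numbers.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical price for illustration only; check your provider's real rates.
PRICE_PER_1K_TOKENS_USD = 0.002


@dataclass
class CallRecord:
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def cost_usd(self) -> float:
        return self.total_tokens / 1000 * PRICE_PER_1K_TOKENS_USD


def observe_call(
    llm_fn: Callable[[str], Tuple[str, int, int]], prompt: str
) -> Tuple[str, CallRecord]:
    """Run llm_fn(prompt) and record its latency and token counts.

    llm_fn is assumed to return (text, prompt_tokens, completion_tokens).
    """
    start = time.perf_counter()
    text, prompt_tokens, completion_tokens = llm_fn(prompt)
    latency_s = time.perf_counter() - start
    return text, CallRecord(prompt_tokens, completion_tokens, latency_s)


def fake_llm(prompt: str) -> Tuple[str, int, int]:
    # Stand-in for a real model call; "tokens" here are a crude word count.
    answer = "shortlist"
    return answer, len(prompt.split()), len(answer.split())


text, record = observe_call(fake_llm, "Screen this resume please")
print(text, record.total_tokens, record.cost_usd, record.latency_s)
```

Response quality and drift need more than a timer, but the pattern is the same: capture the measurement at the moment of the call and store it somewhere you can query later.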

Let's Unpack the Black Box with an Example

To make this feel less abstract, let’s imagine we’ve built an AI-powered resume screening tool. A recruiter uploads a PDF, and the system spits out a recommendation: "shortlist," "reject," or "needs review."

Simple enough, right? But under the hood, it’s not just one step. It’s a whole pipeline, a mini assembly line. And if all you see is the final output ("reject"), you have no idea why a great candidate might have been tossed aside.

This is where the two core concepts of observability come in: traces and spans.

Think of a trace as the entire journey of a single resume through our system. From the moment the recruiter clicks "upload" to the final recommendation appearing on screen, that entire end-to-end process is one trace. It gets a unique ID, so we can follow its complete story.

But the real magic happens inside the trace, with spans. A span is one specific step or operation along that journey.

Our resume screener’s assembly line might look like this:

Span 1: The Upload. The file comes in. This span just logs the basics: when it arrived, its size, its format. This is where our trace begins.
Span 2: The Parsing. The system has to turn that messy PDF or Word doc into clean, structured text. This span tracks how long that takes and, crucially, whether it fails. A weirdly formatted resume could break everything right here.
Span 3: Feature Extraction. Now the AI reads the text and pulls out the important stuff: skills, years of experience, past job titles. This span shows us what the AI thinks it found.
Span 4: The Scoring. The extracted features are fed into a model that calculates a score. This is often the most expensive, compute-intensive part. This span logs how long it took and the confidence of its score.
Span 5: The Decision. Based on the score, the system makes its final call. This span records that final output and how long it took to generate.
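The assembly line above can be sketched in a few dozen lines of plain Python. This is a toy, not how any particular tracing library works: the `Trace` and `Span` classes, the attribute names, the two-skill keyword list, and the score thresholds are all assumptions made up for this example. The point is only to show one trace ID tying together five timed spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    name: str
    duration_s: float = 0.0
    status: str = "ok"
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    # One unique ID per resume, shared by every span in its journey.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        current = Span(name)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            # Record the failure on the span, then let the error propagate.
            current.status = "error"
            current.attributes["error"] = repr(exc)
            raise
        finally:
            current.duration_s = time.perf_counter() - start
            self.spans.append(current)


def screen_resume(raw_text: str) -> Trace:
    trace = Trace()
    with trace.span("upload") as s:
        s.attributes["size_bytes"] = len(raw_text.encode("utf-8"))
    with trace.span("parsing") as s:
        words = raw_text.lower().split()
        s.attributes["word_count"] = len(words)
    with trace.span("feature_extraction") as s:
        # Toy extraction: look for two hard-coded skills.
        skills = sorted({"python", "sql"} & set(words))
        s.attributes["skills"] = skills
    with trace.span("scoring") as s:
        score = len(skills) / 2  # toy score between 0 and 1
        s.attributes["score"] = score
    with trace.span("decision") as s:
        if score >= 1.0:
            decision = "shortlist"
        elif score >= 0.5:
            decision = "needs review"
        else:
            decision = "reject"
        s.attributes["decision"] = decision
    return trace


trace = screen_resume("Python and SQL developer")
for s in trace.spans:
    print(trace.trace_id, s.name, s.status, s.attributes)
```

In a real system you would likely hand this job to an existing tracing library rather than roll your own, but the shape of the data (one trace, many named spans, each with a duration, a status, and some attributes) stays the same.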

So, Why Should You Care About Spans?

Without this level of detail, a bad recommendation is just a mystery. You’d have to guess what went wrong.

Was the PDF unreadable? Did the AI miss the "Python" keyword during extraction? Did the scoring model behave strangely?

With span-level observability, you don't have to guess. You can pinpoint the exact stage where things went off the rails. You can see that the parsing span is taking way too long, or that the feature extraction span is consistently missing key skills from certain resume templates.

This isn't just for debugging. It’s also about money and performance. By looking at your spans, you might realize that 90% of your cloud bill is coming from the scoring step. Maybe there’s a way to optimize that specific model without touching the rest of the pipeline. You can finally see where your time and money are actually going.
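Here is a rough sketch of that kind of analysis, assuming you have already exported your spans as simple (span name, cost) pairs. The function name `cost_share_by_span` and the numbers are invented for illustration; they are chosen so the scoring step lands at the 90% mark mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_share_by_span(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost per span name and return each span's share of the total."""
    totals: Dict[str, float] = defaultdict(float)
    for name, cost in records:
        totals[name] += cost
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: cost / grand_total for name, cost in totals.items()}


# Made-up costs (in dollars) for two resumes passing through the pipeline.
records = [
    ("parsing", 0.5), ("feature_extraction", 0.5), ("scoring", 9.0),
    ("parsing", 0.5), ("feature_extraction", 0.5), ("scoring", 9.0),
]

shares = cost_share_by_span(records)
most_expensive = max(shares, key=shares.get)
print(most_expensive, shares[most_expensive])
```

The same grouping works for latency: swap the cost column for span durations and you get a picture of where users' waiting time goes.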

The Real-World Payoffs of Seeing Clearly

When you get this right, AI observability delivers three huge benefits that every team needs.

You Keep Your Costs in Check. AI can get expensive, fast. Observability shows you exactly which parts of your system are guzzling resources. You can spot inefficiencies and make targeted optimizations instead of just throwing more money at the problem.
You Stay Out of Trouble (Compliance). For something like a resume screener, you absolutely need an audit trail. A well-instrumented pipeline gives you a record of every input, every decision, and every timestamp. If a regulator asks how a certain candidate was processed, you can show exactly what the system did at each step, which is the evidence you need to back up any claim that it is fair and compliant.
Your AI Actually Gets Better Over Time. Models degrade. The world changes, new resume formats appear, and the skills people list evolve. Observability helps you detect this "drift" early. You can see when performance on a specific component starts to dip and fix it before it becomes a systemic problem, ensuring your AI stays sharp.
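One very simple way to catch drift, sketched below, is to compare the average of a recent window of quality scores against an earlier baseline. This is a deliberately naive check under stated assumptions: the `has_drifted` function, the window sizes, and the threshold are all hypothetical, and real drift detection usually calls for proper statistical tests.

```python
from statistics import mean
from typing import Sequence


def has_drifted(
    scores: Sequence[float],
    baseline_size: int,
    window: int,
    max_drop: float,
) -> bool:
    """Return True if the recent average fell more than max_drop below baseline.

    scores is assumed to be ordered oldest to newest.
    """
    if len(scores) < baseline_size + window:
        return False  # not enough history to compare yet
    baseline = mean(scores[:baseline_size])
    recent = mean(scores[-window:])
    return (baseline - recent) > max_drop


# Made-up quality scores: stable at first, then a noticeable dip.
history = [0.9] * 10 + [0.7] * 5
print(has_drifted(history, baseline_size=10, window=5, max_drop=0.1))
```

Run a check like this per span rather than only on the final output, and you can tell whether it is the parsing, the extraction, or the scoring that started slipping.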

Getting Started: A Few Open-Source Tools to Know

Okay, this all sounds great, but how do you actually implement it? The good news is you don’t have to build everything from scratch. There’s a growing ecosystem of open-source tools designed for this exact problem.

Here are a few popular ones to check out:

Langfuse

Langfuse has gotten really popular, and for good reason. It’s an all-in-one platform for LLM development that gives you end-to-end tracing, tools to evaluate your model's outputs (with human or AI feedback), and dashboards to monitor cost and performance. The big selling point is that it’s model- and framework-agnostic, so you can plug it into your existing stack whether you're calling OpenAI directly, building on LangChain, or using something else.

Arize Phoenix

Arize has been in the ML monitoring game for a while, and Phoenix is their open-source project focused specifically on LLMs. It’s built on OpenTelemetry (a widely used standard), so it plays nicely with other tools. Phoenix is great for deep debugging and even has built-in features to help you detect hallucinations, which are a massive headache for anyone building with LLMs. It’s a solid choice if you want something powerful you can host yourself.

TruLens

TruLens takes a slightly different approach. It’s less focused on the infrastructure and more on the quality of the LLM's responses. You attach what it calls "feedback functions" to your LLM calls to automatically score things like relevance, coherence, or toxicity. Think of it as a lightweight, automated quality checker that helps you make sure your AI’s answers are actually good, not just fast.
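To give a feel for the idea, here is a generic sketch of feedback functions in plain Python. To be clear, this is not the library's actual API: `keyword_relevance`, `within_length`, and `evaluate` are hypothetical names, and the scoring rules are crude stand-ins for what a real evaluator (often another model) would do.

```python
from typing import Callable, Dict

# A feedback function takes (prompt, response) and returns a score in [0, 1].
FeedbackFn = Callable[[str, str], float]


def keyword_relevance(prompt: str, response: str) -> float:
    """Fraction of distinct prompt words that also appear in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    if not prompt_words:
        return 0.0
    return len(prompt_words & response_words) / len(prompt_words)


def within_length(prompt: str, response: str) -> float:
    """1.0 if the response is 50 words or fewer, otherwise 0.0."""
    return 1.0 if len(response.split()) <= 50 else 0.0


def evaluate(
    prompt: str, response: str, feedback_fns: Dict[str, FeedbackFn]
) -> Dict[str, float]:
    """Score one prompt/response pair with every feedback function."""
    return {name: fn(prompt, response) for name, fn in feedback_fns.items()}


scores = evaluate(
    "python skills",
    "Strong Python skills listed",
    {"relevance": keyword_relevance, "length": within_length},
)
print(scores)
```

Store these scores next to the trace for each request, and response quality becomes something you can chart and alert on, just like latency.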

The bottom line is this: as AI becomes a fundamental part of our software, we can't afford to treat it like a magic box we don't understand. Gaining visibility into how these systems think, work, and fail isn't just a "nice-to-have" for data scientists anymore. It’s becoming a core discipline for building AI products that are reliable, fair, and worthy of our users' trust.