Aicosoft - AI & Technology News, Insights & Innovation

Building an app with a Large Language Model (LLM) is an amazing experience. That first time you get a clever response back, it feels like magic. But then you try to move from a cool demo to a real product, and things get... messy.

How do you know what the model is actually doing? When a user gets a bad answer, how do you trace back the exact inputs, prompts, and steps that led to it? Suddenly, your magical black box feels more like a frustrating, opaque one.

This is where LLM observability comes in, and it's what separates the hobby projects from the production-grade applications. Today, we're going to roll up our sleeves and build a complete observability pipeline using Langfuse, an incredible open-source platform for LLM engineering. We'll cover everything from simple tracing to managing prompts, scoring outputs, and running structured experiments.

And the best part? You can follow along and understand every single feature without spending a dime on API calls. We've set this up to work with a real OpenAI key or a simple mock LLM, so you can learn the ropes risk-free. Let's get started.

First Things First: Getting Connected

Before we can build our pipeline, we need to get our credentials in order. Think of this as getting the keys to your new workshop. We need to install the necessary packages and tell our code how to talk to the Langfuse servers.

We’ll start by installing the langfuse and openai libraries. Then, you'll need to grab your Langfuse public and secret keys from your project settings. The script will ask for them, along with your region (EU or US).

One of the coolest things we'll do here is make the OpenAI key optional. If you provide one, great! We'll use a real model like GPT-4o Mini. If you don't, no problem at all. We'll use a built-in mock LLM that gives deterministic answers. This means you can trace the entire workflow and see how everything connects in Langfuse without making a single paid API call.

After you pop in your credentials, the code will do a quick authentication check. If you see a "Connected to Langfuse" message, you’re golden.

The Easiest Way to Start: Tracing with a Simple Decorator

So, what's the simplest way to see what your code is doing? In Langfuse, it’s a one-word answer: decorators.

If you're not a Python wizard, a decorator is just a little tag you put on top of a function (you'll see it as @observe()). It’s like putting a little GPS tracker on your code. Anytime that function runs, the decorator intercepts it and sends all the important details—inputs, outputs, errors, duration—straight to Langfuse.

Let's try it with a simple task: writing a one-sentence story.

We'll have one function, write_story, that calls our LLM. Then, we’ll have another function, story_pipeline, that calls write_story. By adding @observe() to both, we create a nested trace. When you run this, you’ll see the full-story pipeline in Langfuse, and you can click into it to see the individual LLM generation step.

It’s that easy. No complex setup, just one line of code to get instant visibility.

Let's Get Granular: Manually Tracing a RAG Pipeline

Decorators are great for simple cases, but sometimes you need more control. Let's build a small RAG (Retrieval-Augmented Generation) pipeline to see how you can trace more complex, multi-step processes.

If you're new to RAG, the idea is simple:

Retrieve: Find relevant information from a knowledge base.
Augment: Add that information to your prompt.
Generate: Ask the LLM to answer the user's question using the provided context.

We’ll create a tiny knowledge base about product refunds and warranties. Our retrieve function will find the right piece of text based on the user's question. Then, our rag_pipeline function will combine it all and ask the LLM for an answer.

Here’s the important part: we're not just tracing the whole thing as one big blob. We're tracing the retrieve step separately. This gives us a beautiful, granular view in Langfuse. You can see exactly how long the retrieval took, what documents it found, and then how that context was used in the final generation.

We're also adding some extra metadata, like user_id, session_id, and tags ("rag", "support-bot"). Think about how useful this is in a real application. If a user complains about a bad answer, you can filter by their user ID in Langfuse and see every single step of their interaction. It’s a debugging superpower.

Tired of Scattered Prompts? Let's Centralize Them

As your application grows, managing prompts becomes a real headache. You have system prompts, user prompts, and few-shot examples scattered across different files. If you want to change one, you have to hunt it down, and it's hard to track which version of a prompt was used for a specific generation.

Langfuse solves this with its built-in Prompt Management system.

We can create a "managed prompt" directly in Langfuse. We'll make one for a support agent, complete with variables like {{tone}} and {{company}}. This prompt now lives in a central, version-controlled hub.

In our code, we simply fetch the prompt by its name (langfuse.get_prompt("support-agent")). We can then "compile" it with our specific variables for that call (e.g., tone="friendly"). When we make the LLM call, we link it directly to the prompt object.

Now, when you look at the trace in Langfuse, it won't just show you the text of the prompt. It will show you a link to the exact version of the managed prompt that was used. This is a game-changer for experimenting and debugging. You can update your prompt in the Langfuse UI, and your application will automatically pull the latest version, all while keeping a perfect historical record.

Is Your LLM Actually Any Good? How to Score Your Outputs

Observability isn't just about seeing what happened; it's also about judging the quality of what happened. Was the LLM's answer good? Was it helpful? Was it even correct? This is where scoring comes in.

Langfuse lets you attach scores to any trace. These aren't just random numbers; they can be different data types to fit what you're measuring:

Numeric: A number, like a quality score from 1-10.
Categorical: A label, like "helpful" or "unhelpful."
Boolean: A simple true/false or 1/0, like "resolved_issue."

We'll go back to the RAG trace we generated earlier and add a few scores to it. For example, we can create a simple keyword_overlap function to check if the answer mentioned "5-7 business days." If it did, we give it a "groundedness" score of 1.0. We can also add a categorical score for user feedback and a boolean score for whether the issue was resolved.

You can also score things inline as they happen. We’ll demonstrate this by asking for the capital of Japan and immediately scoring the answer for correctness within the same observed function. All these scores will show up right alongside your traces in Langfuse, making it easy to see at a glance which generations are hitting the mark and which aren't.

Putting It to the Test: Running Structured Experiments

Okay, we can trace our calls and score them. Now for the really powerful part: running controlled experiments.

Tweaking a prompt and "feeling" like it's better isn't a strategy. You need data. Langfuse allows you to create datasets and run experiments to rigorously compare different models, prompts, or configurations.

Here’s how it works:

Create a Dataset: We'll build a small dataset of capital-city questions and their expected answers (e.g., "What is the capital of France?", "Paris").
Define a Task: This is the function that takes a dataset item (the question) and runs it through our LLM.
Define Evaluators: These are functions that score the results. We’ll create one for accuracy (does the output contain the right capital?) and another for conciseness (how long was the response?).

With these pieces in place, we can run an experiment. Langfuse will iterate through every item in our dataset, run the task, apply the evaluators, and collect all the results. At the end, it gives you a tidy summary with aggregate scores, like mean_accuracy.

This is how you make real, data-driven improvements to your LLM app. You can run one experiment with Prompt A, another with Prompt B, and compare the accuracy, cost, and latency side-by-side. No more guesswork.

Bonus Round: Plugging Langfuse into LangChain

A lot of you are probably using frameworks like LangChain to build your apps. Good news: Langfuse integrates seamlessly.

If you provided an OpenAI key, we can quickly show how this works. You just import the CallbackHandler from Langfuse and pass it into your LangChain invoke call. That's it. Langfuse will automatically capture the entire chain execution, breaking down each step for you. It’s an incredibly low-effort way to get deep insights into your chains.

See It All Come Together

After running all this code, we'll flush everything to Langfuse to make sure all the data is sent. Now for the fun part: head over to your Langfuse project in your browser.

You'll be able to see:

Traces: All our calls, from the simple story generator to the complex RAG pipeline, complete with user IDs and tags.
Prompts: Our versioned support-agent prompt, ready to be edited and tracked.
Scores: The groundedness, feedback, and accuracy scores attached to their specific traces.
Datasets: Our capital-cities dataset and the results from the experiment we just ran.

We've just built a complete, end-to-end workflow for LLM observability and evaluation. We moved from blind execution to having a structured, production-ready system for monitoring, debugging, and improving our application. This is the foundation for building reliable and high-quality AI products. Go ahead and explore—your data is waiting

A Practical Guide to LLM Observability: Building a Full Pipeline with Langfuse

First Things First: Getting Connected

The Easiest Way to Start: Tracing with a Simple Decorator

Let's Get Granular: Manually Tracing a RAG Pipeline

Tired of Scattered Prompts? Let's Centralize Them

Is Your LLM Actually Any Good? How to Score Your Outputs

Putting It to the Test: Running Structured Experiments

Bonus Round: Plugging Langfuse into LangChain

See It All Come Together

Tags

Source

Stay Updated

Related Articles

Stop Guessing: How to Actually Test Your RAG Apps with DeepEval

Tired of Messy LLM Outputs? How to Get Perfect JSON with Outlines and Pydantic

Your LLM is a Black Box. Here’s How to See Inside.

A Practical Guide to LLM Observability: Building a Full Pipeline with Langfuse

First Things First: Getting Connected

The Easiest Way to Start: Tracing with a Simple Decorator

Let's Get Granular: Manually Tracing a RAG Pipeline

Tired of Scattered Prompts? Let's Centralize Them

Is Your LLM Actually Any Good? How to Score Your Outputs

Putting It to the Test: Running Structured Experiments

Bonus Round: Plugging Langfuse into LangChain

See It All Come Together

Tags

Source

Stay Updated

Related Articles

Stop Guessing: How to Actually Test Your RAG Apps with DeepEval

Tired of Messy LLM Outputs? How to Get Perfect JSON with Outlines and Pydantic

Your LLM is a Black Box. Here’s How to See Inside.

Cookie Settings