Aicosoft - AI & Technology News, Insights & Innovation

It’s 3 AM. Your phone buzzes with an alert that makes your heart sink. Something is broken in production. You stumble to your laptop, eyes blurry, trying to piece together clues from a dozen different dashboards. Is it latency? Error rates? A bad deploy? The clock is ticking, and every minute of downtime feels like an eternity.

We’ve all been there. It’s a stressful, high-stakes scramble. But what if you had a team of AI agents that could handle the first response for you? An automated system that could not only detect the incident but also investigate the cause, synthesize the findings, and draft a postmortem before you’ve even had your first cup of coffee.

That sounds like science fiction, right? Well, it’s closer than you think. Today, I want to walk you through how we can build exactly that: a multi-agent system powered by Haystack that acts like a seasoned Site Reliability Engineering (SRE) team. This isn't just a simple chatbot. We're talking about a coordinated group of AI specialists that can reason, use tools, and solve complex problems, end-to-end.

Let’s build it together.

First, We Need to Create a Controlled Disaster

You can't test a firefighting team without a fire. To see our AI agents in action, we first need to create a realistic but completely synthetic production incident. We need data: the kind of metrics and logs you’d see in a real system.

So, we'll generate two main types of data for a fictional 24-hour period:

System Metrics: Think of this as the system's heartbeat. We'll track things like requests per second (RPS), 95th percentile latency (p95_ms), and the overall error rate. We’ll make it look real, with natural daily peaks and valleys.
System Logs: These are the detailed, moment-by-moment event records from different services like api-gateway, payments, and auth.

Here’s the crucial part: we’re going to deliberately inject an "incident" into this data. For a specific window of time, we’ll make the p95 latency spike, the error rate climb, and the request throughput drop. We'll also make the logs within this window reflect the problem, with more warnings, errors, and specific failure messages like UpstreamTimeout and DBConnPoolExhausted.
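To make that concrete, here is a minimal sketch of how the metrics side could be generated with nothing but the Python standard library. Everything specific in it is an assumption for illustration, not the values from our actual setup: the function name make_metrics, the baseline numbers, the start date, and the incident window (minutes 900 to 959 of the day).

```python
import math
import random
from datetime import datetime, timedelta


def make_metrics(start, minutes=24 * 60, incident=(900, 960), seed=7):
    """Return one row per minute with rps, p95_ms and error_rate.

    `incident` is a half-open range of minute offsets in which the
    synthetic outage is injected (illustrative values).
    """
    rng = random.Random(seed)  # seeded so the "disaster" is reproducible
    rows = []
    for i in range(minutes):
        # Daily cycle between 0 and 1: quiet at midnight, busiest at noon.
        daily = 0.5 + 0.5 * math.sin(2 * math.pi * i / (24 * 60) - math.pi / 2)
        rps = 200 + 300 * daily + rng.gauss(0, 10)
        p95_ms = 120 + 40 * daily + rng.gauss(0, 5)
        error_rate = max(0.0, 0.005 + rng.gauss(0, 0.001))

        in_incident = incident[0] <= i < incident[1]
        if in_incident:
            p95_ms *= 4         # latency spikes
            error_rate += 0.08  # error rate climbs
            rps *= 0.6          # throughput drops

        rows.append({
            "ts": start + timedelta(minutes=i),
            "rps": rps,
            "p95_ms": p95_ms,
            "error_rate": error_rate,
            "in_incident": in_incident,
        })
    return rows


metrics = make_metrics(datetime(2024, 1, 1))
```

The in_incident flag is the ground truth we keep for ourselves. The agents never see it, which is what lets us check their answer afterwards.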

By creating this synthetic world, we have a perfect, reproducible playground to test our AI agents. We know exactly when and how the incident happened, so we can fairly judge how well our AI team performs. We'll load all this data into DuckDB, an in-process analytical database that can run entirely in memory and is well suited to the kind of analytical queries our agents will need to run.
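Here is a rough sketch of the "load it, then ask questions in SQL" idea. To keep it runnable with only the standard library, it swaps in sqlite3 as a stand-in for DuckDB; this is not the DuckDB API, although the connect, execute, fetch flow is similar in spirit. The table layout, the sample rows, the SlowResponse message, and the run_query helper are all assumptions made up for the example.

```python
import sqlite3

# In-memory database (sqlite3 used here as a stand-in for DuckDB).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (ts TEXT, service TEXT, level TEXT, message TEXT)")

# A handful of hypothetical log rows around an incident at 15:00-16:00.
sample_logs = [
    ("2024-01-01T14:59:00", "api-gateway", "INFO", "OK"),
    ("2024-01-01T15:01:00", "payments", "ERROR", "DBConnPoolExhausted"),
    ("2024-01-01T15:02:00", "api-gateway", "ERROR", "UpstreamTimeout"),
    ("2024-01-01T15:03:00", "payments", "ERROR", "UpstreamTimeout"),
    ("2024-01-01T15:04:00", "auth", "WARN", "SlowResponse"),
    ("2024-01-01T16:30:00", "payments", "INFO", "OK"),
]
con.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", sample_logs)


def run_query(query, params=()):
    """Run a SELECT statement and return the rows as a list of tuples."""
    # Crude guard so an agent-written query cannot modify the data.
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return con.execute(query, params).fetchall()


rows = run_query(
    "SELECT message, COUNT(*) FROM logs "
    "WHERE service = ? AND level = 'ERROR' AND ts BETWEEN ? AND ? "
    "GROUP BY message ORDER BY message",
    ("payments", "2024-01-01T15:00:00", "2024-01-01T16:00:00"),
)
print(rows)  # [('DBConnPoolExhausted', 1), ('UpstreamTimeout', 1)]
```

Timestamps are stored as ISO 8601 strings, so plain string comparison in BETWEEN orders them correctly.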

Giving Our Agents Their Superpowers: The Tools

An agent is only as good as the tools it can use. An LLM on its own can't query a database or analyze a data frame. So, we need to equip our agents with a set of custom "tools" they can call upon to do the real investigative work.

Think of these as specialized functions that the AI can choose to run when it needs to find something out. Here are the key tools we’ll build for them:

The Detective Tools:

detect_incident_window(): This is the agent's "Spidey-sense." It computes a Z-score for each point of a metric (like latency), meaning how many standard deviations that point sits from the metric's mean, and flags periods that are wildly different from the norm. It’s how the agent first realizes, "Hey, something weird is happening here."
sql_investigate(query): This is the agent’s magnifying glass. It gives the agent the power to run SQL queries directly against our metrics and logs database. It can ask specific questions like, "Show me all the errors from the payments service during the incident window."
log_pattern_scan(): This tool is like an expert speed-reader. It can instantly scan thousands of logs from the incident window and summarize the most common error types, the noisiest services, and the most affected endpoints. It helps the agent quickly pinpoint where the smoke is coming from.
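The Z-score idea behind detect_incident_window can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the exact tool: the signature, the threshold of 3.0, and the choice to return the first and last anomalous index as a single window are all made up for the example.

```python
from statistics import mean, pstdev


def detect_incident_window(values, z_threshold=3.0):
    """Return (first_index, last_index) of anomalous points, or None.

    A point is anomalous when its z-score, |value - mean| / std,
    exceeds `z_threshold`.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return None  # a perfectly flat series has no outliers
    anomalous = [
        i for i, v in enumerate(values)
        if abs(v - mu) / sigma > z_threshold
    ]
    if not anomalous:
        return None
    return anomalous[0], anomalous[-1]


# Toy latency series: steady at 100 ms with a five-point spike to 400 ms.
latency = [100] * 50 + [400] * 5 + [100] * 50
print(detect_incident_window(latency))  # (50, 54)
```

One caveat worth knowing: a long incident inflates the mean and standard deviation it is measured against, so a real detector might compute the baseline from a rolling window or from a known-quiet period instead.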

The Strategist Tools:

propose_mitigations(hypothesis): Once the agent has a theory about the root cause (e.g., "I think the database connection pool was exhausted"), this tool helps it brainstorm concrete, actionable fixes. It suggests things like increasing the pool size or adding a circuit breaker, complete with team owners and estimated timelines.
draft_postmortem(): This is the final step. The agent takes everything it has learned (the timeline, the customer impact, the root cause, the key facts, and the mitigation plan) and assembles it all into a structured, production-grade postmortem document.
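To give a feel for the strategist tools, here is a small sketch of both. It is deliberately simplistic and rests on assumptions of mine: the keyword-based PLAYBOOK, the owners, the day estimates, and the JSON field names are hypothetical placeholders rather than the real tools' output format.

```python
import json

# Hypothetical playbook: keyword found in the hypothesis -> candidate fixes.
PLAYBOOK = {
    "connection pool": [
        {"action": "Increase the database connection pool size",
         "owner": "platform", "eta_days": 2},
        {"action": "Add a circuit breaker around database calls",
         "owner": "payments", "eta_days": 7},
    ],
    "timeout": [
        {"action": "Tune upstream timeouts and retry budgets",
         "owner": "api-gateway", "eta_days": 5},
    ],
}


def propose_mitigations(hypothesis):
    """Return every playbook entry whose keyword appears in the hypothesis."""
    text = hypothesis.lower()
    plan = []
    for keyword, actions in PLAYBOOK.items():
        if keyword in text:
            plan.extend(actions)
    return plan


def draft_postmortem(window, impact, hypothesis, facts, mitigations):
    """Assemble the findings into a JSON postmortem string."""
    return json.dumps({
        "timeline": {"start": window[0], "end": window[1]},
        "customer_impact": impact,
        "root_cause_hypothesis": hypothesis,
        "key_facts": facts,
        "mitigations": mitigations,
    }, indent=2)


hypothesis = "Database connection pool exhaustion in db-proxy caused upstream timeouts"
plan = propose_mitigations(hypothesis)
doc = draft_postmortem(
    ("T0", "T1"),
    "Elevated latency and errors on payment requests",
    hypothesis,
    ["UpstreamTimeout and DBConnPoolExhausted were the top errors"],
    plan,
)
```

Returning plain dictionaries and JSON keeps the output easy for another agent, or a human, to consume.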

With this toolkit, our agents can do much more than just process text. They can actively investigate a problem, form a hypothesis, and document their findings just like a human engineer would.

Assembling the AI Dream Team

Now for the really cool part. We’re not just building one monolithic agent. We’re building a team of specialized agents, each with its own role and expertise. This is a core concept in creating advanced AI systems: breaking down a complex task and assigning parts of it to agents designed for that specific job.

Our incident response team has three members:

The Profiler Agent (The Analyst): This agent is our data specialist. Its entire job is to be the detective. It uses the detect_incident_window, sql_investigate, and log_pattern_scan tools to dig through the data. Its goal is to analyze all the raw signals and synthesize them into a crisp, clear summary: what happened, who was affected, and a falsifiable hypothesis about the root cause.
The Writer Agent (The Communicator): This agent is our technical writer. It’s not great at investigation, but it excels at communication. Its only tool is draft_postmortem. It takes the structured findings from the Profiler and the mitigation plan and formats them into a perfect postmortem document.
The Coordinator Agent (The Incident Commander): This is the team lead. The Coordinator is the only agent we talk to directly. It understands the overall goal: "resolve the incident." It manages the entire workflow from start to finish. It knows to first call the detection tools, then delegate the deep investigation to the Profiler, and finally, hand off the findings to the Writer to draft the report.

This structure is what makes the system so powerful. The Coordinator doesn't need to know the nitty-gritty of SQL queries. It just needs to know that it has a "Profiler" specialist on its team that it can rely on for that. By orchestrating these specialists, the Coordinator can solve a problem that would be too complex for any single agent to handle on its own.
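The delegation pattern itself is easy to sketch in plain Python. To be clear, this is not Haystack's API, and there is no LLM in it: the Coordinator class, the stub specialists, and their hard-coded findings are assumptions for illustration, and the order of calls is fixed here, whereas in the real system the model decides which tool to call next. The names profiler_specialist and postmortem_writer_specialist match the tool names the Coordinator uses for its two specialists.

```python
def detect_incident_window():
    # Stub: a real tool would scan the metrics; T0/T1 are placeholders.
    return ("T0", "T1")


def profiler_specialist(window):
    # Stub for the Profiler agent; its findings are hard-coded here.
    return {
        "window": window,
        "hypothesis": "DB connection pool exhaustion in db-proxy",
        "key_facts": ["UpstreamTimeout and DBConnPoolExhausted dominate the logs"],
    }


def propose_mitigations(hypothesis):
    # Stub: returns a fixed plan regardless of the hypothesis.
    return ["Increase pool size", "Add a circuit breaker"]


def postmortem_writer_specialist(findings, mitigations):
    # Stub for the Writer agent: merges everything into one document.
    return {"postmortem": {**findings, "mitigations": mitigations}}


class Coordinator:
    """Calls tools by name and records the order in which they ran."""

    def __init__(self, tools):
        self.tools = tools  # name -> callable; specialists look like any tool
        self.trace = []

    def call(self, name, **kwargs):
        self.trace.append(name)
        return self.tools[name](**kwargs)

    def run(self):
        window = self.call("detect_incident_window")
        findings = self.call("profiler_specialist", window=window)
        plan = self.call("propose_mitigations", hypothesis=findings["hypothesis"])
        return self.call("postmortem_writer_specialist",
                         findings=findings, mitigations=plan)


coordinator = Coordinator({
    "detect_incident_window": detect_incident_window,
    "profiler_specialist": profiler_specialist,
    "propose_mitigations": propose_mitigations,
    "postmortem_writer_specialist": postmortem_writer_specialist,
})
report = coordinator.run()
```

The point of the sketch is the shape: from the Coordinator's side, a specialist agent is just another named tool that takes structured input and returns structured output.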

Showtime: Putting the AI Team to Work

Alright, the stage is set. We have our synthetic incident data, our agents have their tools, and the team is assembled. Now, let's kick things off.

We give the Coordinator a simple, high-level task:

"We have an incident in the last 24h. Investigate using the provided CSVs. Produce a realistic postmortem JSON and a runbook checklist."

And that’s it. We press "run" and watch the Coordinator take over.

Here’s a glimpse of its thought process, which we can see through the chain of tool calls:

Detection: The Coordinator first calls detect_incident_window on the p95_ms metric. It gets a result back: "Found an anomaly window starting at T0 and ending at T1." Now it knows when the problem occurred.
Delegation to the Profiler: The Coordinator now knows enough to bring in the specialist. It calls the profiler_specialist tool (which is our Profiler agent) and gives it the incident window.
Investigation: The Profiler gets to work. It runs a log_pattern_scan on that specific time window. It discovers that UpstreamTimeout and DBConnPoolExhausted are the top errors, and the payments and db-proxy services are the most affected. It then runs a few sql_investigate queries to confirm its suspicions.
Synthesis: The Profiler synthesizes its findings into a structured JSON object containing a clear hypothesis: "The root cause appears to be database connection pool exhaustion in the db-proxy service, leading to upstream timeouts in the api-gateway and payments services." It sends this back to the Coordinator.
Mitigation: The Coordinator now has a solid hypothesis. It calls the propose_mitigations tool with this hypothesis, which returns a list of concrete fixes, each with a team owner and an estimated timeline.
Delegation to the Writer: Finally, the Coordinator has all the pieces. It calls the postmortem_writer_specialist tool, handing it the incident window, the root cause hypothesis, the key facts, and the mitigation plan.
Final Report: The Writer agent uses its draft_postmortem tool to generate the final, clean JSON document and sends it back.

The Coordinator then presents the final output: a concise executive summary, the full postmortem JSON, and a handy runbook checklist for the human team. All done autonomously, in a fraction of the time it would take a human to even begin.

This is Just the Beginning

What we've built here is more than just a cool demo. It's a blueprint for a new way of thinking about automation. By using a framework like Haystack to create coordinated, tool-using agents, we can move beyond simple chatbots and start tackling real, complex, multi-step workflows.

The key takeaway isn't just about the LLM's ability to generate text; it's about the power of orchestration. By breaking down a problem and building a team of specialists, we can create AI systems that are more robust, more reliable, and far more capable.

Imagine applying this pattern to other areas: automating financial audits, managing complex cloud infrastructure, or even helping with scientific research. The possibilities are genuinely exciting. So next time you get that 3 AM alert, you might just have an AI team already on the case.

We Built an AI Team to Handle On-Call Incidents (Here's How)
