Building an AI Incident Response Team with OpenAI Swarm

Akram Chauhan

It’s 2 AM. Your phone buzzes with that all-too-familiar, heart-stopping alert. Production is on fire. Latency has spiked, errors are climbing, and you’re the one on call. For the next hour, you’ll be frantically digging through logs, trying to correlate metrics, and piecing together what went wrong, all while your brain is still half-asleep.

We’ve all been there. It’s stressful, chaotic, and incredibly manual.

But what if you had a team of specialists ready to jump in the second that alert fired? A team that could instantly triage the issue, suggest solutions, draft customer communications, and even prepare handoff notes for the next engineer? That’s the promise of multi-agent AI systems, and honestly, it’s not science fiction anymore.

Today, I want to show you something really cool. We’re going to build a practical incident response team using a framework called OpenAI Swarm. One caveat up front: Swarm is an experimental, educational framework, so treat what follows as a working prototype rather than something to wire straight into your paging system. Forget the heavy, complex setups. We’ll do this right in a Colab notebook, keeping things clean, simple, and surprisingly powerful. Think of it as assembling your own AI-powered pit crew for when things go sideways.

First, Let's Get Our Workshop Set Up

Before we can build our team, we need to get our tools in order. This is the boring-but-necessary part, but I'll make it quick. We’re just installing the OpenAI and Swarm libraries and setting up our API key.

If you’re running this in Google Colab, it has a handy userdata feature to securely store your API key. If not, it’ll just prompt you to enter it. Simple.

# Make sure we have the right tools
!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"

import os

# A safe way to load your OpenAI API key
def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()

    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    
    return key

os.environ["OPENAI_API_KEY"] = load_openai_key()

With our key loaded, the next step is to initialize the Swarm client. This is the central nervous system of our whole operation. It runs the conversation loop: it calls the model, executes whatever tool functions an agent asks for, and switches the active agent when a handoff happens.

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent

# This is our main controller
client = Swarm()

Alright, setup complete. Now for the fun part.

Giving Our Agents a Brain (and a Company Playbook)

An AI agent is only as good as the information it has access to. We can’t expect it to magically know our company’s specific incident response procedures. So, let's give our agents a small, internal knowledge base to work from.

Think of this like a mini-wiki or a set of runbooks that we can feed to our agents on demand.

KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]

Of course, just having the documents isn’t enough. We need a way for our agents to search them. We’ll create a simple search function that returns up to top_k documents, ranked by how many words they share with the query. It’s not fancy, just exact keyword overlap with no stemming or embeddings, but for a knowledge base this small it’s enough to give the agents the context they need, right when they need it.

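def _normalize(s: str) -> List[str]:
    # Lowercase, replace punctuation with spaces, and split into words
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    """Return up to top_k KB documents, ranked by words shared with the query."""
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        # Score = number of distinct query words found in the title or text
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    
    scored.sort(key=lambda x: x[0], reverse=True)
    # Keep only documents that matched; if none did, fall back to the top-ranked one
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)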

Now, when an agent is confused, it can just call search_kb("latency spike") and get our official playbook on the topic. That’s a huge step up from just guessing.
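To make the matching behavior concrete, here is a minimal standalone sketch of the same word-overlap idea. The two-document corpus and the names DOCS, tokens, and overlap_scores are made up for illustration and are not part of the notebook. Notice that "spike" does not match "spikes" and "deploy" does not match "deploys", because only exact words count.

```python
import re

# Hypothetical two-document corpus, used only to illustrate the scoring
DOCS = {
    "latency": "If p95 latency spikes, validate deploys, dependencies, and error rates.",
    "handoff": "Include summary, timeline, current status, mitigations, and owners.",
}

def tokens(text: str) -> set:
    """Lowercase, replace punctuation with spaces, and return the unique words."""
    return set(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())

def overlap_scores(query: str) -> dict:
    """Count how many distinct query words appear in each document."""
    q = tokens(query)
    return {name: len(q & tokens(text)) for name, text in DOCS.items()}

scores = overlap_scores("latency spike after deploy")
print(scores)  # {'latency': 1, 'handoff': 0}
```

If near-misses like these matter for your runbooks, a stemmer or embedding search would be the natural upgrade, at the cost of an extra dependency.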

A Tool for Making Smarter, Faster Decisions

During an incident, you’re often faced with a bunch of potential fixes. Should you roll back the latest deploy? Scale up the servers? Rate-limit a noisy customer? Making the right call under pressure is tough.

Let's give our SRE agent a specialized tool to help with this. We’ll create a function that takes a list of possible solutions and ranks them with a simple heuristic: each option’s confidence minus a fixed penalty for its risk level. The confidence values still come from the model, so this is not measurement, but it does force a more structured approach to decision-making.

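def estimate_mitigation_impact(options_json: str) -> str:
    """Rank mitigation options by confidence minus a fixed risk penalty."""
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})

    # The model may send a single object or something else entirely;
    # return an error payload instead of raising inside the tool call
    if not isinstance(options, list):
        return json.dumps({"error": "expected a JSON list of option objects"})

    penalties = {"low": 0.1, "medium": 0.25, "high": 0.45}
    ranking = []
    for o in options:
        if not isinstance(o, dict):
            continue  # skip malformed entries
        try:
            conf = float(o.get("confidence", 0.5))
        except (TypeError, ValueError):
            conf = 0.5  # fall back to the default when confidence is not numeric
        risk = str(o.get("risk", "medium")).lower()
        penalty = penalties.get(risk, 0.25)  # unknown risk labels count as medium
        
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })

    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)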

This makes a real difference. Instead of just getting a wall of text, the agent can now produce a ranked list of actions with an explicit score for each, helping the human engineer prioritize much more effectively.
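Here is the arithmetic on a small example. This is a standalone sketch: the penalty table matches the tool above, but the three options, their confidence values, and the helper name score are invented for illustration.

```python
import json

# Same penalty table as the ranking tool; the options below are made up
RISK_PENALTY = {"low": 0.1, "medium": 0.25, "high": 0.45}

def score(confidence: float, risk: str) -> float:
    """Confidence minus the risk penalty, rounded to three decimals."""
    return round(confidence - RISK_PENALTY.get(risk, 0.25), 3)

options = [
    {"option": "scale out API pods", "confidence": 0.6, "risk": "medium"},
    {"option": "roll back deploy", "confidence": 0.8, "risk": "low"},
    {"option": "rate-limit traffic", "confidence": 0.5, "risk": "high"},
]

ranked = sorted(options, key=lambda o: score(o["confidence"], o["risk"]), reverse=True)
for o in ranked:
    print(o["option"], score(o["confidence"], o["risk"]))
# roll back deploy 0.7
# scale out API pods 0.35
# rate-limit traffic 0.05

# The agent would pass the options to the tool as a JSON string like this one
payload = json.dumps(options)
```

A high-risk option needs a confidence above 0.7 to outrank a medium-risk option at 0.5, which is roughly the caution you want at 2 AM.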

Assembling the Dream Team: Meet Your Agents

This is where the "multi-agent" system really comes to life. We’re not building one giant, know-it-all AI. Instead, we’re creating a team of specialists, each with a very specific job. Think of it like a real-world incident response team.

First, we need a way for them to pass the baton. We'll define a few simple "handoff" functions. These are like telling one agent, "Your job is done, pass this over to the SRE."

# These functions tell Swarm which agent to pass control to next
def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent
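If the handoff mechanism feels like magic, this toy model may help. It is a simplified sketch of the idea, not Swarm's actual code: ToyAgent and call_tool are hypothetical stand-ins. The rule it demonstrates is the one Swarm applies: when a tool function returns an agent, that agent becomes the active one. It also shows why the handoff functions can be defined before the agents exist, since Python looks up the name only when the function is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ToyAgent:
    """Stand-in for swarm.Agent with only the fields this sketch needs."""
    name: str
    functions: List[Callable] = field(default_factory=list)

def call_tool(active: ToyAgent, tool_name: str) -> ToyAgent:
    """Call one of the active agent's tools; switch agents if it returns one."""
    tools = {f.__name__: f for f in active.functions}
    result = tools[tool_name]()
    return result if isinstance(result, ToyAgent) else active

def handoff_to_sre():
    # `sre` is resolved at call time, so it can be defined further down
    return sre

triage = ToyAgent("Triage", [handoff_to_sre])
sre = ToyAgent("SRE")

active = call_tool(triage, "handoff_to_sre")
print(active.name)  # SRE
```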

Now, let's introduce the team members:

1. The Triage Agent: This is our dispatcher. Its only job is to look at the initial alert and decide who is best equipped to handle it. Is it a technical problem for the SRE? A communication task for the Comms agent? The Triage agent makes that first critical decision.

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
    Decide which agent should handle the request. 
    Use SRE for incident response. 
    Use Comms for customer or executive messaging. 
    Use HandoffWriter for on-call notes. 
    Use Critic for review or improvement.
    """,
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)

2. The SRE Agent: This is our boots-on-the-ground engineer. It takes the incident details, searches the knowledge base, and formulates a concrete action plan using the ranking tool we built.

sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
    Produce a structured incident response with triage steps, 
    ranked mitigations, ranked hypotheses, and a 30-minute plan.
    """,
    functions=[search_kb, estimate_mitigation_impact]
)

3. The Comms Agent: While the SRE is working on the fix, the Comms agent is handling communication. It drafts clear, concise updates for both external customers and internal stakeholders, following the guidelines from our knowledge base.

comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
    Produce an external customer update and an internal technical update.
    """,
    functions=[search_kb]
)

4. The Handoff Writer: No one likes writing handoff notes at the end of a long incident. This agent takes all the information generated so far and formats it into a clean, standard handoff document for the next on-call engineer.

handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
    Produce a clean on-call handoff document with standard headings.
    """,
    functions=[search_kb]
)

5. The Critic: This is my favorite part. The Critic agent acts as a quality control layer. Its job is to review the output from the other agents and suggest improvements. It ensures the final response is polished, thorough, and ready for human eyes.

critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
    Critique the previous answer, then produce a refined final version and a checklist.
    """
)

By breaking down the problem this way, each agent can be simpler, more focused, and ultimately more reliable.

Putting It All Into Motion

We’ve built the team, given them tools, and defined their roles. Now, let’s see them in action. We'll create a simple pipeline that simulates a real incident.

The user provides an initial alert. The Triage agent picks it up and hands it off to the right specialist. The specialists have no handoff functions of their own, so each run ends with the one agent Triage chose. After that agent does its work, the Critic agent comes in for a final review.

def run_pipeline(user_request: str):
    # Initial run with the Triage agent, which will hand off to a specialist
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    
    # r1.messages holds only the messages produced during the run, so prepend
    # the original request to give the Critic the full conversation
    messages2 = messages + r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    
    return r2.messages[-1]["content"]

# Here's our 2 AM alert
request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy. 
Errors slightly increased, DB CPU stable, upstream timeouts rising. 
Provide a 30-minute action plan and a customer update.
"""

# Let's run the pipeline!
print(run_pipeline(request))

And just like that, with one function call, you kick off the whole workflow. Triage routes the alert, a specialist drafts the response using the knowledge base and its tools, and the Critic reviews and refines the result before it reaches human eyes.
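If you want both the action plan and the customer update every time, one option is to skip routing and call the specialists directly. The sketch below is one way to do that, not part of the original pipeline: run_specialists is a hypothetical helper, and it assumes only what the code above already uses, namely that client.run accepts agent, messages, and max_turns and returns an object with a messages list.

```python
from typing import Any, Dict, List

def run_specialists(client: Any, agents: List[Any], user_request: str,
                    max_turns: int = 6) -> Dict[str, str]:
    """Run each agent on the same request and collect its final reply by name."""
    drafts = {}
    for agent in agents:
        # Each specialist gets a fresh conversation containing only the alert
        messages = [{"role": "user", "content": user_request}]
        result = client.run(agent=agent, messages=messages, max_turns=max_turns)
        drafts[agent.name] = result.messages[-1]["content"]
    return drafts

# In the notebook, usage would look like:
# drafts = run_specialists(client, [sre_agent, comms_agent], request)
# print(drafts["SRE"]); print(drafts["Comms"])
```

The tradeoff is cost: you pay for every specialist on every alert, whereas Triage only pays for the one it picks.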

This is More Than Just a Cool Demo

What I love about this approach is how clear and controllable it is. We’re not just throwing a prompt at a massive model and hoping for the best. We’ve designed a system with clear roles, specialized tools, and a built-in feedback loop.
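That clarity is something you can check for yourself. Here is a small sketch for tracing which agent spoke and which tools ran. It assumes Swarm tags assistant messages with a sender key and tool results with a tool_name key, which is how its run loop labels them as far as I know. The helper name trace_run and the sample messages are invented for illustration.

```python
from typing import Dict, List

def trace_run(messages: List[Dict]) -> List[str]:
    """Summarize who spoke and which tools ran, in order."""
    trail = []
    for m in messages:
        if m.get("role") == "assistant":
            # Assumed: the producing agent's name is stored under "sender"
            trail.append(f"agent:{m.get('sender', 'unknown')}")
        elif m.get("role") == "tool":
            # Assumed: the called function's name is stored under "tool_name"
            trail.append(f"tool:{m.get('tool_name', 'unknown')}")
    return trail

# Hand-written sample shaped like a short Triage-to-SRE run
sample = [
    {"role": "assistant", "sender": "Triage", "content": None},
    {"role": "tool", "tool_name": "handoff_to_sre", "content": "..."},
    {"role": "assistant", "sender": "SRE", "content": "30-minute plan"},
]
print(trace_run(sample))  # ['agent:Triage', 'tool:handoff_to_sre', 'agent:SRE']
```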

This pattern—specialized agents, tool augmentation, and a final critique step—is an incredibly powerful way to build reliable AI systems that can handle complex, real-world tasks. It moves us away from unpredictable, free-form generation and toward structured, agentic workflows.

The next time you get that late-night alert, imagine having this AI team right there with you, instantly providing a first draft of your entire response. It won't replace your expertise, but it can be an unbelievably powerful co-pilot, freeing you up to focus on the highest-level strategic decisions. And maybe, just maybe, it’ll help you get back to sleep a little faster.


© 2026 Aicosoft. All rights reserved.