How We Taught an OpenAI Agent to Get Smarter on Its Own with A-Evolve

Akram Chauhan

Have you ever built an AI agent that was almost perfect? It gets things right 80% of the time, but that last 20% is just maddening. It messes up JSON formatting, forgets a crucial instruction, or just gets a little too creative when you need it to be precise.

We've all been there, stuck in an endless loop of prompt tweaking. It feels more like an art than a science.

But what if we could treat agent improvement like a real engineering problem? What if we could build a system that automatically identifies an agent's weaknesses and then evolves it to get better, cycle after cycle?

That’s exactly what A-Evolve is all about. It’s a framework that lets you set up a "gym" for your AI agent, complete with a workout routine (benchmarks) and a personal trainer (an evolution engine) that pushes it to improve.

Today, I want to walk you through how we can do this from scratch. We’re going to build a simple text-transformation agent, watch it fail, and then use A-Evolve to systematically teach it how to succeed. Let's get our hands dirty.

Getting the Gym Ready: Setting Up Our Environment

First things first, we need to set up our workspace. Think of this as laying out the mats, racking the weights, and getting our agent’s locker ready. We'll be doing this in a Colab notebook, which makes it super easy to follow along.

We start by installing the necessary Python packages and cloning the A-Evolve repository right into our environment.

import os
import sys
import shutil
from pathlib import Path
from getpass import getpass
import subprocess

# Install what we need
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "openai>=1.30.0", "pyyaml>=6.0", "matplotlib>=3.8"])

# Clone the A-Evolve repo
REPO_DIR = Path("/content/a-evolve")
if REPO_DIR.exists():
    shutil.rmtree(REPO_DIR)
subprocess.check_call(["git", "clone", "--depth", "1", "https://github.com/A-EVO-Lab/a-evolve.git", str(REPO_DIR)])
sys.path.insert(0, str(REPO_DIR))

# Get our OpenAI API key
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ").strip()

OPENAI_MODEL = "gpt-4o-mini"

Next, we create the agent's "workspace." This is just a directory with a few special folders for its prompts, skills, and memory. This structure is what allows A-Evolve to modify the agent over time.

We start with a very basic system prompt—just a few lines telling it to be a precise text-transformation agent. This is our agent's starting point. It's not very smart yet, but that's the whole point.

import yaml
import textwrap

WORKSPACE_ROOT = Path("/content/a_evolve_demo_workspace")
if WORKSPACE_ROOT.exists():
    shutil.rmtree(WORKSPACE_ROOT)

# Create the directory structure
(WORKSPACE_ROOT / "prompts").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "skills").mkdir(parents=True, exist_ok=True)
(WORKSPACE_ROOT / "memory").mkdir(parents=True, exist_ok=True)

# Define the initial system prompt
initial_system_prompt = textwrap.dedent("""
You are a precise text-transformation agent. Solve the task exactly. Be concise.
Return only the final answer with no explanation unless the task explicitly asks for JSON.
""").strip()

(WORKSPACE_ROOT / "prompts" / "system.md").write_text(initial_system_prompt)

With our environment set up and a barebones agent ready to go, it's time to define what "success" even looks like.

Defining the Workout: Our Custom Benchmark

An agent is only as good as the test you measure it against. In A-Evolve, this test is called a "benchmark." It’s basically a set of tasks with clear right-or-wrong answers.

For our agent, we'll create a small dataset of text-transformation tasks. We’ll ask it to do things like:

  • Sum numbers and return a clean JSON object.
  • Create an acronym from a phrase.
  • Clean up, sort, and format a list of words.
  • Count vowels and determine if the count is odd or even.

We split this into a train set (what the agent practices on) and a holdout set (the final exam to see if it actually learned anything).

def build_dataset():
    train = [
        {"id": "train-01", "rule": "json_sum", "input": "Numbers: 7, 11, 4", "answer": '{"sum":22}'},
        {"id": "train-03", "rule": "acronym_upper", "input": "Create the acronym from: retrieval augmented generation", "answer": "RAG"},
        # ... and so on for other rules
    ]
    holdout = [
        {"id": "holdout-01", "rule": "json_sum", "input": "Numbers: 100, 1, 9", "answer": '{"sum":110}'},
        {"id": "holdout-02", "rule": "acronym_upper", "input": "Create the acronym from: artificial general intelligence", "answer": "AGI"},
        # ... and so on
    ]
    return train, holdout

TRAIN_DATA, HOLDOUT_DATA = build_dataset()
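
One way to keep the gold answers honest is to compute them instead of typing them by hand. The sketch below assumes the two rules shown above behave the way their examples suggest; `solve_json_sum`, `solve_acronym_upper`, `REFERENCE_SOLVERS`, and `check_dataset` are hypothetical helper names, not part of A-Evolve.

```python
import json
import re

def solve_json_sum(text: str) -> str:
    """Sum every integer in the input and return compact JSON."""
    numbers = [int(n) for n in re.findall(r"-?\d+", text)]
    return json.dumps({"sum": sum(numbers)}, separators=(",", ":"))

def solve_acronym_upper(text: str) -> str:
    """Take the phrase after the first colon and join its upper-cased initials."""
    phrase = text.split(":", 1)[-1]
    return "".join(word[0].upper() for word in phrase.split())

# Hypothetical registry: rule name -> reference implementation
REFERENCE_SOLVERS = {"json_sum": solve_json_sum, "acronym_upper": solve_acronym_upper}

def check_dataset(rows):
    """Return the ids of rows whose stored answer disagrees with the reference solver."""
    bad = []
    for row in rows:
        solver = REFERENCE_SOLVERS.get(row["rule"])
        if solver is not None and solver(row["input"]) != row["answer"]:
            bad.append(row["id"])
    return bad
```

Running `check_dataset(TRAIN_DATA + HOLDOUT_DATA)` before training would flag any row whose answer was mistyped. Rules without a reference solver are skipped.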

Then, we wrap this data in a custom BenchmarkAdapter. This class tells A-Evolve how to get tasks and, more importantly, how to evaluate the agent's answers. We're going for exact matches here—no partial credit!

import json
from agent_evolve.protocol.base_agent import BaseAgent
from agent_evolve.benchmarks.base import BenchmarkAdapter
from agent_evolve.types import Task, Trajectory, Feedback

class MiniTextBenchmark(BenchmarkAdapter):
    def __init__(self):
        self.train = TRAIN_DATA
        self.holdout = HOLDOUT_DATA

    def get_tasks(self, split: str = "train", limit: int = 10):
        # ... code to package data into Task objects ...
        pass

    def evaluate(self, task: Task, trajectory: Trajectory):
        pred = trajectory.output.strip()
        gold = task.metadata["answer"].strip()
        # Exact match after removing spaces, so '{"sum": 22}' and '{"sum":22}' count as equal
        success = pred.replace(" ", "") == gold.replace(" ", "")
        score = 1.0 if success else 0.0
        return Feedback(success=success, score=score, detail=json.dumps({"pred": pred, "gold": gold}))
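
The `get_tasks` body is elided above. Here is a minimal sketch of what it might do, assuming a task only needs an id, an input string, and a metadata dict (the fields the rest of this article reads). `SketchTask` is a stand-in dataclass so the snippet runs on its own; the real `Task` constructor in A-Evolve may differ.

```python
from dataclasses import dataclass, field

@dataclass
class SketchTask:
    """Stand-in for A-Evolve's Task type (assumed fields only)."""
    id: str
    input: str
    metadata: dict = field(default_factory=dict)

def rows_to_tasks(rows, limit=10):
    """Package raw dataset rows into task objects, keeping rule and answer as metadata."""
    return [
        SketchTask(id=r["id"], input=r["input"],
                   metadata={"rule": r["rule"], "answer": r["answer"]})
        for r in rows[:limit]
    ]

def answers_match(pred: str, gold: str) -> bool:
    """Same comparison as evaluate(): ignore spaces, otherwise require equality."""
    return pred.strip().replace(" ", "") == gold.strip().replace(" ", "")
```

Keeping the gold answer in `metadata` rather than in the task input means the agent never sees it, while `evaluate` can still look it up.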

Now we have our workout plan. Let's introduce the agent who's going to be doing all the heavy lifting.

Meet the Agent: Our Custom OpenAI Solver

Our agent is a simple class that inherits from A-Evolve's BaseAgent. Its main job is to take a task, think about it, and produce an answer.

Here’s the interesting part: when it gets a task, it doesn't just blindly pass the input to OpenAI. It first looks at its workspace to see if it has any relevant skills or memory that might help.

  • Skills: These are like little instruction manuals for specific tasks. For example, a "json-sum" skill would tell the agent exactly how to format the JSON output.
  • Memory: This is a log of past experiences, especially failures. It helps the agent avoid making the same mistake twice.

The solve method pieces this all together—the system prompt, the user's task, any relevant skills, and recent memories—into one big prompt for the OpenAI model.

from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class ColabAEResolverAgent(BaseAgent):
    def __init__(self, workspace_dir, model=OPENAI_MODEL):
        self.model = model
        super().__init__(workspace_dir)

    def _pick_relevant_skills(self, task: Task):
        # ... logic to find skills that match the task's rule ...
        pass

    def solve(self, task: Task) -> Trajectory:
        # 1. Load the current system prompt from the workspace
        # 2. Find relevant skills for this specific task
        # 3. Grab the last few memories
        # 4. Build a detailed prompt for the LLM
        # 5. Get the response from OpenAI
        # 6. Save a memory of what just happened
        # 7. Return the final output
        # (This is a simplified explanation of the code in the original article)
        
        # Example of building the user prompt
        user_prompt = f"""
        TASK RULE: {task.metadata.get("rule")}
        TASK INPUT: {task.input}
        
        RELEVANT SKILLS:
        {self.get_skill_content(...)}
        
        RECENT MEMORIES:
        {self.memories[-8:]}
        
        Solve the task exactly.
        """
        response = client.chat.completions.create(...)
        output = response.choices[0].message.content or ""
        self.remember(content=f"Task {task.id} produced output: {output}")
        return Trajectory(task_id=task.id, output=output)
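
The skill lookup and prompt assembly are elided above. One simple approach is to match skill names against the task's rule. The sketch below assumes each skill lives at `skills/<name>/SKILL.md` and that names contain the rule (as in the `skill-for-<rule>` naming the engine uses later in this article). The directory layout and both helper functions are assumptions, not A-Evolve's actual API.

```python
from pathlib import Path

def pick_relevant_skills(skills_dir: Path, rule: str) -> list[str]:
    """Return the text of every skill whose folder name mentions the rule."""
    matches = []
    if not skills_dir.exists():
        return matches
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        if rule in skill_file.parent.name:
            matches.append(skill_file.read_text().strip())
    return matches

def build_user_prompt(rule: str, task_input: str, skills: list[str], memories: list[str]) -> str:
    """Assemble the sections in a fixed order so the model always sees the same layout."""
    skill_text = "\n\n".join(skills) if skills else "(none)"
    # Only the eight most recent memories are included, as in the snippet above
    memory_text = "\n".join(f"- {m}" for m in memories[-8:]) if memories else "(none)"
    return (
        f"TASK RULE: {rule}\n"
        f"TASK INPUT: {task_input}\n\n"
        f"RELEVANT SKILLS:\n{skill_text}\n\n"
        f"RECENT MEMORIES:\n{memory_text}\n\n"
        "Solve the task exactly."
    )
```

Building the prompt without leading indentation also avoids sending the stray whitespace that an indented triple-quoted f-string would include.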

At the start, our agent has no skills and no memory. It's relying entirely on that short system prompt. Before we see how it does on its first day at the gym, it needs a trainer.

The Personal Trainer: Our Evolution Engine

This is where the real magic happens. The EvolutionEngine is the "personal trainer" that watches the agent, identifies its mistakes, and then modifies its workspace to help it improve.

We'll build a custom engine that follows a few simple rules:

  1. If the agent is making basic formatting mistakes, we'll "harden" its system prompt by adding a strict output contract. This is like the trainer saying, "Focus on your form!"
  2. If the agent fails on a specific type of task (like json_sum), we'll write a new "skill" file for it. This gives the agent a step-by-step guide for that exact task. It’s like the trainer giving a detailed tutorial for a specific exercise.
  3. After every cycle, we'll add a memory of what went wrong. This helps the agent learn from its failures.

from agent_evolve.engine.base import EvolutionEngine
from agent_evolve.types import StepResult
from collections import defaultdict

class ColabMutationEngine(EvolutionEngine):
    def step(self, workspace, observations, history, trial):
        failed_by_rule = defaultdict(list)
        for obs in observations:
            if not obs.feedback.success:
                failed_by_rule[obs.task.metadata["rule"]].append(...)

        mutated = False
        
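        # Rule 1: Harden the prompt once, unless the contract is already there
        # (PROMPT_APPENDIX and SKILL_TEMPLATES are constants defined elsewhere in the notebook)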
        current_prompt = workspace.read_prompt()
        if "STRICT OUTPUT CONTRACT" not in current_prompt:
            workspace.write_prompt(current_prompt + "\n\n" + PROMPT_APPENDIX)
            mutated = True

        # Rule 2: Add skills for rules that the agent failed on
        for rule, fails in failed_by_rule.items():
            if not workspace.has_skill(f"skill-for-{rule}"):
                workspace.write_skill(f"skill-for-{rule}", SKILL_TEMPLATES[rule])
                mutated = True
        
        # Rule 3: Add one memory per failed rule (nothing is recorded if every task passed)
        for rule, fails in failed_by_rule.items():
            workspace.add_memory({"content": f"Failed on rule {rule} {len(fails)} times."})
        
        return StepResult(mutated=mutated, summary="Applied mutations based on failures")
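
The engine refers to `PROMPT_APPENDIX` and `SKILL_TEMPLATES` without showing them. The following is a hypothetical illustration of what such constants could contain, based only on the two rules and answer formats shown earlier. The wording is an assumption, not the original notebook's text, and `harden_prompt` is a made-up helper that isolates the Rule 1 check.

```python
# Hypothetical contents; the original notebook's wording is not shown in this article.
PROMPT_APPENDIX = "\n".join([
    "STRICT OUTPUT CONTRACT:",
    "- Output only the final answer, with no preamble and no code fences.",
    "- For JSON answers, emit compact JSON with no spaces or trailing text.",
])

SKILL_TEMPLATES = {
    "json_sum": 'Add every number in the input. Reply with exactly {"sum":<total>}.',
    "acronym_upper": "Take the first letter of each word in the phrase, upper-case, no separators.",
}

def harden_prompt(current_prompt: str) -> tuple[str, bool]:
    """Append the contract once; return the new prompt and whether it changed."""
    if "STRICT OUTPUT CONTRACT" in current_prompt:
        return current_prompt, False
    return current_prompt + "\n\n" + PROMPT_APPENDIX, True
```

Checking for the marker string before appending makes the mutation idempotent, so running the engine for several cycles does not stack the same contract onto the prompt over and over.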

This engine is the heart of the A-Evolve process. It’s not just randomly changing things; it’s making targeted, intelligent "mutations" to the agent's workspace based on real performance data.

Let's Train! Running the Evolution Loop

Okay, we have our agent, our benchmark, and our evolution engine. It's time to put it all together and run the training loop.

First, we'll get a baseline score. We let our brand-new, unskilled agent try the train and holdout tasks to see how poorly it does. Spoiler: it's not going to be great.

benchmark = MiniTextBenchmark()
agent = ColabAEResolverAgent(WORKSPACE_ROOT, model=OPENAI_MODEL)
engine = ColabMutationEngine()

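# Get our starting score (evaluate_split is a small helper, not shown here,
# that runs the agent on one split and returns its score as the first value)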
baseline_train_score, _ = evaluate_split(agent, benchmark, split="train")
baseline_holdout_score, _ = evaluate_split(agent, benchmark, split="holdout")

print(f"Baseline train score : {baseline_train_score:.3f}")
print(f"Baseline holdout score : {baseline_holdout_score:.3f}")
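
`evaluate_split` is used above but never shown. Here is a minimal sketch, assuming it runs every task in a split through the agent and averages the benchmark's scores. It relies only on the `get_tasks`, `solve`, and `evaluate` methods defined earlier; the `limit` parameter and the shape of the returned rows are assumptions.

```python
def evaluate_split(agent, benchmark, split="train", limit=100):
    """Run the agent over one split; return (mean score, per-task rows)."""
    rows = []
    for task in benchmark.get_tasks(split=split, limit=limit):
        trajectory = agent.solve(task)
        feedback = benchmark.evaluate(task, trajectory)
        rows.append({"task_id": task.id, "success": feedback.success, "score": feedback.score})
    # Guard against an empty split so we never divide by zero
    mean_score = sum(r["score"] for r in rows) / len(rows) if rows else 0.0
    return mean_score, rows
```

Returning the per-task rows alongside the mean makes it easy to see which rules are failing, not just how many.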

Now, we fire up the Evolver. We'll tell it to run for 4 cycles. In each cycle, it will:

  1. Have the agent attempt a batch of tasks from the benchmark.
  2. Collect the results (the observations).
  3. Feed those results to our ColabMutationEngine.
  4. Let the engine decide if and how to mutate the agent's workspace (by adding skills, memory, or changing the prompt).
  5. Have the agent hot-reload its workspace before the next cycle begins.

import agent_evolve as ae

config = ae.EvolveConfig(batch_size=8, max_cycles=4)
evolver = ae.Evolver(agent=agent, benchmark=benchmark, config=config, engine=engine)
result = evolver.run()
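
To make the five steps concrete, here is a plain-Python sketch of what one evolution loop could look like. It is not A-Evolve's implementation: `run_cycles` and `Observation` are hypothetical names, and the workspace is assumed to be whatever object the engine's `step` method expects. The call shapes (`engine.step(workspace, observations, history, trial)` and `agent.reload_from_fs()`) mirror the ones used in this article.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Observation:
    """Hypothetical record of one attempt: the task, what the agent did, and the score."""
    task: Any
    trajectory: Any
    feedback: Any

def run_cycles(agent, benchmark, engine, workspace, batch_size=8, max_cycles=4):
    """Solve a batch, score it, let the engine mutate the workspace, reload, repeat."""
    history = []
    for cycle in range(max_cycles):
        observations = []
        for task in benchmark.get_tasks(split="train", limit=batch_size):
            trajectory = agent.solve(task)
            feedback = benchmark.evaluate(task, trajectory)
            observations.append(Observation(task, trajectory, feedback))
        step_result = engine.step(workspace, observations, history, cycle)
        history.append(step_result)
        agent.reload_from_fs()  # pick up prompt, skill, and memory changes
    return history
```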

After the run is complete, we check the scores again.

The Results Are In: Did Our Agent Get Smarter?

This is the moment of truth. We reload the agent one last time (so it picks up all the final changes) and evaluate it on the same train and holdout sets.

# Reload the agent to get the final, evolved version
agent.reload_from_fs()

final_train_score, _ = evaluate_split(agent, benchmark, split="train")
final_holdout_score, _ = evaluate_split(agent, benchmark, split="holdout")

print("\n" + "=" * 110)
print("COMPARISON")
print("=" * 110)
print(f"Train : {baseline_train_score:.3f} -> {final_train_score:.3f}")
print(f"Holdout : {baseline_holdout_score:.3f} -> {final_holdout_score:.3f}")

What we typically see is a dramatic improvement. For instance, the training score might jump from around 0.25 to a perfect 1.0. The holdout score also improves significantly, which suggests that the hardened prompt and new skills generalize to inputs the agent never practiced on, rather than just fitting the training examples.

If we peek inside the agent's workspace, we'll see the proof:

  • The prompts/system.md file now has the strict output contract appended to it.
  • The skills directory is filled with new files, one for each rule the agent struggled with.
  • The memory/episodic.jsonl file has a log of the failures from each cycle.
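
A quick way to check those three artifacts from the notebook is sketched below. It assumes the file layout named in the list above; `summarize_workspace` is a hypothetical helper, and the skill listing simply reports whatever entries exist under `skills/`.

```python
import json
from pathlib import Path

def summarize_workspace(root: Path) -> dict:
    """Collect the evolved prompt, skill names, and memory entries from disk."""
    prompt_file = root / "prompts" / "system.md"
    skills_dir = root / "skills"
    memory_file = root / "memory" / "episodic.jsonl"

    prompt = prompt_file.read_text() if prompt_file.exists() else ""
    skills = sorted(p.name for p in skills_dir.iterdir()) if skills_dir.exists() else []
    memories = []
    if memory_file.exists():
        for line in memory_file.read_text().splitlines():
            if line.strip():  # skip blank lines in the JSONL file
                memories.append(json.loads(line))
    return {
        "has_contract": "STRICT OUTPUT CONTRACT" in prompt,
        "skills": skills,
        "memory_count": len(memories),
    }
```

Calling `summarize_workspace(WORKSPACE_ROOT)` before and after the run gives a compact before-and-after view of what the engine changed.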

The agent literally evolved. It started with a single, simple instruction and, through a process of trial, error, and guided mutation, it acquired a whole set of new skills and memories that made it far more capable.

This is what's so powerful about a framework like A-Evolve. It turns the fuzzy art of prompt engineering into a repeatable, measurable, and automated process. You're not just hoping the agent gets better; you're building a system that measures whether it does and keeps nudging it in the right direction. It's a whole new way to think about building reliable AI.
