This AI Can Literally Engineer and Improve Other AIs While You Sleep

Akram Chauhan

If you’ve spent any time building with AI agents, you know the grind. I’m talking about the endless, soul-crushing loop of prompt-tuning. You write a system prompt, run a test, see it fail in a new and creative way, tweak the prompt, maybe add a new tool, and run it all over again. It’s like a digital Groundhog Day, and after the 50th iteration, you start to question your life choices.

It’s the necessary grunt work of AI engineering, but let’s be honest—it’s not the fun part.

But what if you didn't have to do it? What if you could just… delegate it? Not to a junior engineer, but to another AI. That’s the wild, slightly unsettling, and brilliant idea behind a new open-source library called AutoAgent. Created by Kevin Gu, this tool proposes a simple solution: let one AI engineer and tune another agent overnight.

And the crazy part? It works. In a 24-hour, hands-off run, AutoAgent built an agent that hit #1 on SpreadsheetBench with a 96.5% score and achieved the top GPT-4 score on TerminalBench at 55.1%. It’s not just a cool concept; it’s beating human-tuned agents at their own game.

So, What Is AutoAgent, Really?

Think of it this way: automated testing tells you whether your software works. AutoAgent goes a step further and uses the test result to decide what to change next. It’s a framework for self-improvement.

The project’s description nails it: it’s “like autoresearch but for agent engineering.” If you’re not familiar, Andrej Karpathy’s concept of autoresearch was about automating machine learning research. The idea was to have a program that could propose a change to a model, train it, evaluate the results, and only keep the change if it made things better. It’s a simple, relentless loop of improvement.

AutoAgent takes that exact same "keep what works" logic and applies it not to model weights, but to the agent harness.

What’s a harness? It’s all the scaffolding we build around a powerful language model to make it useful. It’s the system prompt that gives it a personality and instructions, the tools it can use (like a calculator or a web search), and the logic that decides how to handle a complex task. Most of us spend weeks, if not months, hand-crafting this scaffolding.
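To make "harness" a bit more concrete, here is a minimal sketch of the three pieces in Python. Every name in it (Harness, route, calculator) is made up for illustration and is not taken from AutoAgent's actual agent.py. The point is only that the prompt, the tools, and the routing logic are ordinary code that something else could edit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


# Hypothetical structure for illustration; not AutoAgent's real agent.py layout.
@dataclass
class Harness:
    system_prompt: str
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def route(self, task: str) -> str:
        """Toy orchestration logic: pick a tool whose name appears in the task."""
        for name in self.tools:
            if name in task.lower():
                return name
        return "llm"  # no tool matched, so hand the task to the model itself


def calculator(expression: str) -> str:
    # Only handles "a + b" so the sketch stays tiny and avoids eval().
    left, right = expression.split("+")
    return str(float(left) + float(right))


harness = Harness(
    system_prompt="You are a careful spreadsheet assistant.",
    tools={"calculator": calculator},
)

print(harness.route("Use the calculator to add two cells"))
print(harness.tools["calculator"]("2 + 3"))
```

A real harness is much bigger, of course, but it has the same shape: a prompt, a tool table, and some glue logic.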

AutoAgent automates that entire process. It lets an AI tweak the harness, test it, and improve it, all on its own.

How It Works: Two AIs, One File, and a Simple Goal

When you peek inside the GitHub repo, you might be surprised by how simple the core structure is. It’s elegant, really, and boils down to a clear separation of duties between you and the AI.

1. The Agent Being Tested (agent.py): This is a single Python file that contains the entire agent harness: the configuration, tool definitions, prompts, everything. Think of this as the "student" agent that’s trying to get better at a task. The meta-agent will be rewriting this file over and over again.

2. The Human's Instructions (program.md): This is the only file you, the human, are supposed to touch. It’s a simple Markdown file where you write a directive. You don’t tell the AI how to build the agent; you just tell it what kind of agent to build. For example: "Build an agent that is an expert at working with spreadsheets." You’re the director setting the vision, not the programmer writing the code.

3. The "Meta-Agent" (The Brains of the Operation): This is the higher-level AI that does all the work. It reads your directive in program.md, looks at the current agent.py, and runs it against a benchmark. Then, it analyzes the failures and says, "Hmm, it looks like the prompt is confusing. Let me try rewriting it." It edits agent.py, runs the test again, and checks the score.

If the score improves, great! It keeps the change. If the score gets worse, it throws the change away and tries something else. It repeats this loop for as long as the run lasts, constantly "hill-climbing" toward a better score. To keep track of everything, it maintains a log file (results.tsv), which acts as its memory of what’s worked and what hasn’t.
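Here is a rough sketch of that keep-what-works loop. It is a toy, not AutoAgent's code: the "harness" is just a prompt string, the "meta-agent" appends a random word, and the "benchmark" counts keywords. The names hill_climb, toy_propose, and toy_evaluate are assumptions, and so are the log columns. An in-memory buffer stands in for results.tsv.

```python
import csv
import io
import random
from typing import Callable, TextIO, Tuple


def hill_climb(
    harness: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    steps: int,
    log: TextIO,
    seed: int = 0,
) -> Tuple[str, float]:
    """Greedy hill-climbing: keep a candidate only if it beats the best score."""
    rng = random.Random(seed)
    writer = csv.writer(log, delimiter="\t", lineterminator="\n")
    writer.writerow(["step", "score", "kept"])  # assumed columns, for illustration
    best, best_score = harness, evaluate(harness)
    for step in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        kept = score > best_score
        writer.writerow([step, f"{score:.3f}", int(kept)])
        if kept:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for the benchmark and the meta-agent's edits.
REQUIRED = ["formulas", "units", "json"]
DISTRACTORS = ["emoji", "poetry"]


def toy_evaluate(prompt: str) -> float:
    hits = sum(word in prompt for word in REQUIRED)
    penalty = sum(word in prompt for word in DISTRACTORS)
    return max(0.0, (hits - penalty) / len(REQUIRED))


def toy_propose(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(REQUIRED + DISTRACTORS)


log = io.StringIO()  # stands in for results.tsv
start = "You are an agent."
best, best_score = hill_climb(start, toy_propose, toy_evaluate, steps=200, log=log)
print(best_score)
```

Because a change is kept only when the score strictly improves, the best score can never go down during a run. That is also the main weakness of plain hill-climbing: it can get stuck on a plateau where no single edit helps.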

The Secret Sauce: A Clear Way to Keep Score

This whole self-improvement loop only works if the AI has a clear, objective way to know if it's getting better. That's where the benchmark system comes in.

AutoAgent uses the Harbor format for its tasks, which is just a standardized way to package up a problem. Each task is a folder containing:

  • An instruction file (what the agent is supposed to do).
  • A test script that runs the agent's output and determines if it's correct.
  • Any necessary files or a Docker environment to run in.

The magic is in that test script. At the end of its run, it just has to write a single number—a score between 0.0 and 1.0—to a log file. That's it. That simple number is the "reward" the meta-agent is chasing.
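As a sketch of what such a test script might boil down to, assume a task where the agent has to reproduce a column of spreadsheet values. The function names and the reward.txt file name below are placeholders I picked for illustration; the real Harbor layout may name things differently.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def score_output(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of expected values the agent reproduced exactly, in [0.0, 1.0]."""
    if not expected:
        return 0.0
    # zip() stops at the shorter sequence, so missing values count as misses.
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / len(expected)


def write_score(score: float, log_path: Path) -> None:
    """Write the single number the meta-agent optimizes against."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    log_path.write_text(f"{score}\n")


expected = ["42", "17", "8", "99"]
actual = ["42", "17", "0", "99"]  # the agent got one cell wrong

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "reward.txt"  # hypothetical file name
    write_score(score_output(expected, actual), log_path)
    logged = log_path.read_text().strip()

print(logged)  # 0.75
```

Partial credit like this gives the meta-agent a smoother signal than a bare pass/fail, since fixing one more cell still moves the number.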

One of the coolest parts of this is the support for "LLM-as-judge." For a lot of agent tasks, there isn't one single "right" answer. A simple text match won't work. Instead, the test script can actually call another LLM and ask it, "Hey, does this output successfully accomplish the goal?" This allows for much more flexible and nuanced evaluations, which is critical for complex, real-world tasks.
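A judge-based scorer can be sketched the same way. To avoid guessing at any particular provider's API, the model call below is just a function you pass in, and fake_llm is a stub standing in for it. The prompt wording and the YES/NO convention are my own assumptions, not something AutoAgent prescribes.

```python
from typing import Callable

JUDGE_PROMPT = (
    "Goal: {goal}\n"
    "Agent output: {output}\n"
    "Does the output accomplish the goal? Answer YES or NO."
)


def judge_score(goal: str, output: str, ask_llm: Callable[[str], str]) -> float:
    """Turn a judge model's verdict into the 0.0 or 1.0 score the loop expects."""
    verdict = ask_llm(JUDGE_PROMPT.format(goal=goal, output=output))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def fake_llm(prompt: str) -> str:
    # Stub judge: a real test script would call an actual model here.
    output_line = next(
        line for line in prompt.splitlines() if line.startswith("Agent output:")
    )
    return "Yes, it does." if "refund" in output_line.lower() else "No."


goal = "Tell the customer how to get their money back"
print(judge_score(goal, "You can request a refund under Orders.", fake_llm))  # 1.0
print(judge_score(goal, "Have a nice day!", fake_llm))  # 0.0
```

One caveat worth keeping in mind: the meta-agent optimizes whatever the judge rewards, so a sloppy judge prompt can get gamed just like a sloppy unit test.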

Why This Is a Bigger Deal Than It Looks

Okay, so it’s a clever bit of engineering. But let’s zoom out. What does this actually mean for those of us building in this space? I think there are a few huge takeaways here.

  • Autonomous engineering is here. The proof is in the pudding. AutoAgent didn't just work in theory; it topped leaderboards on established benchmarks. That is strong evidence that, when a task comes with a reliable score, a meta-agent can take over much of the tedious human-in-the-loop work of optimizing agent performance.
  • Our jobs are changing, not disappearing. This is a perfect example of how AI shifts our role. We move from being the person on the assembly line (tweaking prompts) to being the factory manager (defining the goals and benchmarks). Your value is no longer in your ability to write the perfect prompt but in your ability to design a great problem for the AI to solve.
  • "Model Empathy" might be a real thing. One of the most fascinating observations from the project was that a Claude-powered meta-agent seemed to be better at diagnosing and fixing a Claude-based task agent. This suggests that models from the same "family" might have an intuitive understanding of each other's failure modes. It's a strange and interesting idea that could influence how we design these systems in the future.
  • It’s incredibly versatile. Because it’s all built on Docker and a standard task format, you can point AutoAgent at pretty much any problem you can dream up, as long as you can create a script that spits out a score. Financial analysis, code generation, customer support routing—if you can score it, you can automate its improvement.

AutoAgent feels like a glimpse into the future of AI development. We’re not just building applications anymore; we’re building systems that build and refine themselves. The human role becomes less about the tedious, iterative work and more about providing the high-level direction and creative vision.

And honestly, I'm all for it. I’d much rather spend my afternoon designing a clever new benchmark than spending it trying to figure out why my agent added a smiley face to a JSON output for the 100th time.



© 2026 Aicosoft. All rights reserved.