Let's be honest for a second. Building and testing AI agents that can actually do things in the real world is… well, it’s messy. You can have an agent that seems brilliant in a demo, but then it completely falls apart when faced with a real developer's command line.
How do you really know if your agent is getting better? How can you compare it to someone else’s? It often feels like we're trying to time a race with a sundial.
Well, it looks like things are about to get a lot more rigorous. The team behind Terminal-Bench, a popular test for these developer-focused AI agents, just dropped a major one-two punch: Terminal-Bench 2.0 and a brand-new framework called Harbor.
This isn't just another small update. It’s a genuine attempt to bring some standardization and scalability to a field that desperately needs it. If you’re building, researching, or just curious about where AI agents are headed, you’ll want to pay attention to this.
Why the Old Way Was Starting to Crack
You’ve probably heard of the original Terminal-Bench. After it launched back in May 2025, it quickly became the go-to obstacle course for AI agents. The idea was simple: give an agent a series of tasks that a developer would do in their terminal (the black screen with text that powers so much of the internet) and see how well it performs.
It was a huge step forward! But like any version 1.0, it had its quirks.
Think of it like the first season of a great TV show. The core idea was brilliant, but some of the plot lines were a bit shaky. The community found that some tasks were a little vague or unstable. For example, a task might rely on an external website or API that changed without warning, causing the test to fail for reasons that had nothing to do with the agent's intelligence.
A classic example was the download-youtube task. It was a cool idea, but its success depended entirely on third-party tools that are notoriously unstable. An agent could do everything right and still fail. That’s not a good test.
A Higher Bar and Cleaner Data: What’s New in Terminal-Bench 2.0
So, the team went back to the drawing board. Terminal-Bench 2.0 isn't just a patch; it's a complete overhaul focused on quality.
The new version features 89 tasks, and each one has been put through the wringer. We’re talking hours of manual testing per task, plus LLM-assisted validation. The new philosophy is all about making tasks that are:
- Solvable: There’s a clear, achievable path to success. No more trick questions.
- Realistic: These are things a human developer might actually do.
- Clearly Specified: The instructions are crystal clear, so the agent isn’t failing because of a misunderstanding.
The result? The whole benchmark is harder, but it's also fairer. It’s a much more reliable signal of an agent's true capabilities.
Alex Shaw, one of the co-creators, pointed out something interesting on X (formerly Twitter). He said that the top-performing agents on the new, harder benchmark are getting scores similar to what they got on the old one. "We believe this is because task quality is substantially higher in the new benchmark," he explained.
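That’s a fantastic insight. If the tasks got harder and the scores barely moved, the likeliest explanation is that the old scores were being dragged down by vague or unreliable tasks that agents failed for reasons unrelated to skill. With those cleaned up, the extra difficulty is roughly offset, and we’re getting a much more honest picture.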
Meet Harbor: The Engine Room for Agent Testing
Okay, so we have a better, tougher obstacle course. But how do you run your agents through it, especially if you want to run thousands of tests? Doing that on your laptop would take forever.
That’s where Harbor comes in.
If Terminal-Bench is the racetrack, Harbor is the high-tech pit crew, garage, and logistics team all rolled into one. It’s a framework designed specifically to run these agent evaluations in containerized environments, at scale.
"Harbor is the package we wish we had had while making Terminal-Bench," Shaw wrote. You can feel the relief in that statement. It was born from their own pain of trying to manage tens of thousands of test runs while building the new benchmark.
Here’s what Harbor brings to the table:
- Test Any Agent: As long as you can install your agent in a container (which is pretty standard), you can test it with Harbor.
- Scale Like Crazy: It’s built to work with cloud providers like Daytona and Modal, so you can spin up thousands of containers and get results fast.
- More Than Just Testing: You can use it to scale up your training pipelines, whether you’re doing supervised fine-tuning (SFT) or reinforcement learning (RL).
- Build Your Own: You can even use it to create and deploy your own custom benchmarks.
This is a huge deal. It democratizes large-scale agent evaluation. You no longer need massive, custom-built internal infrastructure to rigorously test your models.
The First Results Are In: Who's on Top?
With a new benchmark comes a new leaderboard, and the initial results are fascinating.
Right now, OpenAI’s Codex CLI running GPT-5 is sitting at the top with a 49.6% success rate. It’s the only agent to get close to solving half the tasks.
Here's a quick look at the top 5:
- Codex CLI (GPT-5) — 49.6%
- Codex CLI (GPT-5-Codex) — 44.3%
- OpenHands (GPT-5) — 43.8%
- Terminus 2 (GPT-5-Codex) — 43.4%
- Terminus 2 (Claude Sonnet 4.5) — 42.8%
What I find most interesting here is that nobody is crushing it. The best agent is still failing more than it succeeds. This tells us that Terminal-Bench 2.0 is a genuinely challenging test for even the most advanced models out there. There's a ton of room for improvement, and the competition between the top models from OpenAI and Anthropic is clearly heating up.
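To make those percentages a bit more concrete, here is a rough back-of-the-envelope sketch in Python that converts each success rate into an approximate number of tasks out of the 89 in the benchmark. The percentages and the task count come from the figures quoted above; the function names are my own, and the conversion is only an approximation, since a leaderboard rate is presumably averaged over several attempts rather than being a clean count of solved tasks.

```python
# Figures quoted in this article; everything else here is illustrative.
TOTAL_TASKS = 89

LEADERBOARD = [
    ("Codex CLI (GPT-5)", 49.6),
    ("Codex CLI (GPT-5-Codex)", 44.3),
    ("OpenHands (GPT-5)", 43.8),
    ("Terminus 2 (GPT-5-Codex)", 43.4),
    ("Terminus 2 (Claude Sonnet 4.5)", 42.8),
]


def approx_tasks_solved(success_rate_pct, total_tasks=TOTAL_TASKS):
    """Convert a percentage success rate into a rounded task count."""
    return round(success_rate_pct / 100 * total_tasks)


def summarize(leaderboard, total_tasks=TOTAL_TASKS):
    """Return (name, rate, approx solved, approx unsolved) for each entry."""
    rows = []
    for name, pct in leaderboard:
        solved = approx_tasks_solved(pct, total_tasks)
        rows.append((name, pct, solved, total_tasks - solved))
    return rows


if __name__ == "__main__":
    for name, pct, solved, unsolved in summarize(LEADERBOARD):
        print(f"{name}: {pct}% is roughly {solved} solved, {unsolved} unsolved")
```

Seen this way, the leader is solving the equivalent of about 44 of 89 tasks, and the whole top 5 sits within roughly six tasks of each other.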
How You Can Get Involved and Start Testing
The best part about all this is that it’s open for everyone. The team wants you to put your own agents to the test.
Getting started is pretty straightforward. You just install Harbor and run a simple command from your terminal:
harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
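If you would rather kick this off from a script than type it by hand, here is a minimal Python sketch that assembles the same command as an argument list. It uses only the flags shown in the command above; the helper name `build_harbor_command` is my own invention, and the `<model>`, `<agent>`, and output path values are left as placeholders for you to fill in. The sketch only builds and prints the command; it does not run anything.

```python
import shlex


def build_harbor_command(model, agent, jobs_dir,
                         dataset="terminal-bench@2.0", n_attempts=5):
    """Assemble the `harbor run` invocation as a list of arguments.

    Only the flags that appear in the article's command are used here.
    """
    return [
        "harbor", "run",
        "-d", dataset,
        "-m", model,
        "-a", agent,
        "--n-attempts", str(n_attempts),
        "--jobs-dir", str(jobs_dir),
    ]


if __name__ == "__main__":
    # Placeholders copied from the article; replace them with real values.
    cmd = build_harbor_command("<model>", "<agent>", "<path/to/output>")
    # shlex.join quotes each argument so the line is safe to paste into a shell.
    print(shlex.join(cmd))
```

Once the placeholders are replaced and Harbor is installed, passing the list to `subprocess.run(cmd, check=True)` would launch the run. Keeping the arguments as a list avoids shell-quoting surprises with model or agent names.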
If you want to get your agent on the official public leaderboard, you’ll need to run the benchmark five times (which lines up with the --n-attempts 5 in the command above) and then email the developers your results for validation. It’s a simple process designed to get the community involved.
This move toward standardization is exactly what the field of AI agents needs right now. As these models get integrated into more of our tools, we have to move beyond flashy demos and toward reproducible, tough, and fair evaluations.
Terminal-Bench 2.0 and Harbor feel like a massive step in that direction. They provide a common ground where developers and researchers can measure progress, identify weaknesses, and ultimately build agents that are not just clever, but genuinely useful and reliable. It’s the foundation for the next wave of agentic AI, and I, for one, can't wait to see what gets built on top of it.




