Gelato-30B-A3B: The New AI Model That Finally Knows Where to Click

Akram Chauhan

Have you ever watched an AI agent try to navigate a computer screen and felt a little… underwhelmed? You give it a simple instruction, like "click the save icon," and it just fumbles, clicking somewhere nearby or getting confused by a pop-up. It's a common problem, and honestly, it’s one of the biggest hurdles holding back truly useful AI assistants.

If an AI can’t reliably interact with the same buttons and menus we use every day, how can we ever trust it with complex tasks?

Well, a team from ML Foundations just dropped something that might be the answer to our collective sighs. It’s a new model called Gelato-30B-A3B, and its one job is to be incredibly good at figuring out exactly where to click on a screen. Think of it less as a general-purpose AI and more as a highly-trained specialist—a sniper for your computer's interface. And from the looks of it, it's a crack shot.

So, How Does Gelato Fit Into the AI Puzzle?

To really get what makes Gelato special, you have to understand how these AI "agents" are typically built. It’s not usually one giant AI trying to do everything at once. Instead, it’s more like a team.

You have the Planner, the big-picture thinker; in the team's experiments, that role was played by GPT-5. You tell it, "I need to crop this photo and email it to my boss." The planner breaks that down into a sequence of steps:

  1. Open the photo editing app.
  2. Find and click the "crop" tool.
  3. Select the area to crop.
  4. Click "apply."
  5. Save the file.
  6. Open the email client.
  7. …and so on.

But the planner doesn't actually do the clicking. That’s where the Grounding Model comes in. Its job is to take a step from the planner, like "Find and click the 'crop' tool," look at a screenshot of the current screen, and translate that instruction into a precise set of (X, Y) coordinates.

This is where Gelato lives. It’s a plug-and-play grounding model. This separation of duties is brilliant because it makes the whole system more flexible. The user interface for Photoshop on a Mac looks different from the one on Windows, but the high-level plan ("crop the photo") is the same. With a specialized model handling the grounding, the planner doesn't have to worry about the messy details of different UIs.

Gelato itself is a 31-billion-parameter model, built on top of the Qwen3-VL vision-language model. You feed it a screenshot and a short text instruction, and it returns the coordinates of the spot to click. Simple, focused, and effective.
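To make the planner/grounder split concrete, here is a minimal Python sketch. Every name in it (Agent, toy_planner, toy_grounder) is hypothetical and stands in for a real model; it is not the team's code or API. It only shows the shape of the hand-off: the planner produces text steps, and the grounding model turns each step plus a screenshot into a click point.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical; they only illustrate the division of labor.
Click = Tuple[int, int]  # (x, y) in screenshot pixels


@dataclass
class Agent:
    plan: Callable[[str], List[str]]       # planner: task -> ordered step instructions
    ground: Callable[[bytes, str], Click]  # grounding model: (screenshot, step) -> click point

    def run(self, task: str, take_screenshot: Callable[[], bytes]) -> List[Tuple[str, Click]]:
        actions = []
        for step in self.plan(task):
            # Capture the screen again for every step, since it changes after each action.
            screenshot = take_screenshot()
            # This sketch records the predicted click instead of performing it.
            actions.append((step, self.ground(screenshot, step)))
        return actions


# Stand-ins so the sketch runs without any model.
def toy_planner(task: str) -> List[str]:
    return ["Open the photo editing app.", "Find and click the 'crop' tool."]


def toy_grounder(screenshot: bytes, step: str) -> Click:
    return (120, 48) if "crop" in step else (32, 700)


agent = Agent(plan=toy_planner, ground=toy_grounder)
actions = agent.run("Crop this photo", take_screenshot=lambda: b"")
for step, click in actions:
    print(step, "->", click)
```

Because the agent only depends on the two callables, swapping in a different grounding model means replacing one function while the planner stays untouched.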

The Secret Sauce: A Super-Clean Dataset Called "Click 100k"

You've heard it a million times: an AI is only as good as the data it's trained on. This is where the Gelato team really did their homework. They didn't just scrape a bunch of random screenshots; they painstakingly built a new dataset called Click 100k.

Think of it like creating the ultimate cookbook for a chef. You don't just want a million recipes; you want the best, most reliable, and clearest recipes.

Here’s how they did it:

  1. Gathering the Ingredients: They started by pulling together a bunch of public datasets about computer UIs—things like ShowUI, AutoGUI, and SeeClick. They even analyzed over 85 video tutorials of people using professional software to capture real-world examples.

  2. The Quality Control Gauntlet: This is the most important part. They didn't just dump all this data into the training mix. Instead, they ran it through an aggressive filtering pipeline that used other AIs as quality inspectors.

    • An AI called OmniParser threw out any examples where the click didn't even land on a real UI element (like a button or link).
    • Other models, like Qwen2.5-7B-VL and UI-Venus-7B, got rid of examples that were either too easy ("click the giant 'Login' button in the middle of the screen") or where the instruction didn't actually match the click area.

This obsessive cleaning process made a huge difference. When they tested a baseline model, they found that training it on a small, balanced subset of this filtered data gave it a 9-point accuracy boost compared to training on the raw, unfiltered mess. It's a perfect example of quality over quantity.
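The filtering stages described above can be sketched as a chain of keep-or-drop checks. The Python below is only an illustration: the three check functions are simple hypothetical stand-ins for the model-based judges (OmniParser, Qwen2.5-7B-VL, UI-Venus-7B), and the area threshold is an invented number, not one from the team's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Sample:
    instruction: str
    target: Box


# Each check returns True when the sample should be KEPT.
Check = Callable[[Sample], bool]


def filter_samples(samples: Iterable[Sample], checks: List[Check]) -> List[Sample]:
    # A sample survives only if every check accepts it.
    return [s for s in samples if all(check(s) for check in checks)]


def lands_on_element(s: Sample) -> bool:
    # Stand-in for the element detector: here, just require a box with positive area.
    left, top, right, bottom = s.target
    return right > left and bottom > top


def not_too_easy(s: Sample) -> bool:
    # Stand-in for the difficulty filter: here, reject very large targets.
    left, top, right, bottom = s.target
    return (right - left) * (bottom - top) < 200_000  # invented threshold


def instruction_matches(s: Sample) -> bool:
    # Stand-in for the instruction/target alignment judge: here, require non-empty text.
    return bool(s.instruction.strip())


raw = [
    Sample("click the save icon", (10, 10, 34, 34)),
    Sample("click the giant Login button", (0, 0, 1000, 600)),
    Sample("", (5, 5, 20, 20)),
    Sample("open settings", (50, 50, 50, 50)),
]
kept = filter_samples(raw, [lands_on_element, not_too_easy, instruction_matches])
print([s.instruction for s in kept])
```

In a real pipeline each check would call a model, but the structure stays the same: independent judges, and a sample is kept only when all of them agree.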

Teaching an AI to Click With a Simple Reward System

Once they had their pristine dataset, it was time for training. The team used a reinforcement learning technique called GRPO (Group Relative Policy Optimization). If "reinforcement learning" sounds complicated, just think of it like training a puppy.

You give the puppy a command ("sit"). If it sits, it gets a treat. If it does anything else, it gets nothing. There's no partial credit for almost sitting.

That’s basically how they trained Gelato. The model would be shown a screen and an instruction, and it would predict a click location.

  • If the predicted click fell anywhere inside the correct bounding box (the "target"), it got a reward.
  • If it missed—even by a single pixel—it got nothing.
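The all-or-nothing rule in the two bullets above fits in a few lines of Python. This is a sketch, not the team's training code: the function name is hypothetical, and treating the box edges as "inside" is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def click_reward(x: float, y: float, box: Box) -> float:
    """Return 1.0 if the click lands in the target box, otherwise 0.0."""
    left, top, right, bottom = box
    # Assumption: points exactly on the box edge count as hits.
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0


target = (100, 200, 140, 240)
hit = click_reward(120, 220, target)   # inside the box
miss = click_reward(141, 220, target)  # one pixel past the right edge
print(hit, miss)
```

Note that the reward says nothing about how far off a miss was; a click one pixel outside the box scores the same as a click on the other side of the screen.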

This "sparse reward" system forces the model to become incredibly precise. After 100 training cycles on a cluster of 32 powerful A100 GPUs, Gelato was consistently outperforming its predecessors.
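For readers curious about what GRPO does with those 0-or-1 rewards: in its commonly described form, the model makes several attempts at the same prompt, and each attempt is scored relative to the group's average. The Python sketch below shows that normalization step only. The article does not give the team's exact configuration, so the group size, the use of the population standard deviation, and the small epsilon are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each attempt relative to the other attempts at the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every attempt got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four attempts at one instruction: two hits, two misses.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every attempt misses (or every attempt hits), there is nothing to compare.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)
```

Hits end up with a positive advantage and misses with a negative one, so the model is nudged toward the attempts that landed inside the box. When all attempts score the same, the advantages are zero and that prompt contributes no learning signal.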

Putting It to the Test: The Results Speak for Themselves

So, does all this fancy training actually work? In a word, yes.

When benchmarked against other models, Gelato came out on top. On standard grounding benchmarks like ScreenSpot-Pro and OSWorld-G, it beat the previous state-of-the-art model, GTA1-32B. It even outperformed much larger models like the 235-billion-parameter Qwen3-VL-Instruct.

Here are the numbers that matter:

  • ScreenSpot-Pro: Gelato hit 63.88% accuracy.
  • OSWorld-G: It reached 69.15% accuracy.

But benchmarks are one thing; real-world performance is another. The team plugged Gelato into a full AI agent framework, with GPT-5 acting as the planner, and set it loose on the OSWorld environment, which simulates real computer tasks.

The results were clear: the agent powered by Gelato succeeded 58.71% of the time, edging out the GTA1-32B-powered agent's 56.97%. When humans reviewed the results to catch successes the automated scoring missed, Gelato's success rate jumped to 61.85%.
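To put those success rates side by side, here is the arithmetic as a small Python snippet. It uses only the three figures quoted above; the variable names are mine.

```python
gelato_auto = 58.71   # Gelato agent, automated scoring (%)
gta1_auto = 56.97     # GTA1-32B agent, automated scoring (%)
gelato_human = 61.85  # Gelato agent after human review (%)

# Absolute gap between the two agents, in percentage points.
gap_points = round(gelato_auto - gta1_auto, 2)

# The same gap expressed relative to the GTA1-32B agent's success rate.
relative_gain = round(100 * (gelato_auto - gta1_auto) / gta1_auto, 2)

# How much human review added on top of the automated score.
human_review_gain = round(gelato_human - gelato_auto, 2)

print(gap_points, relative_gain, human_review_gain)  # 1.74 3.05 3.14
```

So the Gelato-powered agent leads by 1.74 percentage points, roughly a 3% relative improvement, and human review recovered another 3.14 points of successes that the automated scoring had missed.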

It might not seem like a massive leap, but it proves a critical point: better grounding directly translates into more capable and reliable AI agents. Every percentage point of improvement in clicking the right button means fewer failed tasks and a more useful assistant.

Why This Is a Big Deal for All of Us

Look, Gelato-30B-A3B isn't going to change the world overnight. But it represents a really important and practical step forward. For years, we've been dazzled by what large language models can say. Now, the focus is shifting to what they can do.

Models like Gelato are the crucial link between language and action. They're the hands and eyes that allow the AI's "brain" to interact with the digital world we all live in. By making this component more accurate and reliable, we get one step closer to AI assistants that can genuinely help with our daily digital chores.

And the best part? The team has made the model and their code available on Hugging Face. This means other researchers and developers can start building with it, improving on it, and integrating this newfound precision into their own AI agents. It’s a clear new baseline for what a grounding model should be, and I, for one, am excited to see what people build with it.


© 2026 Aicosoft. All rights reserved.