Have you ever asked an AI to analyze a picture, only to have it completely miss the one tiny detail that mattered? Maybe it was a serial number on a tiny chip or a symbol on a complex blueprint. The AI gives you an answer, and it sounds so sure of itself, but it’s just… wrong. It took a single glance, missed the crucial part, and decided to guess.
We’ve all been there. It’s one of the most frustrating parts of working with today’s powerful AI models. They’re amazing at the big picture, but can stumble on the fine print.
Well, Google just rolled out something in Gemini 3 Flash that feels like a genuine fix for this. It’s called “Agentic Vision,” and it fundamentally changes how the AI “sees.” Instead of just taking a single, passive look at an image, it turns the process into an active investigation. Think of it less like a quick glance and more like a detective arriving at a scene, pulling out tools, and examining the evidence before drawing any conclusions.
So, What Exactly Is Agentic Vision?
At its core, Agentic Vision gives Gemini 3 Flash the ability to use tools (specifically, Python code) to interact with an image. The image is no longer just a static input. The model can now formulate a plan, run some code to manipulate or analyze the image, and then look at the resulting image before giving you an answer.
This is a huge deal. It means the AI isn't stuck with its first impression. If it needs a closer look, it can literally zoom in. If it needs to count things, it can draw on the image to keep track. It’s an active, cyclical process.
Google built this around a simple but powerful idea they call the “Think, Act, Observe” loop.
The Detective’s Playbook: Think, Act, Observe
Let’s break down how this actually works. It’s surprisingly intuitive.
- Think: First, Gemini analyzes your request and the image you provided. It then comes up with a multi-step plan. It’s not just jumping to an answer. It might decide, "Okay, to answer this, I first need to zoom into the top-right corner, then I need to read the text in that box, and finally, I'll count the items in the bottom-left."
- Act: This is where the magic happens. The model generates and executes Python code to carry out its plan. This isn't just theoretical; it’s actually running code to do things like crop a specific section, rotate the image for a better view, or even run calculations on data it extracts.
- Observe: The newly transformed images (the zoomed-in crops, the annotated versions) are fed back into the model's context. It’s like adding new evidence to the case file. The model then re-examines everything with this new, more detailed information and finally gives you a much more grounded, accurate answer.
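To make the loop concrete, here's a toy sketch in Python. It is not Google's implementation: `Step`, `toy_model`, `crop`, and `run_loop` are made-up names, the "model" is a scripted stand-in, and the "image" is just a small grid of characters. The only thing it tries to show is the control flow: plan, run a tool, put the result back into context, repeat.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str        # "act" (run a tool) or "answer" (finish)
    payload: object  # a crop box for "act", the final text for "answer"


def toy_model(context):
    """Scripted stand-in for the Think step (hypothetical, not a real model)."""
    # With only the original image in context, ask for a closer look.
    if len(context) == 1:
        return Step("act", (2, 0, 4, 2))  # (left, top, right, bottom)
    # Otherwise read the most recent observation and answer from it.
    latest = context[-1]
    return Step("answer", "".join("".join(row) for row in latest))


def crop(grid, box):
    """The only tool in this sketch: cut a rectangle out of the grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in grid[top:bottom]]


def run_loop(image, model, max_steps=5):
    context = [image]
    for _ in range(max_steps):
        step = model(context)                          # Think
        if step.kind == "answer":
            return step.payload, context
        observation = crop(context[0], step.payload)   # Act
        context.append(observation)                    # Observe
    raise RuntimeError("no answer within max_steps")


# A 4x3 "image" whose interesting detail sits in the top-right corner.
image = [
    list("..A7"),
    list("..B9"),
    list("...."),
]
answer, context = run_loop(image, toy_model)
print(answer)
```

The `max_steps` cap is a design choice of this sketch, there so a confused planner can't loop forever.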
The result? Google is reporting a 5–10% quality boost across most of their vision benchmarks. In the world of AI, that’s a pretty significant leap, especially for real-world, production-level tasks.
Getting a Closer Look at Massive Blueprints
Okay, that’s the theory. But what does this look like in practice?
One of the coolest examples Google shared involves a company called PlanCheckSolver.com. They use AI to validate massive, high-resolution building plans against complex building codes. As you can imagine, the devil is in the details—a tiny symbol or a measurement tucked away in a corner can mean the difference between compliance and a costly mistake.
Before, an AI might have to look at a down-sampled, blurry version of the plan and just do its best. But with Agentic Vision, Gemini can generate Python code to crop and zoom in on specific patches of the plan, like a roof edge or a structural section.
These high-res crops are treated as new images and fed back to the model. Now, instead of squinting at a giant, blurry map, the AI is looking at a series of crystal-clear close-ups. Based on this much better evidence, it can accurately check for compliance. The result for PlanCheckSolver? A solid 5% jump in accuracy. That’s a real-world win.
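What might that crop step look like? Here's a rough sketch of the arithmetic, not the code Gemini actually writes. It assumes the region of interest arrives as fractions of the image size in (left, top, right, bottom) order; `crop_box_pixels`, the `pad` parameter, and the plan dimensions are all invented for illustration.

```python
import math


def crop_box_pixels(frac_box, width, height, pad=0.0):
    """Turn a fractional (left, top, right, bottom) box into pixel coordinates.

    `pad` grows the box on every side by that fraction of the full image,
    so a symbol sitting right on the edge of the region isn't cut off.
    """
    left, top, right, bottom = frac_box
    px_left = max(0, math.floor((left - pad) * width))
    px_top = max(0, math.floor((top - pad) * height))
    px_right = min(width, math.ceil((right + pad) * width))
    px_bottom = min(height, math.ceil((bottom + pad) * height))
    if px_right <= px_left or px_bottom <= px_top:
        raise ValueError("crop box is empty")
    return (px_left, px_top, px_right, px_bottom)


# Hypothetical 8000x4000 plan; the region of interest is a patch on the right.
tight = crop_box_pixels((0.5, 0.25, 0.75, 0.5), 8000, 4000)
padded = crop_box_pixels((0.5, 0.25, 0.75, 0.5), 8000, 4000, pad=0.125)
corner = crop_box_pixels((0.0, 0.0, 0.25, 0.25), 8000, 4000, pad=0.125)
print(tight, padded, corner)

# With Pillow installed, Image.open(path).crop(box) accepts this same
# (left, upper, right, lower) tuple and returns the full-resolution patch.
```

Clamping to the image bounds matters near the edges: the `corner` box would otherwise start at negative coordinates.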
Using the Image as a Visual Notepad
Another brilliant use case is turning the image into a kind of digital scratchpad.
Imagine you show the AI a picture of a hand and ask, "How many fingers are being held up?" A simple task for us, but for an AI, it can sometimes lead to double-counting or other silly errors.
With Agentic Vision, the model can execute a Python script to draw bounding boxes around each finger it detects and even add a numeric label on top of each one. This annotated image is then sent back for review. The AI isn't just guessing anymore; it’s looking at an image where it has clearly marked "1," "2," "3," "4," "5." The final count is based on its own methodical work, dramatically reducing the chance of a simple mistake.
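As a hedged illustration of that scratchpad idea, the sketch below numbers a set of boxes from left to right and, optionally, draws them with Pillow. The box coordinates are invented, `number_boxes` and `draw_annotations` are hypothetical helpers, and the script Gemini generates may look quite different.

```python
def number_boxes(boxes):
    """Sort (left, top, right, bottom) boxes left to right and label them 1..N."""
    ordered = sorted(boxes, key=lambda box: box[0])
    return [(index, box) for index, box in enumerate(ordered, start=1)]


def draw_annotations(image, labelled, color="red"):
    """Draw each box and its number onto a Pillow image (requires Pillow)."""
    # Imported lazily so the numbering logic above runs without Pillow.
    from PIL import ImageDraw

    draw = ImageDraw.Draw(image)
    for label, (left, top, right, bottom) in labelled:
        draw.rectangle((left, top, right, bottom), outline=color, width=3)
        draw.text((left + 4, top + 4), str(label), fill=color)
    return image


# Invented detections for three raised fingers, in no particular order.
detected = [(300, 40, 360, 200), (100, 60, 160, 220), (200, 30, 260, 210)]
labelled = number_boxes(detected)
count = len(labelled)
print(count, labelled)
```

Because every box carries exactly one number, a duplicate or a missed finger shows up in the annotated image instead of hiding inside a guess. To render it, pass a Pillow image and `labelled` to `draw_annotations`.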
Finally, an AI That's Good at Visual Math
Let's be honest: large language models can be notoriously bad at math, especially when they have to pull numbers out of a chart or table in an image. They often "hallucinate" or just make things up.
Agentic Vision tackles this by delegating the work. It lets the AI do what it’s good at (perception and planning) and offloads the heavy lifting to a tool that’s perfect for the job (Python).
In one of Google’s demos, Gemini is shown a dense table of performance metrics in a screenshot. Here’s the workflow:
- Gemini looks at the image and successfully parses the table, identifying all the raw numbers it needs.
- Instead of trying to do the math itself, it writes a Python script using Matplotlib (a popular charting library).
- The script normalizes the data and generates a clean, simple bar chart to visualize the relative performance.
The final answer you get isn't just a number the AI pulled out of thin air. It’s an answer grounded in a computed result and a generated plot that you can see for yourself. It’s a clear, trustworthy separation of duties.
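Here's a small sketch of that split between reading the numbers and computing with them. The scores are invented, and scaling to the best score is only one way to "normalize"; the demo doesn't say which method Gemini used. `normalize_to_best` and `plot_bars` are hypothetical helper names.

```python
def normalize_to_best(scores):
    """Scale every score so the best one becomes 1.0."""
    best = max(scores.values())
    if best <= 0:
        raise ValueError("need at least one positive score")
    return {name: value / best for name, value in scores.items()}


def plot_bars(normalized, path):
    """Save a simple bar chart of the normalized scores (requires Matplotlib)."""
    # Imported lazily so the arithmetic above runs without Matplotlib.
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(normalized.keys()), list(normalized.values()))
    ax.set_ylabel("Score relative to best")
    fig.savefig(path)
    plt.close(fig)


# Invented numbers standing in for values parsed out of a screenshot.
scores = {"model_a": 50.0, "model_b": 75.0, "model_c": 100.0}
normalized = normalize_to_best(scores)
print(normalized)
```

Calling `plot_bars(normalized, "relative_performance.png")` writes the chart to disk. The division happens in Python either way, so the model never has to do mental arithmetic.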
How You Can Start Using Agentic Vision
The best part is that this isn't some far-off research project. It's available right now. If you're a developer, you can find it in:
- Google AI Studio: You can enable "Code Execution" under the Tools section in the Playground and start experimenting.
- Vertex AI: The same capability is available through the Gemini API on Vertex AI, which is aimed at more production-focused work.
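If you'd rather call the API directly, the sketch below builds a REST request with the code execution tool switched on, using only the standard library. Treat the details as assumptions to check against the current docs: the model ID `gemini-3-flash-preview` is a guess at the name, the endpoint shown is the Gemini Developer API rather than Vertex AI, and `build_request` and `send` are made-up helper names.

```python
import base64
import json
import urllib.request

API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "{model}:generateContent"
)
MODEL = "gemini-3-flash-preview"  # assumption: confirm the current model ID


def build_request(image_bytes, mime_type, prompt):
    """Build a generateContent body with an inline image and code execution."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                    {"text": prompt},
                ]
            }
        ],
        # Turning this tool on is what lets the model write and run Python.
        "tools": [{"code_execution": {}}],
    }


def send(body, api_key, model=MODEL):
    """POST the request and return the parsed JSON response."""
    request = urllib.request.Request(
        API_URL.format(model=model),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder bytes; in practice, read a real image file in binary mode.
body = build_request(
    b"\x89PNG placeholder",
    "image/png",
    "Zoom in and read the serial number on the chip.",
)
print(json.dumps(body["tools"]))
```

Nothing is sent until you call `send(body, api_key)` with your own key. Building the body separately keeps it easy to inspect before it leaves your machine.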
And for everyone else, Google is starting to roll this out in the main Gemini app. It’s another step toward making AI not just more powerful, but more reliable, methodical, and—dare I say—thoughtful. It’s an AI that knows when to take a second look.




