Aicosoft - AI & Technology News, Insights & Innovation

Have you ever tried to automate something on a website, only for your script to break the moment a button gets moved an inch? It’s a classic headache. Most web automation tools are incredibly fragile because they rely on the website's underlying code—the HTML and the DOM structure. If a developer changes a class name or refactors a layout, your carefully crafted bot falls apart.

But what if an AI could navigate the web like we do? By just looking at the page?

That's the fascinating idea behind vision-guided web agents. Instead of digging through code, these models take a screenshot of the page, look at it, and decide what to do next. It's a much more human and, frankly, more robust way of doing things.

Today, we're going to get our hands dirty and build one of these agents ourselves. We'll be using MolmoWeb-4B, an amazing open-source model from the Allen Institute for AI (Ai2). I’ll walk you through the entire process, from setting up the model to watching it reason its way through a multi-step task. Think of this less like a formal tutorial and more like we're in a workshop together, tinkering with some really cool tech.

Getting Our Workshop Ready: The Setup

First things first, we need to set up our environment. We'll be doing this in a Colab notebook, which is perfect because it gives us access to a free GPU.

The star of the show is, of course, MolmoWeb-4B. But a 4-billion-parameter model is a bit hefty. To get it running on a standard GPU without things catching fire, we’re going to use a clever trick called 4-bit quantization.

Think of it like this: imagine you have a super high-resolution photo, but it's too big to email. You could compress it into a smaller JPEG. It loses a tiny bit of detail, but it's way smaller and still looks great. Quantization does something similar for AI models—it shrinks the model's size so it fits into memory, with a minimal hit to performance. It’s what lets us run this powerful model on a GPU with as little as 6 GB of VRAM.
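To put rough numbers on that, here is a back-of-the-envelope sketch. It counts only the weights and takes "4B" at face value as 4 billion parameters, which is an assumption; the real parameter count and the overhead of the quantization format will shift the figures a little.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 4e9  # assumption: "4B" read literally as 4 billion parameters

fp16_gb = weight_memory_gb(PARAMS, 16)  # half precision: 2 bytes per weight
int4_gb = weight_memory_gb(PARAMS, 4)   # 4-bit quantization: half a byte per weight

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

The weights are not the whole story: activations, the image inputs, and the generation cache all need room too, which is why the practical VRAM figure sits above the raw weight size.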

After installing a few libraries (transformers, bitsandbytes for the quantization magic, and torch), loading the model takes just a few lines of code. Once that's done, the quantized weights sit on our GPU, ready for action.
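Here is a minimal sketch of what that loading step might look like. Treat the details as assumptions: the repo id "allenai/MolmoWeb-4B", the choice of AutoModelForImageTextToText, and the need for trust_remote_code are my guesses, so check the model card before relying on them. The BitsAndBytesConfig options themselves are standard transformers settings, and device_map="auto" additionally needs the accelerate package installed.

```python
MODEL_ID = "allenai/MolmoWeb-4B"  # assumed repo id; confirm on the model card


def quantization_settings() -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (4-bit NF4)."""
    return {
        "load_in_4bit": True,               # store weights in 4 bits
        "bnb_4bit_quant_type": "nf4",       # NormalFloat4 quantization
        "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
    }


def load_model(model_id: str = MODEL_ID):
    """Load the processor and the 4-bit model. Needs a CUDA GPU and network access."""
    # Imported here so the settings above can be inspected without a GPU stack.
    import torch
    from transformers import (
        AutoModelForImageTextToText,  # assumed model class for this checkpoint
        AutoProcessor,
        BitsAndBytesConfig,
    )

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # do the math in half precision
        **quantization_settings(),
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place the layers on the GPU
        trust_remote_code=True,
    )
    return processor, model
```

In a notebook you would call `processor, model = load_model()` once and reuse both objects for every step.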

How to Talk to a Web Agent: Prompts and Actions

So, how do we actually tell the AI what to do? We can't just say, "book me a flight." We need to give it instructions in a way it understands. This is where the prompt template comes in, and it's surprisingly simple and intuitive.

The prompt we send to MolmoWeb has three main parts:

The Goal: This is the main task we want it to accomplish. For example, "Go to arxiv.org and find the latest paper about Molmo from Ai2."
Previous Steps: This is the agent's memory. We list every action it has already taken, like a breadcrumb trail. For the first step, this is empty.
The Current Page: We tell it what page it's currently looking at, along with the screenshot of that page.

It's like giving instructions to a person. You give them a goal, remind them what they just did, and show them a picture of where they are right now.
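As a rough sketch, the three parts can be stitched together with a small helper. The labels and layout below are hypothetical, not MolmoWeb's official template, so use the model card's wording for real runs. The screenshot is not part of this string; it travels alongside the text as an image input.

```python
def build_prompt(goal: str, history: list[str], page_url: str) -> str:
    """Assemble the text half of the prompt: goal, previous steps, current page."""
    if history:
        # Number the earlier actions so the model can read them as a trail.
        steps = "\n".join(f"{i}. {action}" for i, action in enumerate(history, start=1))
    else:
        steps = "(none)"  # first step: nothing has happened yet
    return (
        f"GOAL: {goal}\n"
        f"PREVIOUS STEPS:\n{steps}\n"
        f"CURRENT PAGE: {page_url}\n"
        "Respond with a THOUGHT and an ACTION."
    )


first = build_prompt(
    "Go to arxiv.org and find the latest paper about Molmo from Ai2.",
    [],
    "about:blank",
)
print(first)
```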

Based on this prompt, the model generates a response with two parts: a THOUGHT and an ACTION.

THOUGHT: This is the model's internal monologue. It explains its reasoning, like "I need to go to Google to search for this" or "I see a login button, I should click it."
ACTION: This is the specific command it wants to execute.
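Splitting a response into those two parts is a small parsing job. This sketch assumes the model labels them exactly as THOUGHT: and ACTION:, in that order; real outputs may vary, so a production parser would want to be more forgiving.

```python
import re


def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into (thought, action)."""
    match = re.search(r"THOUGHT:\s*(.*?)\s*ACTION:\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response does not contain THOUGHT: ... ACTION: ...")
    return match.group(1).strip(), match.group(2).strip()


sample = (
    "THOUGHT: I see a login button, I should click it.\n"
    "ACTION: click(0.5, 0.5)"
)
thought, action = parse_response(sample)
print(thought)
print(action)
```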

And what can it do? The model has a simple set of tools, its "action space," which looks a lot like what a human can do in a browser:

goto("url") - Navigate to a website.
click(x, y) - Click on a specific spot on the page (using normalized coordinates, so x and y are fractions of the page's width and height rather than pixels).
type("text") - Type something into a form field.
scroll("down") - Scroll the page.
send_msg("answer") - Finish the task and give the final answer.
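Because every action looks like a Python function call with literal arguments, one way to turn the string into something a program can use is the standard ast module. This is a sketch under that assumption, not the model's official parser, and it never executes the text: it only reads the call name and its literal arguments.

```python
import ast

ALLOWED_ACTIONS = {"goto", "click", "type", "scroll", "send_msg"}


def parse_action(action: str) -> tuple[str, list]:
    """Turn 'click(0.55, 0.62)' into ('click', [0.55, 0.62]) without running it."""
    node = ast.parse(action.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    name = node.func.id
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    # literal_eval accepts only plain literals (strings, numbers, ...).
    args = [ast.literal_eval(arg) for arg in node.args]
    return name, args


print(parse_action('goto("https://arxiv.org")'))
print(parse_action("click(0.55, 0.62)"))
```

Rejecting anything outside the known action names is a cheap safety net: if the model ever emits something unexpected, the loop stops instead of guessing.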

This THOUGHT and ACTION structure is brilliant because it makes the agent's behavior understandable. We can literally read its mind before it acts.

The First Step: Staring at a Blank Page

Let's start with the simplest possible scenario. Our agent opens a new browser tab. What does it see? A blank white page: about:blank.

Our task is: "Go to arxiv.org and find the latest paper about Molmo from Ai2."

We build the prompt with this goal and show it a screenshot of a blank white page. What do you think it does? It's a "cold start" problem—there's nothing to click on.

When we run the model, it outputs something like this:

THOUGHT: I am currently on a blank page. To find a paper on arxiv.org, I first need to navigate to that website.
ACTION: goto("https://arxiv.org")

Perfect! It correctly reasoned that its first move has to be navigating to the right URL. No confusion, just a logical first step. This shows the model has a fundamental understanding of how to begin a task on the web.

Now, Let's Give It Something to See

Okay, navigating to a page is one thing. But can it interact with elements on a real page?

To test this, we won't use a real screenshot just yet. Instead, we'll create a synthetic one—a simple, fake Google search page we draw ourselves. It has a search bar and a couple of buttons.
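One way to draw such a page is with Pillow. The layout below is entirely made up for illustration (box positions, sizes, and element names are my assumptions), with each element stored as fractions of the page so it scales to any screenshot size.

```python
# Hypothetical layout: (left, top, right, bottom) as fractions of the page.
FAKE_GOOGLE_LAYOUT = {
    "search_bar": (0.25, 0.45, 0.75, 0.52),
    "lucky_button": (0.36, 0.58, 0.48, 0.66),
    "search_button": (0.50, 0.58, 0.62, 0.66),
}


def to_pixel_box(box, width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized box into pixel coordinates."""
    left, top, right, bottom = box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def draw_fake_google(width: int = 1280, height: int = 720):
    """Render the synthetic page as a PIL image. Requires Pillow."""
    from PIL import Image, ImageDraw  # imported lazily: only needed for drawing

    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    draw.text((round(0.47 * width), round(0.35 * height)), "Google", fill="black")
    for name, box in FAKE_GOOGLE_LAYOUT.items():
        pixel_box = to_pixel_box(box, width, height)
        draw.rectangle(pixel_box, outline="gray", width=2)
        # Label each element so the model has some text to read.
        draw.text((pixel_box[0] + 8, pixel_box[1] + 8), name, fill="black")
    return image
```

Calling `draw_fake_google().save("fake_google.png")` writes the image to disk so you can look at exactly what the model will see.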

Let's say the agent has already completed step one (going to Google). Now, its history contains goto("https://www.google.com"). The new task is: "Search Google for 'MolmoWeb Ai2 open source web agent'."

We show it the fake Google page and run the model. Here's what it comes up with:

THOUGHT: I am on the Google search page. I need to type the search query into the search bar and then press enter or click the search button. I will type the query into the prominent input field in the center of the page.
ACTION: type("MolmoWeb Ai2 open source web agent")

And then, in the next step, it would likely generate an action to click the search button, like click(0.55, 0.62).

The coolest part is that we can visualize this! We can take the (x, y) coordinates from the click action and draw a red circle right on the screenshot where the model wants to click. It's incredibly satisfying to see the AI pinpoint the exact button it needs to press, just from looking at the pixels.
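A small sketch of that visualization, assuming the coordinates are fractions between 0 and 1 measured from the top-left corner (an assumption consistent with values like 0.55 and 0.62, but worth confirming against the model card). The drawing part uses Pillow.

```python
def click_to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized click coordinates onto a screenshot of the given size."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"expected coordinates in [0, 1], got ({x}, {y})")
    return round(x * width), round(y * height)


def mark_click(image, x: float, y: float, radius: int = 12):
    """Return a copy of a PIL image with a red circle at the click target."""
    from PIL import ImageDraw  # imported lazily: only needed for drawing

    px, py = click_to_pixels(x, y, *image.size)  # image.size is (width, height)
    marked = image.copy()  # leave the original screenshot untouched
    ImageDraw.Draw(marked).ellipse(
        (px - radius, py - radius, px + radius, py + radius),
        outline="red",
        width=3,
    )
    return marked


print(click_to_pixels(0.5, 0.25, 1280, 720))
```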

Putting It All Together: A Multi-Step Journey

This is where the magic really happens. A web agent isn't useful if it can only perform one action. It needs to be able to string together a sequence of actions to complete a complex task.

Let's simulate a three-step journey with the goal: "Go to the Ai2 website and find information about MolmoWeb."

Step 1: The Blank Page

Current View: about:blank
Thought: I need to get to the Ai2 website first.
Action: goto("https://allenai.org")

We record this action in our history.

Step 2: The Ai2 Homepage

Current View: A synthetic screenshot of the Ai2 homepage with several links.
History: Contains the goto action from Step 1.
Thought: I'm on the Ai2 homepage. I see a link that says "MolmoWeb". That's probably what I'm looking for.
Action: click(0.15, 0.48) (coordinates pointing to the "MolmoWeb" link).

Awesome. It found the relevant link visually. We add this click to our history.

Step 3: The MolmoWeb Page

Current View: A fake blog post page about MolmoWeb.
History: Contains the goto and click actions.
Thought: I have found the page with information about MolmoWeb. The task is complete. I should report the information I found.
Action: send_msg("MolmoWeb is an open agent from Ai2 for automating web tasks. It achieves high scores on benchmarks like WebVoyager.")

And there you have it. In three simulated steps, our agent went from a blank page to finding the information and reporting the answer. By feeding its own actions back into the prompt as history, the model maintains context and works its way through a problem, step by step.
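The history-feeding loop is easy to sketch without a GPU by swapping the model for a scripted stand-in. Everything here is illustrative: scripted_model simply replays three canned responses modeled on the steps above, and the page list stands in for the synthetic screenshots.

```python
import re

# Canned responses standing in for real model output (hypothetical).
SCRIPTED = [
    'THOUGHT: I need to get to the Ai2 website first.\n'
    'ACTION: goto("https://allenai.org")',
    'THOUGHT: I see a link that says "MolmoWeb".\n'
    'ACTION: click(0.15, 0.48)',
    'THOUGHT: The task is complete.\n'
    'ACTION: send_msg("MolmoWeb is an open agent from Ai2 for automating web tasks.")',
]


def scripted_model(goal: str, history: list[str], page: str) -> str:
    """Pretend model: picks its answer by how many steps came before."""
    return SCRIPTED[len(history)]


def parse_response(text: str) -> tuple[str, str]:
    match = re.search(r"THOUGHT:\s*(.*?)\s*ACTION:\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("missing THOUGHT/ACTION")
    return match.group(1).strip(), match.group(2).strip()


def run_episode(goal: str, model_fn, pages: list[str], max_steps: int = 10) -> list[str]:
    """Run one step per page, feeding earlier actions back in as history."""
    history: list[str] = []
    for page in pages[:max_steps]:
        thought, action = parse_response(model_fn(goal, list(history), page))
        print(f"[{page}] {thought} -> {action}")
        history.append(action)  # this is the agent's memory for the next step
        if action.startswith("send_msg("):
            break  # the model has declared the task finished
    return history


history = run_episode(
    "Go to the Ai2 website and find information about MolmoWeb.",
    scripted_model,
    ["about:blank", "Ai2 homepage", "MolmoWeb blog post"],
)
```

Replacing scripted_model with a function that calls the real model is the only change needed to run the same loop for real.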

From Model Output to a Real-World Bot

In a real application, you wouldn't be using fake screenshots. You'd connect this model to a browser automation library like Playwright or Selenium. The loop would look like this:

Use Playwright to take a real screenshot of the current browser page.
Feed the screenshot and prompt to MolmoWeb.
Get the ACTION back from the model.
Parse the action (e.g., turn click(0.5, 0.5) into a real mouse click).
Use Playwright to execute that action in the browser.
Repeat until the model outputs send_msg().
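Here is one way those steps might look with Playwright's synchronous API. It is a sketch, not a finished bot: ask_model is a hypothetical callable you would write to wrap MolmoWeb, the 600-pixel scroll distance and the "up" direction are my own choices, and the viewport size is arbitrary.

```python
import ast


def parse_action(action: str) -> tuple[str, list]:
    """Read 'name(arg, ...)' with literal arguments, without executing it."""
    node = ast.parse(action.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    return node.func.id, [ast.literal_eval(arg) for arg in node.args]


def execute_action(page, action: str) -> bool:
    """Apply one action to a Playwright page. Returns True when the task is done."""
    name, args = parse_action(action)
    if name == "goto":
        page.goto(args[0])
    elif name == "click":
        size = page.viewport_size  # dict with "width" and "height"
        # Scale normalized coordinates up to real pixels.
        page.mouse.click(args[0] * size["width"], args[1] * size["height"])
    elif name == "type":
        page.keyboard.type(args[0])
    elif name == "scroll":
        page.mouse.wheel(0, 600 if args[0] == "down" else -600)
    elif name == "send_msg":
        return True
    else:
        raise ValueError(f"unknown action: {name!r}")
    return False


def run_agent(goal: str, ask_model, max_steps: int = 15) -> list[str]:
    """Screenshot, ask the model, act, repeat. Requires the playwright package."""
    from playwright.sync_api import sync_playwright  # imported lazily

    history: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        for _ in range(max_steps):
            screenshot = page.screenshot()  # PNG bytes of the current view
            action = ask_model(goal, history, page.url, screenshot)
            history.append(action)
            if execute_action(page, action):
                break
        browser.close()
    return history
```

The max_steps cap is worth keeping: if the model never emits send_msg(), the loop still ends.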

This creates a closed loop where the AI is truly in the driver's seat, seeing the results of its actions and deciding what to do next.

This new wave of vision-based agents feels like a significant leap forward. They are less brittle, more intuitive, and operate on a principle that's much closer to how we, as humans, perceive and interact with the digital world. The fact that we can get a model like MolmoWeb-4B running in a simple notebook and see it reason visually is, frankly, incredible. It's a powerful glimpse into a future where AI can act as a true partner, navigating the complexities of the web on our behalf.

I Built an AI Web Agent That Sees Websites—Here’s How You Can, Too
