Have you ever played a video game and thought about how the enemies seem to know where they're going? Or watched a video of a robot navigating a cluttered room and wondered, "How is it thinking?"
It often feels like magic. But what if I told you the core idea is something you do every single day?
When you're driving, you’re constantly running little simulations in your head. "If I switch lanes now, will I have enough space?" "If I take this exit, will I hit traffic?" You're predicting the future based on what you see, and then you act.
We can actually teach an AI to do the same thing. And today, we're going to build a tiny one from the ground up. We'll create a simple agent that learns to "see" its world through raw pixels, build a mental model of how that world works, and then use that model to "imagine" the future and make smart decisions.
Forget massive, complicated robotics libraries. We're doing this with simple Python and PyTorch. Let's get our hands dirty.
First, We Need a Playground
Before our AI can learn, it needs a world to live in. We could use a fancy game engine, but that's overkill and adds a lot of complexity. Instead, we're going to build our own little "Grid World" completely from scratch.
Think of it like a simple chessboard. We'll have:
- An agent (the blue square)
- A goal (the green square)
- Some obstacles (the red squares)
The crucial part? We're not going to cheat. We won't feed the agent its (x, y) coordinates. Nope. Our agent will only get to see what we see: an RGB image of the grid. It has to learn to understand its position and navigate the world just by looking at pixels, kind of like a real robot would see through a camera.
We'll use NumPy to render this world. This keeps things incredibly lightweight and means we have full control over everything. We define the rules: the agent can move up, down, left, right, or stay put. If it tries to walk into a wall or an obstacle, it just stays where it is. Simple.
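Here's a minimal sketch of what such a grid world could look like in NumPy. The class name `GridWorld`, the 8x8 grid, the 4-pixel cell size, the obstacle and goal positions, and the action numbering are all assumptions made for illustration, not the exact setup used for this post.

```python
import numpy as np

# Assumed action numbering: 0=up, 1=down, 2=left, 3=right, 4=stay.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}


class GridWorld:
    """Tiny grid world that only ever hands back RGB pixels."""

    def __init__(self, size=8, cell=4, start=(0, 0), goal=(7, 7),
                 obstacles=((2, 2), (3, 5), (5, 3))):
        self.size, self.cell = size, cell
        self.start, self.goal = start, goal
        self.obstacles = set(obstacles)
        self.agent = start

    def reset(self):
        self.agent = self.start
        return self.render()

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.agent[0] + dr, self.agent[1] + dc
        in_bounds = 0 <= r < self.size and 0 <= c < self.size
        # Walking into a wall or an obstacle leaves the agent where it is.
        if in_bounds and (r, c) not in self.obstacles:
            self.agent = (r, c)
        return self.render()

    def render(self):
        img = np.zeros((self.size, self.size, 3), dtype=np.uint8)
        for r, c in self.obstacles:
            img[r, c] = (255, 0, 0)      # red obstacles
        img[self.goal] = (0, 255, 0)     # green goal
        img[self.agent] = (0, 0, 255)    # blue agent (drawn last)
        # Blow each grid cell up into a cell x cell block of pixels.
        return img.repeat(self.cell, axis=0).repeat(self.cell, axis=1)


env = GridWorld()
obs = env.reset()
print(obs.shape)
```

Notice that `step` returns nothing but the rendered image. The agent's position lives inside the environment and never leaks out as an observation.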
Time for School: Gathering Experience
Okay, we have a world. Now, how does the agent learn its rules?
Imagine a toddler exploring a new room. They don't have an instruction manual. They just move around randomly, touching things, falling over, and slowly building an internal model of how physics works. We're going to do the exact same thing with our agent.
We'll let it take thousands of random actions in our grid world. For every single step it takes, we'll record a "transition":
- What did the world look like before the action? (Image t)
- What action did it take? (e.g., "move right")
- What did the world look like after the action? (Image t+1)
We collect a huge dataset of these (before, action, after) experiences. This data is gold. It contains all the implicit rules of our world—like "if you're next to a red block and you try to move into it, you don't actually move." The agent will use this data to build its brain.
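A rough sketch of that collection loop is below. It works with any environment object that has `reset()` and `step(action)` methods returning an image. The function name `collect_transitions`, the periodic reset every `episode_len` steps, and the little `DotEnv` stand-in world (included only so the snippet runs on its own) are assumptions, not part of the original setup.

```python
import numpy as np


def collect_transitions(env, num_steps=5000, num_actions=5, episode_len=50, seed=0):
    """Take random actions and record (before, action, after) triples."""
    rng = np.random.default_rng(seed)
    before, actions, after = [], [], []
    obs = env.reset()
    for t in range(num_steps):
        if t > 0 and t % episode_len == 0:
            obs = env.reset()                  # assumed: restart now and then
        action = int(rng.integers(num_actions))
        next_obs = env.step(action)
        before.append(obs)
        actions.append(action)
        after.append(next_obs)
        obs = next_obs                         # "after" becomes the next "before"
    return np.stack(before), np.array(actions, dtype=np.int64), np.stack(after)


class DotEnv:
    """Stand-in world: one blue pixel on a 1x5 strip (0=left, 1=right, 2=stay)."""

    def reset(self):
        self.pos = 2
        return self._obs()

    def step(self, action):
        self.pos = min(4, max(0, self.pos + (-1, 1, 0)[action]))
        return self._obs()

    def _obs(self):
        img = np.zeros((1, 5, 3), dtype=np.uint8)
        img[0, self.pos] = (0, 0, 255)
        return img


before, actions, after = collect_transitions(DotEnv(), num_steps=200, num_actions=3)
print(before.shape, actions.shape, after.shape)
```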
Building the Agent's Brain: A "World Model"
This is where the real magic happens. We're going to build a neural network that acts as the agent's internal "world model." This model has a few key jobs, which might sound familiar if you've heard of models like OpenAI's Sora or Google's VLOGGER.
Our model has three main parts:
- The Encoder (The Eyes): This is a convolutional neural network (CNN). Its job is to take the raw, high-resolution image of the grid and compress it down into a small, dense bundle of numbers. We call this a "latent representation" or a "latent state." You can think of it as the agent's internal "thought" or summary of what it's seeing. It's a much more efficient way to think about the world than dealing with thousands of pixels.
- The Decoder (The Imagination): The decoder does the exact opposite. It takes one of those compressed "thoughts" (a latent state) and learns to reconstruct the original image from it. This is a powerful sanity check. If our agent can successfully turn a scene into a thought and then turn that thought back into the original scene, we know it's capturing the important information.
- The Dynamics Model (The Physics Engine): This is the heart of the operation. This part of the network learns to predict the future, but it does it entirely in the compressed "thought space." It answers the question: "If the world is currently in this latent state, and I take this action, what will the next latent state look like?"
When we put it all together, we have a model that can look at the world, understand it, and predict how it will change.
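To make the three parts concrete, here's one way they might be wired up in PyTorch. Treat it as a sketch under assumptions: the 32x32 input size, the 32-dimensional latent, the layer sizes, and the names `WorldModel`, `encode`, `decode`, and `predict_next` are all illustrative choices, not the exact architecture behind this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WorldModel(nn.Module):
    def __init__(self, latent_dim=32, num_actions=5):
        super().__init__()
        self.num_actions = num_actions
        # Encoder (the eyes): 3x32x32 image -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 16x16x16
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 32x8x8
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, latent_dim),
        )
        # Decoder (the imagination): latent vector -> 3x32x32 image in [0, 1].
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Dynamics (the physics engine): (latent, action) -> next latent.
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + num_actions, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def encode(self, image):
        return self.encoder(image)

    def decode(self, latent):
        return self.decoder(latent)

    def predict_next(self, latent, action):
        # Actions are integer ids; one-hot them before concatenating.
        one_hot = F.one_hot(action, self.num_actions).float()
        return self.dynamics(torch.cat([latent, one_hot], dim=-1))


model = WorldModel()
images = torch.rand(4, 3, 32, 32)            # a fake batch of frames
actions = torch.tensor([0, 1, 2, 4])
latent = model.encode(images)
next_latent = model.predict_next(latent, actions)
imagined = model.decode(next_latent)
print(latent.shape, next_latent.shape, imagined.shape)
```

The last three lines are the whole "look, predict, imagine" pipeline: encode the current frame, step forward in latent space, and decode the result back into pixels.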
Training: Teaching the Agent to Think
Now we just need to train this brain. We show our model the massive dataset of transitions we collected earlier. For each example, we ask it to do two things:
- Given the "before" image and an action, predict the "after" image.
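- Given the "before" image and an action, predict the agent's coordinates in the "after" state. (The coordinates show up here only as a training target; the model's input is still just the image and the action.)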
We then calculate a "loss"—a measure of how wrong its predictions were compared to the actual outcome. If its predicted image looks nothing like the real next frame, the loss is high. We use this loss signal to slowly tweak the model's parameters until its predictions get better and better.
After a few minutes of training, our model becomes a pretty decent simulator. It has learned the basic "physics" of our grid world, all on its own, just from watching random gameplay.
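A single training step along those lines might look like the sketch below. To keep the snippet runnable by itself it uses a tiny fully connected stand-in (`TinyModel`) instead of a CNN, random tensors instead of real transitions, and mean squared error for both objectives. Those choices, along with the `position_head` layer and the `pos_weight` knob, are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyModel(nn.Module):
    """Fully connected stand-in for the world model, only so this runs alone."""

    def __init__(self, image_dim=3 * 8 * 8, latent_dim=16, num_actions=5):
        super().__init__()
        self.num_actions = num_actions
        self.encoder = nn.Linear(image_dim, latent_dim)
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim)
        self.decoder = nn.Linear(latent_dim, image_dim)
        self.position_head = nn.Linear(latent_dim, 2)   # hypothetical (row, col) head

    def forward(self, image, action):
        latent = self.encoder(image.flatten(1))
        one_hot = F.one_hot(action, self.num_actions).float()
        next_latent = self.dynamics(torch.cat([latent, one_hot], dim=-1))
        return self.decoder(next_latent).view_as(image), self.position_head(next_latent)


def train_step(model, optimizer, before, action, after, after_pos, pos_weight=1.0):
    pred_image, pred_pos = model(before, action)
    image_loss = F.mse_loss(pred_image, after)       # how wrong is the imagined frame?
    pos_loss = F.mse_loss(pred_pos, after_pos)       # how wrong is the predicted position?
    loss = image_loss + pos_weight * pos_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# Random placeholder batch; real training would use the collected transitions.
before = torch.rand(32, 3, 8, 8)
after = torch.rand(32, 3, 8, 8)
action = torch.randint(0, 5, (32,))
after_pos = torch.rand(32, 2)                        # coordinates scaled to [0, 1]
losses = [train_step(model, optimizer, before, action, after, after_pos)
          for _ in range(50)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```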
The Master Plan: Thinking Ahead with Model Predictive Control
So, our agent has an imagination. Now what? How does it use that to actually achieve a goal?
This is where a beautiful algorithm called Model Predictive Control (MPC) comes in. It's way simpler than it sounds.
At every single step, the agent stops and thinks. It doesn't just pick one action. Instead, it runs a bunch of "what-if" scenarios entirely inside its head using its trained world model.
- Brainstorm: It generates a bunch of random action sequences. For example:
  - Plan A: Up, Up, Right, Left, Stay
  - Plan B: Right, Right, Right, Up, Up
  - Plan C: Left, Stay, Down, Down, Right
  - ...and so on, maybe a hundred different plans.
- Simulate: For each plan, it uses its dynamics model to predict the future. It "imagines" the sequence of latent states that would result from following that plan.
- Evaluate: After simulating each plan, it looks at the final predicted state and asks, "How close did this plan get me to the goal?"
- Execute: It finds the plan that resulted in the best outcome (i.e., got closest to the green square). And here's the key: it only executes the first step of that winning plan.
Then, it repeats the entire process. It looks at the world again, brainstorms a new set of plans from its new position, simulates, evaluates, and takes the single best next step.
This is so powerful because it means the agent is constantly re-planning. If something unexpected happens, it doesn't matter, because at the very next step, it will re-evaluate everything from scratch. It's a closed-loop system that is both simple and incredibly robust.
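The brainstorm, simulate, evaluate, execute loop can be sketched in a few lines. This version is often called "random shooting" MPC. The function `plan_first_action` and its `dynamics_fn` and `cost_fn` arguments are hypothetical names: in our setup, `dynamics_fn` would be the learned latent dynamics model and `cost_fn` would score a predicted state by how far it is from the goal. So the snippet runs without a trained model, it plugs in hand-written toy grid dynamics (no obstacles) and a Manhattan-distance cost instead.

```python
import numpy as np

# Assumed action numbering: 0=up, 1=down, 2=left, 3=right, 4=stay.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}
GOAL, SIZE = (7, 7), 8


def plan_first_action(state, dynamics_fn, cost_fn, num_plans=100, horizon=5,
                      num_actions=5, rng=None):
    """Sample random plans, roll each one out, return the best plan's first action."""
    rng = rng if rng is not None else np.random.default_rng()
    plans = rng.integers(num_actions, size=(num_plans, horizon))   # brainstorm
    best_cost, best_action = float("inf"), int(plans[0, 0])
    for plan in plans:
        imagined = state
        for action in plan:                                        # simulate
            imagined = dynamics_fn(imagined, int(action))
        cost = cost_fn(imagined)                                   # evaluate final state
        if cost < best_cost:
            best_cost, best_action = cost, int(plan[0])
    return best_action                                             # execute one step only


def toy_dynamics(state, action):
    """Toy stand-in for the learned model: known grid moves, walls block."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state


def toy_cost(state):
    """Manhattan distance to the goal."""
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])


# Closed loop: plan, take one step, look again, re-plan.
rng = np.random.default_rng(0)
state = (0, 0)
for _ in range(60):
    if state == GOAL:
        break
    action = plan_first_action(state, toy_dynamics, toy_cost, rng=rng)
    state = toy_dynamics(state, action)
print("final state:", state)
```

Throwing away everything except the first action looks wasteful, but that's exactly what makes the loop closed: every step starts from a fresh observation, so a bad prediction a few steps out gets corrected before it ever matters.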
Let's See It in Action
When we run the final agent, it's amazing to watch. It's not just blindly following a path. At each step, you can see its internal prediction for what will happen next if it takes its chosen action. You can literally see it "thinking" one step ahead.
It navigates around obstacles and finds its way to the goal with surprising efficiency. And remember, no one ever programmed the rules of the world for it. It learned the dynamics and how to plan all by itself, just from pixels.
What we've built here is a tiny, self-contained example of the principles that power much larger, more complex AI systems. This idea of learning a world model and using it to search for a good plan is a cornerstone of modern robotics and reinforcement learning. It’s a glimpse into how we can build agents that don't just react to the world, but actively reason about their place within it. Pretty cool, right?




