You’ve probably seen them by now—those mind-bendingly realistic videos generated by AI from just a few words of text. A dog skateboarding on Mars, a historical event reimagined, you name it. They’re incredible, but they all share a fundamental limitation.
They’re one-and-done.
The AI generates a short clip, and that's it. The story ends. There's no memory of what just happened and no way to influence what happens next. It's like a movie that’s only one scene long. What if you wanted to tell the skateboarding dog to "turn left" or "do a kickflip"? You can't. The world inside that video vanishes the moment it’s created.
Well, that’s the exact problem a new model from researchers at MBZUAI’s Institute of Foundation Models is trying to solve. It’s called PAN, and it’s not just another video generator. It’s designed to be a "general world model"—a persistent, interactive simulation that you can influence over time.
Think of it less like a movie director and more like a video game engine.
From Video Maker to World Simulator
So, what makes PAN so different? It all comes down to one core idea: it maintains an internal "world state."
Imagine you’re playing a video game and you hit the save button. The game saves everything—your character's position, health, inventory, the state of the world around you. PAN does something similar. It keeps a latent (hidden) representation of its world in memory.
When you give it a command in plain English, like "turn the car left and speed up," it doesn't start from scratch. It takes its current world state, applies your action, and calculates the next world state. Then, and only then, does it generate a short video clip showing you the result.
This cycle can repeat over and over. The same world evolves across many steps, guided by your instructions. That makes two things possible. You could compare different futures ("what if I turn left vs. right?"), or an AI agent could use PAN as a sandbox to practice and plan its actions before trying them in the real world.
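To make the state, action, next-state cycle concrete, here is a minimal sketch in Python. It is a toy, not PAN's code: the `WorldState` fields, the keyword matching in `step`, and the `rollout` helper are all made up for illustration, and a real latent state would be a learned vector rather than four readable numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorldState:
    """Toy stand-in for a latent world state (hypothetical fields)."""
    x: float
    y: float
    heading: int  # degrees: 0 = east, 90 = north, 180 = west, 270 = south
    speed: float


def step(state: WorldState, action: str) -> WorldState:
    """Apply one plain-text action to the current state and return the next state."""
    heading, speed = state.heading, state.speed
    if "left" in action:
        heading = (heading + 90) % 360
    if "right" in action:
        heading = (heading - 90) % 360
    if "speed up" in action:
        speed += 1.0
    # Advance one time step along the axis-aligned heading.
    dx, dy = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}[heading]
    return WorldState(state.x + dx * speed, state.y + dy * speed, heading, speed)


def rollout(state: WorldState, actions: List[str]) -> List[WorldState]:
    """Evolve the same world over several actions, keeping every intermediate state."""
    history = [state]
    for action in actions:
        state = step(state, action)  # never restarts from scratch
        history.append(state)
    return history


start = WorldState(x=0.0, y=0.0, heading=0, speed=1.0)

# Two futures branching from the same saved state.
left_future = rollout(start, ["turn left and speed up", "keep going"])
right_future = rollout(start, ["turn right", "keep going"])

print(left_future[-1])
print(right_future[-1])
```

Because `start` is never modified, both branches begin from exactly the same world, which is what makes the "what if" comparison meaningful.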
The Brains of the Operation: How PAN Thinks and Sees
Alright, this sounds cool, but how does it actually work under the hood? It’s built on an architecture called Generative Latent Prediction, or GLP. Let’s break that down without getting too lost in the jargon.
You can think of it as having three main parts that work together:
- The Eyes (Vision Encoder): This part looks at a video frame and converts it into the model’s internal language—that latent world state we talked about. It essentially creates a digital blueprint of what's happening.
- The Brain (Latent Dynamics Backbone): This is where the thinking happens. It’s built on a large language model that takes the current world state, your text command ("speed up!"), and the history of what’s happened, and predicts the next world state.
- The Artist (Video Decoder): This part takes the new, predicted world state from the brain and translates it back into pixels, creating the next video segment for you to see.
To build this, the team didn't start from scratch. They stood on the shoulders of giants. The "Eyes" and "Brain" are built on Qwen2.5-VL-7B-Instruct, a powerful model that's already great at understanding both images and text. The "Artist" is an adapted version of Wan2.1-T2V-14B, a diffusion model known for creating stunningly high-quality videos.
By separating what happens (the dynamics in the brain) from how it looks (the rendering by the artist), PAN lets each part specialize: the brain only has to keep the world consistent and logical, and the artist only has to make it look right.
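The three-part split can be sketched as three small functions wired together. Everything here is a simplified assumption for illustration: the names `encode`, `predict`, `decode`, and `simulate` are hypothetical, a "frame" is just a short list of brightness values, and the "latent" is two numbers rather than a learned embedding.

```python
from typing import List, Sequence

Frame = List[float]   # toy frame: a flat list of pixel intensities
Latent = List[float]  # toy latent state: one number per half of the frame


def encode(frame: Frame) -> Latent:
    """Eyes: compress a frame into a latent (here, the mean of each half)."""
    half = len(frame) // 2
    left, right = frame[:half], frame[half:]
    return [sum(left) / len(left), sum(right) / len(right)]


def predict(latent: Latent, action: str, history: Sequence[Latent]) -> Latent:
    """Brain: predict the next latent from the current latent, action, and history.

    The toy rule ignores `history`; a real dynamics model would condition on it.
    """
    shift = {"brighten": 0.1, "darken": -0.1}.get(action, 0.0)
    return [value + shift for value in latent]


def decode(latent: Latent, size: int = 4) -> Frame:
    """Artist: render a latent back into pixels."""
    half = size // 2
    return [latent[0]] * half + [latent[1]] * (size - half)


def simulate(first_frame: Frame, actions: Sequence[str]) -> List[Frame]:
    """Encode once, then evolve the latent; decoding is only for viewing."""
    latent = encode(first_frame)
    history = [latent]
    clips = []
    for action in actions:
        latent = predict(latent, action, history)  # dynamics stay in latent space
        history.append(latent)
        clips.append(decode(latent))               # rendering is a separate step
    return clips


clips = simulate([0.2, 0.2, 0.6, 0.6], ["brighten", "brighten", "darken"])
print(clips[-1])
```

Note that the loop carries `latent` forward directly and never re-encodes its own rendered output, so any imperfection in `decode` cannot leak back into the world state.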
The Secret to Not Messing Up Long Videos
Here’s a huge challenge with generating video over long periods. If you just create one clip, then use its last frame to start the next one, things fall apart. Fast.
Tiny errors and inconsistencies build up, and before you know it, the video becomes a glitchy, nonsensical mess. It’s like a game of telephone where the message gets distorted with each retelling.
PAN has a clever solution for this called Causal Swin DPM. It’s a bit of a mouthful, but the concept is brilliant. Instead of just looking at the single last frame, the video decoder works on a sliding window of video chunks.
Think of it like an animator who keeps the last few drawings visible to make sure the next one flows smoothly. The model is always looking at a small chunk of the recent past to generate the immediate future. This "causal attention" means it can only look backward, not into the future it hasn't created yet, which keeps everything consistent.
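One way to picture "only look backward, and only a little way back" is as an attention mask over video chunks. The sketch below is a generic sliding-window causal mask, not PAN's actual implementation; the function names and the window size of 2 are assumptions chosen for illustration.

```python
from typing import List


def sliding_causal_mask(num_chunks: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when chunk i is allowed to attend to chunk j.

    Chunk i sees itself and the (window - 1) chunks just before it,
    and never any chunk that comes after it.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_chunks)]
        for i in range(num_chunks)
    ]


def render(mask: List[List[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


mask = sliding_causal_mask(num_chunks=4, window=2)
print(render(mask))
# x...
# xx..
# .xx.
# ..xx
```

Each row is one chunk being generated. Nothing to the right of the diagonal is ever marked, which is the "causal" part, and the band has a fixed width, which is the "sliding window" part.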
They also added another smart trick: they feed the model a slightly noisy version of the last frame instead of a perfectly clean one. Why? It forces the model to ignore tiny, irrelevant pixel details and focus on the important stuff—the objects, their positions, and the overall structure of the scene. This helps prevent that slow drift into chaos.
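The noisy-conditioning idea can be illustrated with a few lines of Python. This is a sketch under assumptions: the helper name `add_conditioning_noise`, the Gaussian noise, and the value `sigma=0.05` are illustrative choices, not details taken from PAN.

```python
import random
from typing import List


def add_conditioning_noise(frame: List[float], sigma: float,
                           rng: random.Random) -> List[float]:
    """Return a copy of the frame with small Gaussian noise added to every pixel."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in frame]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy "last frame": a dark region followed by a bright region.
clean = [0.2] * 500 + [0.8] * 500
noisy = add_conditioning_noise(clean, sigma=0.05, rng=rng)

# Individual pixels no longer match, but the coarse structure survives:
# the dark half is still dark on average and the bright half still bright.
print(noisy[:3])
print(mean(noisy[:500]), mean(noisy[500:]))
```

A model conditioned on `noisy` cannot rely on exact pixel values, so the information it can still trust is the large-scale layout of the scene.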
Training a Digital World Modeler
As you can imagine, teaching an AI to do all this is a massive undertaking. The team trained PAN in two main stages.
- Stage 1: Teach the Artist. First, they took the powerful Wan2.1 video model and adapted it to their new Causal Swin DPM architecture. This part of the training alone used a whopping 960 NVIDIA H200 GPUs.
- Stage 2: Connect the Brain. Next, they froze the Qwen2.5-VL model (the brain) and plugged it into the newly trained video decoder (the artist). They then trained the connecting parts to ensure the brain's predicted world states could be accurately turned into video by the artist.
The data they used was just as important. They didn't just grab random clips from the internet. They sourced long videos, broke them into logical scenes, and filtered out anything low-quality or irrelevant. Crucially, they re-captioned the clips with dense descriptions of the actions and causal events happening. This taught the model the link between a command ("a person picks up a red block") and the visual result.
So, Does It Actually Work? The Results Are In
The researchers put PAN through a tough series of tests, comparing it against other leading open-source models and even some heavy-hitting commercial ones like Gen-3 and KLING.
Here’s how it stacked up:
- Action Fidelity (Does it do what you say?): PAN was the best among all open-source models at correctly following commands, scoring 70.3% accuracy in simulating an agent's actions.
- Long-Horizon Stability (Does it stay consistent?): This is where PAN really shines. Thanks to its clever architecture, it scored highest on metrics for "Transition Smoothness" and "Simulation Consistency," beating all baselines, including the commercial ones. The world it creates just holds together better over time.
- Planning (Can it "think ahead"?): When used inside another AI agent as a simulator for planning, PAN again achieved the best results among open-source world models.
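To show what "using a world model for planning" means in practice, here is a toy planner that tries candidate action sequences inside a simulator and keeps the one that ends closest to the goal. The grid world, the `MOVES` table, and the brute-force search are stand-ins invented for this sketch; they say nothing about how the benchmarked agents were actually built.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Position = Tuple[int, int]

# Hypothetical action set for a tiny grid world.
MOVES: Dict[str, Position] = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, 1),
    "down": (0, -1),
}


def simulate(position: Position, actions: Sequence[str]) -> Position:
    """Stand-in for the world model: predict where a sequence of actions leads."""
    x, y = position
    for action in actions:
        dx, dy = MOVES[action]
        x, y = x + dx, y + dy
    return (x, y)


def plan(start: Position, goal: Position, horizon: int) -> Tuple[str, ...]:
    """Imagine every action sequence of length `horizon` and keep the best one."""
    def distance_to_goal(actions: Sequence[str]) -> int:
        x, y = simulate(start, actions)
        return abs(x - goal[0]) + abs(y - goal[1])

    return min(product(MOVES, repeat=horizon), key=distance_to_goal)


best = plan(start=(0, 0), goal=(2, 1), horizon=3)
print(best, "->", simulate((0, 0), best))
```

All the trial and error happens inside `simulate`, so nothing is executed in the real environment until a plan has been chosen. The better the simulator matches reality, the more trustworthy the chosen plan.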
Why This Is More Than Just Another Cool AI Demo
Look, it’s easy to get numb to new AI models dropping every week. But PAN feels different. This isn’t just about making prettier videos. It’s a foundational step toward creating AI that can understand, predict, and interact with a dynamic world.
What the MBZUAI team has done is show how you can combine a powerful vision-language model with a state-of-the-art video generator to create something that’s more than the sum of its parts. It’s not a generative toy; it’s a functional world model.
The fact that they’ve been so transparent about their methods, benchmarks, and architecture is a huge plus. It moves the whole field forward. We’re still in the early days, but models like PAN are paving the way for more capable robots, smarter autonomous systems, and incredibly rich simulation environments for everything from games to scientific research. It's a glimpse into a future where AI doesn't just describe our world, but can simulate its possibilities.




