Have you ever tried to Photoshop someone out of a group photo? It’s a pain, right? You painstakingly trace around them, hit delete, and then use the clone stamp tool to fill in the background. If you’re lucky, you get a decent result. If you’re not, you get a blurry, smudged mess where your friend used to be.
Now, imagine trying to do that for a video. For every single frame.
That’s the nightmare that video editors and VFX artists have lived with for decades. Removing an object from a video is one thing, but making the scene look like the object was never there is a whole different beast. You can’t just paint over it, because the world interacts. Shadows shift, reflections warp, and most importantly, things fall down.
This is where a team from Netflix and INSAIT (at Sofia University) comes in. They’ve open-sourced a new AI model called VOID, which stands for Video Object and Interaction Deletion. It doesn’t just erase pixels; it aims to erase the object’s entire physical presence from the scene.
The Floating Guitar Problem
Let’s get to the heart of what makes this so difficult. Imagine a video of a musician holding a guitar. If you use a standard AI video editing tool to remove the person, you’re often left with a bizarre sight: a guitar, floating in mid-air, defying the laws of gravity.
This is because most object-removal AIs are basically super-smart painters. They’re trained to identify the object you want to remove and then fill in the space behind it with what they think the background should look like. The better ones can also clean up appearance-level side effects like the object’s shadow or reflection, but they have zero concept of cause and effect. They don’t understand that the person was the only thing keeping that guitar from crashing to the floor.
This is the exact problem VOID was built to solve. It doesn’t just ask, “What pixels should I fill in?” It asks, “What would have actually happened in this scene if this object never existed?”
When you use VOID to remove the musician, it also removes their effect on the guitar. The AI understands the guitar was being held, and now that the support is gone, gravity takes over. In the resulting video, the guitar falls naturally. It’s a subtle change that makes all the difference between a weird-looking special effect and a completely believable scene.
So, How Does This Magic Work? A Peek Under the Hood
Alright, so how did they pull this off? VOID is built on top of CogVideoX, a powerful open video generation model family that originated at Zhipu AI and Tsinghua University (Alibaba’s PAI team maintains a fine-tuned variant called CogVideoX-Fun). Think of it like a video version of an image generator like Stable Diffusion: it works with sequences of frames instead of just a single picture.
But the real secret sauce, the thing that gives VOID its "physics brain," is an innovation called the quadmask.
When you use a normal editing tool, you typically create a binary mask. It’s a simple black-and-white stencil. One color says "keep this," and the other says "delete this" (which is which depends on the tool). Simple.
VOID’s quadmask is way more sophisticated. Instead of just two options, it gives the AI a four-part blueprint for understanding the scene:
- Primary Object: "This is the main thing you need to remove." (e.g., the musician)
- Overlap Region: "This is where the main object and the affected objects touch." (e.g., the musician's hands on the guitar)
- Affected Region: "This is the other stuff that's going to change or move because the primary object is gone." (e.g., the guitar that's about to fall)
- Background: "Leave all of this stuff alone." (e.g., the wall, the floor, the furniture)
By giving the AI this incredibly detailed, color-coded map, you’re not just telling it what to erase. You’re giving it the context it needs to understand the physical relationships between objects. It’s the difference between telling a contractor to "knock down that wall" versus giving them a full architectural plan that shows which other beams will need new support.
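To make the four-region idea concrete, here is a minimal sketch of how two ordinary binary masks (one for the thing being removed, one for the stuff it affects) could be merged into a single four-valued mask. This is an illustration only, not VOID’s actual code: the function name `build_quadmask` and the four gray levels are assumptions made up for this example, and the real encoding may differ.

```python
# Hypothetical gray levels for the four regions; the real encoding may differ.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(primary, affected):
    """Merge two same-sized boolean grids (lists of rows) into one four-valued mask."""
    if len(primary) != len(affected):
        raise ValueError("masks must have the same number of rows")
    quad = []
    for p_row, a_row in zip(primary, affected):
        if len(p_row) != len(a_row):
            raise ValueError("mask rows must have the same length")
        row = []
        for p, a in zip(p_row, a_row):
            if p and a:
                row.append(OVERLAP)      # both masks cover this pixel
            elif p:
                row.append(PRIMARY)      # only the object being removed
            elif a:
                row.append(AFFECTED)     # only the object that will react
            else:
                row.append(BACKGROUND)   # untouched scenery
        quad.append(row)
    return quad


# Toy 2x6 frame: the "musician" covers columns 0-2, the "guitar" columns 2-4,
# so column 2 is where the hands touch the guitar.
WIDTH, HEIGHT = 6, 2
primary = [[0 <= x <= 2 for x in range(WIDTH)] for _ in range(HEIGHT)]
affected = [[2 <= x <= 4 for x in range(WIDTH)] for _ in range(HEIGHT)]

quadmask = build_quadmask(primary, affected)
for row in quadmask:
    print(row)  # [0, 0, 63, 127, 127, 255]
```

For a video you would build one of these per frame. The key point is that the overlap gets its own label instead of being swallowed by either mask.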
A Two-Step Process for a Flawless Finish
Even with this clever mask, generating perfectly stable and consistent video is hard. Video AI models sometimes have a weird quirk where objects seem to slowly warp or "melt" across frames. It’s a known temporal-consistency problem with generated video.
To combat this, the VOID team developed a two-pass system.
- Pass 1: This is the main workhorse. You feed it the video and the quadmask, and it does the heavy lifting of removing the object and simulating the new physics. For most videos, this is all you’ll need.
- Pass 2: This is an optional "polishing" step. If the first pass results in any of that weird object-morphing, you can run this second pass. It cleverly analyzes the motion from the first video to guide a second attempt, essentially locking the shape of the newly created objects (like the falling guitar) so they stay solid and consistent from frame to frame.
Think of it like writing a draft and then doing a final proofread. Pass 1 gets the story right, and Pass 2 cleans up any awkward phrasing to make it perfect.
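If it helps to see the draft-then-proofread flow as control logic, here is a rough sketch. Everything in it is a stand-in: `remove_object`, `first_pass`, `second_pass`, and `looks_unstable` are hypothetical names invented for this example, and the real models are passed in as plain functions rather than called through any actual VOID API.

```python
from typing import Callable, List

Frame = List[List[int]]  # stand-in type: one grayscale frame as rows of pixels
Video = List[Frame]


def remove_object(
    video: Video,
    quadmask: Video,
    first_pass: Callable[[Video, Video], Video],
    second_pass: Callable[[Video, Video, Video], Video],
    looks_unstable: Callable[[Video], bool],
) -> Video:
    """Run pass 1, then run the optional pass 2 only if the draft looks unstable."""
    draft = first_pass(video, quadmask)
    if not looks_unstable(draft):
        return draft  # for most clips the first pass is enough
    # Pass 2 sees the draft so it can reuse the motion found in pass 1.
    return second_pass(video, quadmask, draft)


# Tiny demo with fake passes that only record that they were called.
log = []


def demo_first_pass(video, quadmask):
    log.append("pass 1")
    return video


def demo_second_pass(video, quadmask, draft):
    log.append("pass 2")
    return draft


clip = [[[0, 0], [0, 0]]]       # one 2x2 frame
mask = [[[255, 255], [0, 0]]]   # matching one-frame mask
remove_object(clip, mask, demo_first_pass, demo_second_pass, lambda draft: True)
print(log)  # ['pass 1', 'pass 2']
```

In practice the "does this look unstable?" check would probably be a human eyeballing the result, which is why it is just a pluggable function here.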
How Do You Train an AI on Impossible Videos?
This might be the most fascinating part of the whole project. To teach an AI to understand physics, you need to show it tons of "before and after" examples. You’d need a video of a scene with an object, and then the exact same video of the exact same scene without the object, where everything else reacts realistically.
Here’s the catch: that kind of data doesn’t exist in the real world. You can't just ask an actor to hold a guitar and then film them again not holding it while the guitar magically falls on its own.
So, the researchers had to get creative. They built their own training data from scratch using physically simulated 3D scenes.
They drew on two main sources. The first, HUMOTO, is a motion-capture dataset of humans interacting with objects, which they brought to life in the 3D software Blender. They would render the scene once with the person in it. Then, they’d go back into the simulation, delete the digital human, and re-run the physics engine. This created a perfectly paired, physically accurate "counterfactual" video where the objects behaved exactly as they should.
The second, Google’s Kubric, is a tool for generating synthetic scenes. It did the same job for object-on-object interactions, like one box knocking over another.
By generating thousands of these synthetic video pairs, they created the perfect textbook for VOID to study from. The AI could directly compare the "before" and "after" and learn the subtle rules of causality and physics that govern our world.
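Here is a toy version of that render-twice trick, boiled down to a single number: the height of an object that is either being held or not. It is a sketch of the idea only, not the researchers’ pipeline. The names `simulate_height` and `make_pair` are made up for this example, and a simple step-by-step gravity update stands in for a real physics engine.

```python
GRAVITY = 9.81  # m/s^2
FLOOR = 0.0     # height of the ground in meters


def simulate_height(start_height, held, steps, dt=0.1):
    """Return the object's height at each step; a held object never moves."""
    height, velocity = start_height, 0.0
    trajectory = []
    for _ in range(steps):
        trajectory.append(height)
        if held or height <= FLOOR:
            continue  # supported by the holder, or already resting on the floor
        velocity -= GRAVITY * dt
        height = max(FLOOR, height + velocity * dt)
    return trajectory


def make_pair(start_height, steps=8):
    """Simulate the same scene twice: with the holder, and with the holder deleted."""
    return {
        "original": simulate_height(start_height, held=True, steps=steps),
        "counterfactual": simulate_height(start_height, held=False, steps=steps),
    }


pair = make_pair(1.0)
print([round(h, 2) for h in pair["original"]])
print([round(h, 2) for h in pair["counterfactual"]])
```

The first list stays flat because the object is held the whole time. The second starts at the same height and drops to the floor. Swap the single number for fully rendered frames and you have the kind of before-and-after pair the model learns from.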
This is more than just a cool new tool for Hollywood. It’s a fundamental step forward in getting AI to not just see the world, but to understand it. By teaching models about cause and effect, we’re moving from AIs that are simple pattern-matchers to ones that can reason about the world in a way that feels a lot more intuitive, and a lot more human. And the fact that Netflix is sharing this with everyone? That’s just awesome.




