A Hands-On Guide to Netflix's VOID: Let's Remove Any Object from a Video

Akram Chauhan
Akram Chauhan
8 min read127 views
A Hands-On Guide to Netflix's VOID: Let's Remove Any Object from a Video

Ever watched a movie and seen a perfectly clean shot, knowing full well there must have been wires, cameras, or a rogue coffee cup somewhere on set? Or maybe you’ve got a great video clip that’s almost perfect, except for that one distracting thing in the background.

For years, removing objects from video was the painstaking work of visual effects artists, costing a ton of time and money. But today, we've got AI that can do a lot of the heavy lifting. And Netflix, a company that knows a thing or two about video, has a fantastic tool called VOID (Video Object Inpainting and Denoising) that lets us do just that.

Today, we’re going to roll up our sleeves and build a working pipeline for it. This isn't just theory; we're going to set it up, run it, and see the magic happen right before our eyes. Think of me as your guide, and let's get this thing running together.

First Things First: Getting Your Workspace Ready

Before we can start making things disappear, we need to set up our digital workshop. This is the part where we install the necessary tools and get our environment in order. It’s like prepping your kitchen before you start cooking—a little setup now saves a lot of headaches later.

First, a quick heads-up: this model is a bit of a beast. The official notes recommend having over 40GB of VRAM, which means a powerful GPU like an A100 is your best friend here. You might be able to get by with less, especially with some clever offloading to the CPU, but it’ll be slow. I’m just giving you the real talk so you know what to expect!

We'll be working in a Colab-style environment. The first thing we need to do is make sure we have the right tools installed. We'll grab PyTorch, Hugging Face's libraries for model downloading, and a few other essentials.

We’ll also need a Hugging Face token to download the models. It’s basically a password that gives you access. The script will ask you for it securely. And, for a little extra credit, we’ll have an optional spot to enter an OpenAI API key. We’ll get to why that’s cool in a minute.

Once we have our keys, we’ll clone the official VOID repository from Netflix’s GitHub. This gives us all the code and sample files we need to get started.

# This is a simplified representation of the setup code.
# The original script handles all these steps for you.

print("Setting up the environment...")
# Check for GPU
# Install torch, huggingface_hub, ffmpeg, etc.

# Securely get API keys
HF_TOKEN = getpass("Enter your Hugging Face token: ")
OPENAI_API_KEY = getpass("Enter your OpenAI API key (optional): ")

# Clone the repository
run("git clone https://github.com/Netflix/void-model.git /content/void-model")
os.chdir("/content/void-model")

# Install project-specific dependencies
run(f"{sys.executable} -m pip install -q -r requirements.txt")

With that, our workspace is clean, our tools are laid out, and we're ready for the fun part.

Grabbing the Brains of the Operation

Now we need to download the AI models themselves. Think of this as getting the two key parts of our engine.

  1. CogVideoX: This is our base model from Alibaba. It’s a powerful video inpainting model, which means it’s fantastic at filling in missing parts of a video. You can think of it as the raw artistic talent. It knows how to paint, but it needs direction.
  2. VOID Checkpoint: This is Netflix’s special sauce. It’s a set of trained weights that fine-tunes CogVideoX specifically for the task of removing objects. It provides the direction, telling the artist, "Okay, I need you to paint over this object and make it look like it was never there."

We'll use the Hugging Face library to pull these down. The base model is pretty big, so grab a coffee while it downloads.

Choosing Our Target and Crafting the Perfect Prompt

With the models in place, we get to decide what we want to remove. The repository comes with a few built-in samples to play with: a lime falling, a ball rolling, and a kettlebell being placed on a pillow.

We’ll pick one of these to start. But here’s where it gets really interesting. To remove an object, we don't just tell the AI "remove the lime." Instead, we give it a text prompt describing the scene without the object. So for the lime video, a good prompt would be something like, "A table."

This is where that optional OpenAI key comes in. We can actually use a model like GPT-4o-mini to help us write a better, cleaner prompt. We can tell it, "Hey, I'm removing a glass from a scene where a lime falls on a table. Give me a clean background prompt."

It’s a neat little trick to get a more precise description, which often leads to a better result. The script automates this, sending the context to OpenAI and getting back a beautiful, clean prompt like "A lime falls onto a wooden table." We then temporarily swap this new prompt into the sample files. It’s a great example of how you can chain different AI models together to improve your workflow.

Assembling the Pipeline: Connecting All the Pieces

Alright, we have our tools, our models, and our instructions (the prompt). Now it's time to assemble the machine. This part involves loading all the different software components and plugging them into each other.

It looks like a lot of code, but let's break down what's happening:

  • VAE (Variational Autoencoder): This is the part of the model that understands video at a compressed, "latent" level. It's like a translator that turns pixels into a language the AI can understand, and then back into pixels again.
  • Transformer: This is the core of the model. It looks at the video, the mask (the area we want to remove), and the text prompt, and figures out what the new video frames should look like. This is where the VOID checkpoint we downloaded earlier gets loaded.
  • Tokenizer & Text Encoder: These components process our text prompt ("A lime falls onto a wooden table") and turn it into a format the transformer can work with.
  • Scheduler: This piece guides the "denoising" process. Diffusion models work by starting with pure noise and slowly refining it into a clear image or video. The scheduler controls how that happens over a set number of steps.

We load all these pieces, configure some settings (like video size and length), and then wrap them all up into a single, neat CogVideoXFunInpaintPipeline object. We also enable CPU offloading, which is a clever trick to save VRAM by moving parts of the model back and forth between the GPU and system memory.

Finally, we prepare our actual inputs: the video itself, the mask video (a black-and-white video showing exactly which pixels to replace), and our final prompt.

Time for the Magic: Running the VOID Inference

Deep breath. Everything is set up. All the components are loaded and connected. Now, we press the big red button.

We call our pipeline object, feeding it everything it needs:

  • The positive prompt (what we want to see).
  • A negative prompt (things we don't want to see, like watermarks or distortions).
  • The original video and the mask video.
  • Other parameters like the number of steps and guidance scale, which control the quality of the output.
# The inference call looks something like this:
sample = pipe(
    prompt,
    negative_prompt=negative_prompt,
    video=input_video,
    mask_video=input_video_mask,
    # ... other parameters
).videos

The GPU will whir to life, and the model will start its work. It's literally generating new pixels frame by frame to fill in the hole we created with the mask, guided by our text prompt to make sure what it "paints" makes sense with the rest of the scene.

The Big Reveal: Let's Check Out the Results

Once the process is finished, what do we get? A brand-new video tensor.

The script then does something incredibly helpful: it saves the output in two ways. First, it saves the final, inpainted video. This is the "after" shot, where the object is hopefully gone without a trace.

Second, and this is my favorite part, it creates a side-by-side comparison video. It stitches together the original video, the black-and-white mask, and our final output. This is a fantastic way to see exactly what the model did. You can watch the original, see the area that was targeted, and then immediately see the result, all in one synchronized view.

Being able to see the input, mask, and output together is so crucial for understanding what worked and what didn't. Did the model perfectly blend the new background? Are there any weird artifacts? This comparison view makes it super easy to tell.

And there you have it. From a cluttered command line to a clean, edited video, you've just run a state-of-the-art video inpainting pipeline. You didn't just read about it; you built it, configured it, and ran it.

What’s so powerful about this is that it’s a foundation. Now that you understand how the pieces fit together, you can start experimenting. You can try it on your own videos, play with more complex prompts, or even dig into the model's code to tweak its behavior. Welcome to the future of video editing

Tags

AI Machine Learning Deep Learning Generative AI Computer Vision AI Engineering Software Development AI for content creation AI Development Tutorial AI Pipeline Video Object Removal VFX AI Object Erasure AI Video Inpainting Netflix VOID AI Video Editing CogVideoX Custom Prompting End-to-End Inference Visual Effects Software

Stay Updated

Get the latest articles and insights delivered straight to your inbox.

We respect your privacy. Unsubscribe at any time.

Aicosoft

AI & Technology News, Insights & Innovation

AICOSOFT delivers cutting-edge AI news, technology breakthroughs, and innovation insights. Stay informed about artificial intelligence, machine learning, robotics, and the latest tech trends shaping tomorrow.

Connect With Us

© 2026 Aicosoft. All rights reserved.