Meta's New PE-AV Model Teaches AI to Understand Video and Sound Together

Akram Chauhan

Have you ever watched a movie with the sound off? It’s a completely different experience, right? The crash of a wave, the dialogue between characters, the swell of a musical score—it all adds context that the visuals alone can’t provide. We humans do this automatically. We see a dog, we hear a bark, and our brain instantly connects the two.

For AI, this has always been a massive challenge. You’ve had models that are great at understanding images, others that are wizards with text, and some that can process audio. But getting them to understand all three together, in a deeply connected way, has been the holy grail.

Well, it looks like the folks at Meta AI took a giant leap in that direction. They've open-sourced a new model called Perception Encoder Audiovisual, or PE-AV for short. And honestly, it’s one of the most exciting developments I’ve seen in a while. It’s the core tech behind some of their other cool projects, like SAM Audio.

So, let's pull back the curtain and see what makes this thing tick.

So What Exactly is PE-AV?

Think of Meta's AI research as having a core family of models called the "Perception Encoder" (PE) family. This family has specialists. There’s a PE model that’s a rockstar at understanding images, another that’s a pro with video, and so on. They’re all best-in-class at their specific jobs.

PE-AV is the newest member of the family, and its special skill is being a true audiovisual expert. It’s designed to learn from audio, video, and text all at the same time and understand how they relate to each other.

The goal is to create a single, unified "embedding space." That sounds technical, but the concept is pretty simple. Imagine a giant library where every book is a piece of data. PE-AV’s job is to organize this library so that the video of a "waterfall," the sound of a "waterfall," and the text "waterfall" are all sitting right next to each other on the same shelf.

By doing this, the AI doesn't just process a video; it understands it in a much more human-like way.
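To make the library analogy concrete, here's a tiny Python sketch of what "sitting on the same shelf" means in practice. The three-number vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and cosine similarity is just one common way to measure closeness, not necessarily the exact measure PE-AV uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'same shelf'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "video: waterfall": [0.90, 0.10, 0.20],
    "audio: waterfall": [0.80, 0.20, 0.10],
    "text: waterfall": [0.85, 0.15, 0.15],
    "text: barking dog": [0.10, 0.90, 0.30],
}

# Compare everything against the text "waterfall".
query = embeddings["text: waterfall"]
for name, vector in embeddings.items():
    print(f"{name:20s} {cosine_similarity(query, vector):.3f}")
```

The waterfall video and waterfall audio score close to 1.0 against the waterfall text, while the barking dog lands much lower. That gap is the whole point of a shared embedding space.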

How Did They Build This Thing?

The architecture behind PE-AV is pretty clever. Instead of trying to cram everything into one giant processor, they took a "team of specialists" approach.

  • The Video Expert: It has a dedicated "tower" that processes video. It first looks at individual frames (like photos) and then analyzes how those frames change over time to understand motion and action.
  • The Audio Expert: Another tower is dedicated to sound. It takes raw audio waves and converts them into neat little digital tokens, kind of like turning a spoken sentence into written words for easier analysis.
  • The Fusion Manager: The outputs from the video and audio towers then go to a "fusion encoder." This is like the team manager who takes reports from both specialists and creates a single, unified summary that captures the essence of both.
  • The Text Translator: Finally, a text encoder helps connect all of this to human language, allowing you to search for a video of a "cheering crowd" or the sound of a "strumming guitar" using just words.

This setup means you get one powerful model that can handle all sorts of requests. You can find a video based on a sound, find an audio clip based on a text description, or find a video based on a text description. It’s incredibly flexible.
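Here's a rough sketch of why one shared space makes that any-to-any search possible. Everything in it is a stand-in: the `Clip` class, the `search` function, and the two-number vectors are hypothetical and only mimic what the towers would output. The fusion encoder is left out to keep it short.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    video: list  # stand-in for the video tower's output
    audio: list  # stand-in for the audio tower's output

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, clips, modality):
    """Rank clip names by similarity between the query and one modality's embedding."""
    scored = [(dot(query, getattr(clip, modality)), clip.name) for clip in clips]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical library of two clips.
clips = [
    Clip("concert", video=[0.9, 0.1], audio=[0.8, 0.2]),
    Clip("dog park", video=[0.1, 0.9], audio=[0.2, 0.8]),
]

# Stand-in for what a text encoder might produce for "cheering crowd".
text_query = [1.0, 0.0]

print(search(text_query, clips, "video"))      # text -> video
print(search(text_query, clips, "audio"))      # text -> audio
print(search(clips[1].audio, clips, "video"))  # audio -> video
```

Because every modality lives in the same space, the same `search` function handles all three directions. Only the query and the column you search against change.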

The Secret Weapon: A "Data Engine" That Makes Its Own Labels

Here’s the part that really blew my mind. To train an AI like this, you need a mind-boggling amount of data—specifically, videos that have been carefully labeled with descriptions of both what’s happening and what you’re hearing. Meta trained PE-AV on about 100 million video clips.

Now, can you imagine trying to pay humans to watch and label 100 million videos? It would be impossible.

So, the researchers built a brilliant two-stage "data engine" to create high-quality synthetic captions automatically.

Stage 1: The Rough Draft

First, they took a bunch of unlabeled video clips. They ran them through several existing, decent-but-not-perfect AI models: some that caption video, some that caption audio. They fed all these imperfect captions into a Large Language Model (LLM). The LLM's job was to act as an editor, creating three clean captions for each clip: one for just the visuals, one for just the audio, and one for both combined. They used this data to train the first version of PE-AV.

Stage 2: The Polished Final Version

Next, they took that first version of PE-AV and paired it with another powerful Meta model. This duo went back over the data and refined the captions, making them even more accurate and detailed. It’s like the AI taught itself to be a better data labeler.

This automated process allowed them to create a massive, high-quality dataset covering everything from speech and music to everyday sounds, which is crucial for building a truly general-purpose model.
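If it helps to see the shape of that pipeline, here's a toy version. Every function is a hypothetical stub: the "captioners" return canned strings, `llm_merge` just glues text together where the real system would call an LLM, and `refine` stands in for the stage 2 pairing of the first PE-AV with another model. It only shows how data flows, not how Meta implemented any of it.

```python
def weak_video_captioners(clip):
    # Stand-ins for existing video captioning models (canned output).
    return ["a man holds a guitar", "a person on a stage"]

def weak_audio_captioners(clip):
    # Stand-ins for existing audio captioning models (canned output).
    return ["guitar music", "a crowd cheering"]

def llm_merge(video_caps, audio_caps):
    """Stand-in for the LLM editor: produce the three caption types per clip."""
    visual = "; ".join(video_caps)
    audio = "; ".join(audio_caps)
    return {
        "visual": visual,
        "audio": audio,
        "audiovisual": f"{visual} | {audio}",
    }

def stage1(clips):
    """Rough draft: merge the weak captions for each unlabeled clip."""
    return {
        clip: llm_merge(weak_video_captioners(clip), weak_audio_captioners(clip))
        for clip in clips
    }

def stage2(captions, refine):
    """Polish: pass every stage 1 caption through a refinement function."""
    return {
        clip: {kind: refine(clip, kind, text) for kind, text in caps.items()}
        for clip, caps in captions.items()
    }

draft = stage1(["clip_001"])
# Hypothetical refiner that only tidies capitalization.
final = stage2(draft, lambda clip, kind, text: text.capitalize())
print(final["clip_001"]["audiovisual"])
```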

Teaching the AI to Make Connections

So you have the model and the data. How do you actually teach it? They used a technique called "contrastive learning."

Think of it like a matching game for the AI. The model is shown a video clip, its corresponding audio, and its text caption. It's trained to pull these three things closer together in that "giant library" I mentioned earlier. At the same time, it’s shown a bunch of unrelated clips, sounds, and text, and it's trained to push those far apart.

By doing this millions of times across ten different types of pairs (audio-to-text, video-to-text, audio-to-video, etc.), the model builds an incredibly intuitive understanding of what sights, sounds, and words belong together.
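For the curious, here's a minimal sketch of the matching game for a single pair type, say audio-to-text. It uses a softmax-based (InfoNCE-style) contrastive loss, which is a common choice for this kind of training. I'm assuming it purely for illustration and not claiming it's the exact loss PE-AV uses. The function names, the temperature value, and the toy vectors are all mine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where item i in `anchors` should match item i in `positives`."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Low loss when the true match has by far the highest score.
        total += log_sum - logits[i]
    return total / len(anchors)

def symmetric_loss(x, y, temperature=0.1):
    """Average both directions, e.g. audio-to-text and text-to-audio."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy batch of two clips with 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]   # captions line up with their audio
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # captions paired with the wrong audio

print("matched :", symmetric_loss(audio, text_matched))
print("shuffled:", symmetric_loss(audio, text_shuffled))
```

The matched batch gives a loss near zero and the shuffled one gives a large loss. Training nudges the encoders so that real pairs look like the first case. Repeat that for each of the pair types and you get the shared space.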

Okay, But How Good Is It, Really?

This is where it gets impressive. PE-AV isn't just a cool research project; it's setting new records.

When compared to other leading models, PE-AV came out on top across a wide range of audio and video benchmarks. For example:

  • Finding Audio with Text: On a benchmark called AudioCaps, its text-to-audio retrieval score (Recall@1, the share of queries where the correct clip is ranked first) jumped from 35.4% to 45.8%.
  • Understanding Video Content: On the popular ActivityNet benchmark for video retrieval, it also showed significant gains.
  • Zero-Shot Video Classification: On the tough Kinetics-400 video dataset, it outperformed models that were two to four times its size. That’s like a lightweight boxer outperforming a heavyweight—it’s just more efficient.

Across the board, whether it's understanding speech, music, or general sounds, PE-AV is pushing the state of the art forward.
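A quick note on how retrieval numbers like these are typically computed. The usual metric is recall@k: for each query, did the correct item show up in the top k results? Here's a small sketch with a made-up similarity table. The scores and the `recall_at_k` helper are mine, not taken from any benchmark code.

```python
def recall_at_k(similarity, k):
    """similarity[i][j] scores query i against item j; item i is the correct answer for query i."""
    hits = 0
    for i, row in enumerate(similarity):
        # Item indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Hypothetical scores for 3 text queries against 3 audio clips.
scores = [
    [0.9, 0.2, 0.1],  # query 0: correct clip ranked first
    [0.3, 0.4, 0.8],  # query 1: correct clip ranked second
    [0.1, 0.2, 0.7],  # query 2: correct clip ranked first
]

print("recall@1:", recall_at_k(scores, 1))  # 2 of 3 queries hit
print("recall@2:", recall_at_k(scores, 2))  # all 3 queries hit
```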

One More Thing: Pinpointing Sounds in Time

Alongside the main PE-AV model, Meta also released a specialist called PE-A-Frame. If PE-AV tells you a video contains a "guitar solo," PE-A-Frame can tell you the guitar solo happens from the 1-minute, 5-second mark to the 1-minute, 20-second mark.

It does this by creating an embedding for every tiny slice of audio (about 40 milliseconds). This allows for incredibly precise sound event localization, which is a huge deal for applications like audio editing, security, and accessibility.
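Here's a simplified sketch of how per-frame scores can turn into timestamps. I'm assuming each 40 ms frame has already been scored against a text query (for example, by comparing its embedding with the embedding of "guitar solo"), and that a fixed threshold decides what counts as a match. The threshold, the scores, and the `localize` function are illustrative assumptions, not the model's actual post-processing.

```python
FRAME_SECONDS = 0.040  # roughly 40 ms per frame, per the figure above

def localize(frame_scores, threshold=0.5, frame_seconds=FRAME_SECONDS):
    """Return (start, end) times in seconds for runs of frames at or above the threshold."""
    segments = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a matching run begins
        elif score < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None  # the run ended on the previous frame
    if start is not None:
        # The run continues to the end of the audio.
        segments.append((start * frame_seconds, len(frame_scores) * frame_seconds))
    return segments

# Hypothetical per-frame match scores for a very short clip.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6]
for start, end in localize(scores):
    print(f"{start:.2f}s to {end:.2f}s")
```

At exactly 40 ms per frame, a solo running from 1:05 to 1:20 would cover frames 1625 through 2000, which is why this level of slicing gives such fine-grained boundaries.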

Why This All Matters for You and Me

This isn't just about researchers one-upping each other on benchmarks. This technology is the foundation for the next wave of AI tools.

In fact, PE-AV is the perception engine driving Meta’s SAM Audio model. SAM Audio can do magical things, like isolating a single instrument from a full band recording or separating one person's voice from a noisy crowd. To do that, it needs to first understand the prompt, whether it’s text ("isolate the drums") or visual (clicking on the drummer in a video). PE-AV is the brain that makes that initial understanding possible.

What we're seeing here is a fundamental step toward AI that perceives the world in a more holistic, human-like way. It’s an AI that doesn’t just see a silent movie but understands the full, rich, audiovisual story. And that opens up a whole new world of possibilities.


© 2026 Aicosoft. All rights reserved.