Have you ever seen those behind-the-scenes videos from movies or video games? The ones with actors in funny-looking suits covered in little white balls? That’s traditional motion capture, or "mo-cap." It’s incredibly powerful, but let’s be honest, it’s also incredibly expensive and complicated. You need special cameras, dedicated lab space, and a whole lot of patience.
But what if you could do something similar without all the fuss? What if you could take videos from a few regular cameras—even GoPros or smartphones—and use AI to create a detailed 3D skeleton that moves just like the person in the video?
That’s the promise of markerless motion capture, and it’s no longer a far-off dream. Tools are emerging that put this power into the hands of researchers, developers, and creators everywhere. Today, we're going to walk through one of my favorites: a powerful open-source pipeline called Pose2Sim. We’ll go from raw video files to a full-blown biomechanical analysis, and we’ll do it all together in a free Google Colab notebook.
Think of this as a hands-on workshop. I’ll be your guide, and by the end, you’ll not only have run a full 3D motion pipeline but you’ll actually understand what’s happening at every step of the way. Let’s get started.
Setting Up Our Digital Workbench
Before we can start building our virtual skeleton, we need to get our tools in order. We're going to be working in Google Colab, which is fantastic because it gives us free access to a GPU—something that will seriously speed up the AI part of our work.
First, we need to install Pose2Sim. It’s a Python package that cleverly bundles everything we need: the AI for detecting human poses (RTMPose), tools for 3D reconstruction, and even connections to a biomechanics engine called OpenSim.
Once that’s installed, we’ll do a quick check to see if a GPU is available. If you’re in Colab, you can usually select one from the "Runtime" menu. If you don't have one, this will still work, but the pose detection step will feel like it’s moving through molasses. With a GPU, it’ll fly.
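Here is a minimal sketch of that GPU check. It assumes an NVIDIA GPU and simply looks for the nvidia-smi tool that ships with the driver. It does not prove that Pose2Sim's inference backend can actually use the GPU, so treat a positive result as a good sign rather than a guarantee.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if an NVIDIA GPU appears to be visible on this machine."""
    # nvidia-smi is installed with the NVIDIA driver; if it is missing, assume no GPU.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the detected GPUs, one per line
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected: pose estimation will fall back to the CPU")
```

If this prints the second message in Colab, switching the runtime type to a GPU and re-running the cell is usually enough.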
Next, we’ll copy over a demo project. This gives us a clean, ready-to-go dataset with videos and camera calibration files, so we don't have to worry about shooting our own footage just yet.
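Copying the demo can be sketched like this. The helper name copy_demo_project is my own, and the commented usage assumes the demo lives in a Demo_SinglePerson folder inside the installed package, which you should confirm for your Pose2Sim version.

```python
import shutil
from pathlib import Path


def copy_demo_project(demo_src: Path, workdir: Path) -> Path:
    """Copy a demo folder into workdir and return the path of the fresh copy."""
    dest = Path(workdir) / Path(demo_src).name
    # Start from a clean copy so results from an earlier run do not linger.
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(demo_src, dest)
    return dest


# Hypothetical usage (assumes Pose2Sim is installed and ships this folder):
#   import Pose2Sim
#   demo_src = Path(Pose2Sim.__file__).parent / "Demo_SinglePerson"
#   project_dir = copy_demo_project(demo_src, Path("/content"))
```

Working on a copy means you can always get back to a known-good starting point by running the cell again.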
The Project's Control Panel: Config.toml
Every Pose2Sim project has a heart, and it's a little file called Config.toml. Think of this file as the central control panel or the recipe for our entire process. It has settings for every single step, from what AI model to use to how to filter the final data.
For our Colab environment, we need to make a few small tweaks. Since Colab runs "headless" (meaning there’s no screen attached), we have to tell Pose2Sim not to try and pop up any display windows. We’re essentially putting it in a quiet, non-visual mode so it can run smoothly in the background. We’ll also adjust a few settings to balance speed and accuracy for this tutorial.
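One way to apply the headless tweak is to flip the display flags directly in the config text, which keeps the file's comments intact. This is only a sketch: the key names below (display_detection, display_sync_plots, display_figures) and the sample excerpt are assumptions, so compare them with the Config.toml that ships with your version before relying on them.

```python
import re

# Assumed names of the settings that open windows; verify against your Config.toml.
DISPLAY_KEYS = ("display_detection", "display_sync_plots", "display_figures")


def set_toml_bool(toml_text: str, key: str, value: bool) -> str:
    """Rewrite every `key = true|false` line in the text, leaving comments alone."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)(true|false)\b",
        re.MULTILINE,
    )
    replacement = "true" if value else "false"
    return pattern.sub(lambda m: m.group(1) + replacement, toml_text)


def make_headless(toml_text: str, display_keys=DISPLAY_KEYS) -> str:
    """Turn off every display-related flag listed in display_keys."""
    for key in display_keys:
        toml_text = set_toml_bool(toml_text, key, False)
    return toml_text


# A made-up excerpt, just to show the effect.
sample = """[pose]
display_detection = true  # show live detections
overwrite_pose = true
[synchronization]
display_sync_plots = true
[filtering]
display_figures = true
"""
headless = make_headless(sample)
print(headless)
```

To use it on a real project you would read Config.toml as text, pass it through make_headless, and write the result back.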
With our environment set up and our config file tuned, we’re ready for the fun part.
The Assembly Line: From 2D Pixels to 3D Motion
The journey from a video file to a set of joint angles is like an assembly line. Each step takes the output from the previous one, refines it, and passes it along. Let’s walk through it station by station.
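The assembly line can be sketched as a loop over stage functions. The stage names below are the ones Pose2Sim documents at the time of writing, but please check them against your installed version. The run_stages helper is my own illustration, and it works with any object that exposes those names.

```python
# Stage names in pipeline order (assumed to match your Pose2Sim version).
STAGES = (
    "calibration",
    "poseEstimation",
    "synchronization",
    "personAssociation",
    "triangulation",
    "filtering",
    "kinematics",
)


def run_stages(pipeline, stages=STAGES):
    """Call each stage on `pipeline` in order and return the names that completed."""
    completed = []
    for name in stages:
        stage = getattr(pipeline, name)  # raises AttributeError if a stage is missing
        stage()
        completed.append(name)
        print(f"finished: {name}")
    return completed


# Hypothetical usage from inside the project folder:
#   from Pose2Sim import Pose2Sim
#   run_stages(Pose2Sim)
```

Running the stages one at a time like this makes it easier to see which station a failure comes from.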
Step 1: Teaching the Cameras Their Place (Calibration)
Before we can combine views from multiple cameras, we have to teach them about their relationship to each other and to the world. This is calibration. It figures out two things for each camera:
- Intrinsics: The camera's internal properties, like its focal length, optical center, and lens distortion. Think of this as giving the camera a proper pair of glasses.
- Extrinsics: The camera's exact position and orientation in 3D space. This is like giving each camera a precise GPS coordinate and a compass direction.
Our demo project already has this information, so we’re just converting it into the format Pose2Sim understands. If you were starting from scratch, you’d typically film a checkerboard pattern from all your cameras to calculate this automatically.
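To see how intrinsics and extrinsics work together, here is a small sketch of the pinhole camera model. Extrinsics (a rotation and a translation) move a world point into the camera's frame, and intrinsics (focal lengths and optical center) turn it into a pixel. Lens distortion is left out to keep things short, and all the numbers are made up for illustration.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with an ideal pinhole camera."""
    # Extrinsics: world coordinates -> camera coordinates (R @ X + t).
    xc, yc, zc = (
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then scale by focal length and shift to the center.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return u, v


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A camera at the world origin looking down +Z, with a 1920x1080 image.
pixel = project_point((0.1, -0.2, 2.0), IDENTITY, (0.0, 0.0, 0.0),
                      fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(pixel)
```

Calibration is the inverse problem: given many known points and where they land in the image, find the rotation, translation, and lens parameters that explain them.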
Step 2: Playing Connect-the-Dots with AI (2D Pose Estimation)
This is where the AI magic really begins. We feed our videos to a model called RTMPose. Its job is to go through every single frame, from every camera, and identify the location of key body joints—like the shoulders, elbows, knees, and ankles.
It’s basically playing a super-fast, super-accurate game of connect-the-dots on the human body. The output is a series of 2D coordinates (x, y pixel locations) for each joint, for each frame, for each video. This is the raw material for our 3D reconstruction.
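If you want to peek at the 2D results, the sketch below reads one frame of detections. It assumes the OpenPose-style JSON layout: a "people" list where each person has a flat "pose_keypoints_2d" array of x, y, confidence triplets. Confirm that against the files in your own pose folder. The sample frame is invented.

```python
import json


def parse_keypoints(json_text: str):
    """Return one list of (x, y, confidence) tuples per detected person."""
    frame = json.loads(json_text)
    people = []
    for person in frame.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        # Regroup the flat array into triplets, ignoring any incomplete tail.
        triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat) - 2, 3)]
        people.append(triplets)
    return people


# Invented sample: one person, two keypoints.
sample_frame = '{"people": [{"pose_keypoints_2d": [100.0, 200.0, 0.9, 110.0, 210.0, 0.8]}]}'
detections = parse_keypoints(sample_frame)
print(detections)
```

The confidence value is worth keeping an eye on, since low-confidence joints are the ones most likely to cause trouble later in 3D.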
Step 3: Getting Everyone on the Same Page (Synchronization & Association)
So now we have folders full of 2D body poses, one for each camera. But there’s a problem: the videos probably didn't start at the exact same millisecond. We need to sync them up. Pose2Sim does this cleverly by comparing motion patterns across cameras (like the vertical speed of a person's ankle) and finding the time offset at which they correlate best. It’s the digital equivalent of a film editor's clapperboard.
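The idea behind motion-based syncing can be sketched in a few lines: turn a keypoint's vertical position into a speed signal, then try a range of frame offsets and keep the one with the highest correlation. This is a simplified illustration on synthetic data, not Pose2Sim's actual implementation, and the function names are my own.

```python
import math


def vertical_speed(y):
    """Frame-to-frame change in vertical position."""
    return [b - a for a, b in zip(y, y[1:])]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_offset(ref_y, other_y, max_lag):
    """Frames by which `other_y` lags `ref_y` (negative if it leads)."""
    ref_v, other_v = vertical_speed(ref_y), vertical_speed(other_y)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Shift one signal so that frame k of ref lines up with frame k + lag of other.
        a, b = (ref_v, other_v[lag:]) if lag >= 0 else (ref_v[-lag:], other_v)
        n = min(len(a), len(b))
        if n < 2:
            continue
        score = pearson(a[:n], b[:n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def ankle_height(i):
    """Synthetic ankle height, standing in for a real keypoint track."""
    return math.sin(0.3 * i) + 0.5 * math.sin(0.11 * i)


camera_a = [ankle_height(i) for i in range(100)]
camera_b = [ankle_height(i - 5) for i in range(100)]  # same motion, seen 5 frames later
offset = best_offset(camera_a, camera_b, max_lag=10)
print(offset)
```

Using speed rather than position helps because the same movement looks different in pixel position from each camera, while the timing of fast ups and downs is shared.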
Once they’re synced, we need to make sure that the person detected in camera A is matched with the same person in camera B. For our single-person demo, this is easy. In a multi-person video, this step uses geometry to figure out who is who across all the different views.
Step 4: The Leap into the Third Dimension (Triangulation)
This is the moment we’ve been waiting for. We take the synchronized 2D keypoints from all our cameras and use them to calculate where each joint is in 3D space.
It’s a bit like how our own brains perceive depth. By having two different views (our eyes), we can judge distance. Triangulation does the same thing with math, using the 2D points from at least two cameras to pinpoint a single 3D coordinate (X, Y, Z) for every joint. The output is a .trc file, which is a standard format for 3D marker data that tools like OpenSim can read.
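Here is a sketch of the geometry for the simplest case, two cameras. Each 2D detection defines a ray leaving its camera, and we take the midpoint of the shortest segment between the two rays. This midpoint method is a stand-in I chose for clarity; Pose2Sim's own triangulation is more elaborate, and a real pipeline would first turn pixels into rays using the calibration.

```python
def triangulate_midpoint(center1, dir1, center2, dir2):
    """Midpoint of the shortest segment between two rays (camera center + direction)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [a - b for a, b in zip(center1, center2)]
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, depth cannot be recovered")
    # Parameters of the closest points along each ray.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(center1, dir1)]
    p2 = [ci + t * di for ci, di in zip(center2, dir2)]
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Two cameras 2 m apart, both seeing a joint located at (1, 1, 3).
cam1, cam2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
joint = (1.0, 1.0, 3.0)
ray1 = tuple(j - c for j, c in zip(joint, cam1))
ray2 = tuple(j - c for j, c in zip(joint, cam2))
point_3d = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(point_3d)
```

With noisy detections the two rays no longer cross exactly, which is why the midpoint (or a least-squares solution across more cameras) is used instead of a true intersection.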
Step 5: Smoothing Out the Jitters (Filtering)
The raw 3D data from triangulation can be a little noisy. Tiny errors in the 2D detection can lead to small "jitters" in the 3D motion. The filtering step is all about smoothing this out.
It applies a low-pass filter (Butterworth is the default) to the 3D trajectories, much like a graphic designer might use a smoothing tool on a shaky line. This results in a much cleaner, more natural-looking motion that’s better for analysis.
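To show what this kind of smoothing does, here is a plain-Python sketch of a second-order low-pass Butterworth filter run forward and then backward so the result is not delayed in time. The 6 Hz cutoff and 60 fps frame rate are illustrative values, and this is not Pose2Sim's code; in practice you would let the pipeline (or a library such as SciPy) do this for you.

```python
import math


def butter2_lowpass_coeffs(cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass coefficients via the bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b = (b0, 2.0 * b0, b0)
    a = (2.0 * (k * k - 1.0) * norm, (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a


def _filter_once(x, b, a):
    """One causal pass; the first sample is subtracted to soften the start-up transient."""
    offset = x[0]
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for sample in x:
        xn = sample - offset
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn + offset)
    return out


def zero_phase_lowpass(x, cutoff_hz, fs_hz):
    """Filter forward, then backward, so the phase shifts of the two passes cancel."""
    if not x:
        return []
    b, a = butter2_lowpass_coeffs(cutoff_hz, fs_hz)
    forward = _filter_once(list(x), b, a)
    backward = _filter_once(forward[::-1], b, a)
    return backward[::-1]


# A slow 1 Hz movement sampled at 60 fps, plus frame-to-frame jitter.
clean = [math.sin(2.0 * math.pi * n / 60.0) for n in range(120)]
noisy = [c + 0.2 * (-1) ** n for n, c in enumerate(clean)]
smooth = zero_phase_lowpass(noisy, cutoff_hz=6.0, fs_hz=60.0)
```

The cutoff frequency is the knob that matters most: lower values smooth more aggressively but start to flatten fast, real movements.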
Step 6: The Final Frontier (OpenSim Kinematics)
We have clean, 3D coordinates for a bunch of keypoints. That’s amazing, but it’s still just a collection of moving dots. The final step is to turn this into real biomechanical data.
This is where OpenSim, a powerful biomechanics simulator, comes in. This step does two crucial things:
- Scaling: It takes a generic, one-size-fits-all human skeleton model and scales it to match the dimensions of the person in your video.
- Inverse Kinematics (IK): This is the mind-bending part. The IK solver figures out the set of joint angles (like knee flexion, hip rotation, etc.) that would make the scaled model’s markers best match the 3D keypoints we just calculated.
The result? A file containing the angles of each joint in the body, measured in degrees, for every frame of the video. This is the kind of kinematic data used in sports science, rehabilitation, and animation.
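For intuition about what a joint angle is, here is a purely geometric shortcut: the angle at the knee between the thigh and the shank, computed from three 3D points. To be clear, this is not inverse kinematics and not what OpenSim does, since it ignores the skeleton model and its constraints, but it makes a handy rough check against the real output. The coordinates are invented.

```python
import math


def angle_at_joint(proximal, joint, distal):
    """Angle in degrees at `joint` between the segments to `proximal` and `distal`."""
    u = [p - j for p, j in zip(proximal, joint)]
    v = [d - j for d, j in zip(distal, joint)]
    norm_u = math.sqrt(sum(c * c for c in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("segment has zero length")
    cos_theta = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def knee_flexion(hip, knee, ankle):
    """0 degrees for a straight leg, increasing as the knee bends."""
    return 180.0 - angle_at_joint(hip, knee, ankle)


straight = knee_flexion((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0))
bent = knee_flexion((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0))
print(straight, bent)
```

A three-point angle like this cannot separate flexion from rotation, which is exactly the kind of ambiguity a full skeleton model resolves.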
Did It Work? Let's See the Results!
Running a pipeline is one thing; understanding the output is another. A bunch of data files isn't very intuitive. So, let's visualize what we've just created.
We can plot the 3D keypoints from a single frame to see our virtual skeleton hanging in space. We can also plot the trajectory of specific points—like a hand or a foot—over time to see the path they traveled. This is a great way to do a quick "sanity check" to see if the motion looks plausible.
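Before plotting anything, it helps to get the .trc data into Python. The sketch below is a bare-bones reader that assumes the usual tab-separated TRC layout (marker names on the fourth line, data starting after the fifth) and no missing values. The marker names and numbers in the sample are invented, and Y is treated as the vertical axis, following OpenSim's convention.

```python
def parse_trc(text: str):
    """Return (times, {marker_name: [(x, y, z), ...]}) from TRC file contents."""
    lines = text.splitlines()
    # Line 4 holds "Frame#", "Time", then each marker name followed by empty cells.
    names = [field for field in lines[3].split("\t")[2:] if field.strip()]
    times = []
    trajectories = {name: [] for name in names}
    for line in lines[5:]:
        fields = line.strip().split("\t")
        if len(fields) < 2 + 3 * len(names):
            continue  # skip blank or incomplete rows
        times.append(float(fields[1]))
        for i, name in enumerate(names):
            x, y, z = (float(v) for v in fields[2 + 3 * i: 5 + 3 * i])
            trajectories[name].append((x, y, z))
    return times, trajectories


# Invented three-frame file with two markers.
sample_trc = "\n".join([
    "PathFileType\t4\t(X/Y/Z)\tdemo.trc",
    "DataRate\tCameraRate\tNumFrames\tNumMarkers\tUnits",
    "60\t60\t3\t2\tm",
    "Frame#\tTime\tHip\t\t\tRAnkle\t\t\t",
    "\t\tX1\tY1\tZ1\tX2\tY2\tZ2",
    "",
    "1\t0.0\t0.0\t1.0\t0.0\t0.1\t0.10\t0.0",
    "2\t0.5\t0.0\t1.0\t0.1\t0.1\t0.25\t0.1",
    "3\t1.0\t0.0\t1.0\t0.2\t0.1\t0.12\t0.2",
])
times, trajectories = parse_trc(sample_trc)
ankle_heights = [y for _, y, _ in trajectories["RAnkle"]]
ankle_vertical_range = max(ankle_heights) - min(ankle_heights)
print(ankle_vertical_range)
```

A quick number like the vertical range of an ankle is often enough to catch a scale or axis mix-up, for example a foot that appears to travel several meters upward during a walk.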
If you were able to run the final OpenSim step, you can even plot the joint angles. You can see a graph of how the knee angle changes as the person walks, or how the shoulder rotates. This is where you can start to ask really interesting questions about the quality of the movement itself.
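Joint angles from OpenSim come as a .mot file, which is a short header ending in "endheader", a row of column names, and then one row of numbers per frame. Here is a minimal reader under that assumption. The column name knee_angle_r is a common OpenSim coordinate name, but yours depends on the model, and the sample values are invented.

```python
def parse_mot(text: str):
    """Return {column_name: [values]} from the contents of an OpenSim .mot file."""
    lines = text.splitlines()
    # Everything up to and including "endheader" is metadata.
    header_end = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "endheader"
    )
    columns = lines[header_end + 1].split()
    data = {name: [] for name in columns}
    for line in lines[header_end + 2:]:
        values = line.split()
        if len(values) != len(columns):
            continue  # skip blank or malformed rows
        for name, value in zip(columns, values):
            data[name].append(float(value))
    return data


def range_of_motion(values):
    """Difference between the largest and smallest angle in the series."""
    return max(values) - min(values)


# Invented sample with two coordinates over four frames.
sample_mot = "\n".join([
    "Coordinates",
    "version=1",
    "nRows=4",
    "nColumns=3",
    "inDegrees=yes",
    "endheader",
    "time\tknee_angle_r\thip_flexion_r",
    "0.00\t5.0\t20.0",
    "0.25\t35.0\t10.0",
    "0.50\t62.0\t-5.0",
    "0.75\t12.0\t15.0",
])
angles = parse_mot(sample_mot)
knee_rom = range_of_motion(angles["knee_angle_r"])
print(knee_rom)
```

From here, plotting angles["knee_angle_r"] against angles["time"] with your favorite plotting library gives you the knee curve described above.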
Becoming the Pilot: Advanced Tweaks and Tips
Once you've run the pipeline once, you can start experimenting. Remember that Config.toml file? You can change anything in there.
- Want higher accuracy? You can switch the RTMPose model from "balanced" to "performance" mode. It'll be slower, but the 2D detections might be better.
- Tracking multiple people? Just flip a switch in the config, and the whole pipeline will adapt.
- Not happy with the smoothing? You can change the filter type or adjust its parameters to be more or less aggressive.
This programmatic control is what makes Pose2Sim so powerful for research. You can easily run experiments to see how different settings affect the final outcome. And when you're ready to run your own project, you can simply call Pose2Sim.runAll() to execute the entire assembly line with a single command.
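If you want to run several experiments, it is handy to keep a base configuration and layer small overrides on top of it. The sketch below shows one way to do that with nested dictionaries. The merge helper is my own, and the section and key names (project.multi_person, pose.mode, filtering.type) are assumptions that you should match to your actual Config.toml.

```python
import copy


def merge_config(base: dict, overrides: dict) -> dict:
    """Return a new config where values in `overrides` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling settings in the same section survive.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Assumed key names, for illustration only.
base_config = {
    "project": {"multi_person": False, "frame_rate": "auto"},
    "pose": {"mode": "balanced"},
    "filtering": {"type": "butterworth"},
}
experiment = merge_config(
    base_config,
    {"pose": {"mode": "performance"}, "project": {"multi_person": True}},
)
print(experiment)
```

Whether a dictionary like this can be handed straight to the Pose2Sim functions depends on your version, so check its documentation. If not, the same idea works by writing one Config.toml per experiment.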
Where to Go From Here
In a remarkably short amount of time, we've gone from a few simple video files to a sophisticated biomechanical analysis. We installed a complete motion capture pipeline, calibrated our virtual cameras, used AI to detect a person's pose, reconstructed it in 3D, and calculated their joint angles.
The barrier to entry for human motion analysis has never been lower. You don’t need a million-dollar lab anymore; you just need a few cameras and a bit of curiosity.
So now it's your turn. Try this with your own videos. Film yourself doing a squat, a golf swing, or just walking across a room. Experiment with the settings, see what works, and explore the incredible world of movement that is now, quite literally, at your fingertips. What will you build?




