Let's be honest, we all love Python. It's clean, it's readable, and the ecosystem is just incredible. But sometimes, when you're running heavy-duty simulations or scientific computations, you hit a wall. That pure Python speed just doesn't cut it.
You start thinking, "Do I need to rewrite this whole thing in C++ or CUDA?" It’s a classic dilemma: do you stick with the language you love for its simplicity, or do you jump to something more complex for the sake of performance?
What if I told you there's a way to get the best of both worlds?
That’s where NVIDIA Warp comes in, and trust me, it’s a bit of a game-changer. It’s a Python framework that lets you write code that can run on the GPU (or even multi-core CPUs) with incredible speed, but you get to write it all in a familiar Python-like syntax.
Today, I want to walk you through how this actually works. We're not just going to talk theory. We’re going to get our hands dirty and build a few things, from a basic parallel task to a full-blown differentiable physics simulation that can literally teach itself to solve a problem. Let's get started.
First Things First: Getting Your Environment Ready
Before we can build anything cool, we need to set up our workshop. The first step is making sure Warp and a few other helpful libraries like NumPy and Matplotlib are installed.
Think of this as grabbing your tools from the toolbox. We'll also initialize Warp and have it check if you've got a CUDA-enabled GPU. If you do, awesome! We'll use it. If not, no worries—Warp is smart enough to fall back to your CPU.
Here's the setup code. It’s a simple script that checks for the necessary packages and installs them if they're missing.
import sys
import subprocess
import pkgutil
def _install_if_missing(packages):
missing = [p for p in packages if pkgutil.find_loader(p["import_name"]) is None]
if missing:
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + [p["pip_name"] for p in missing])
_install_if_missing([
{"import_name": "warp", "pip_name": "warp-lang"},
{"import_name": "numpy", "pip_name": "numpy"},
{"import_name": "matplotlib", "pip_name": "matplotlib"},
])
import math
import time
import numpy as np
import matplotlib.pyplot as plt
import warp as wp
# Initialize Warp
wp.init()
# Check for a GPU and set our device
device = "cuda:0" if wp.is_cuda_available() else "cpu"
print(f"Using Warp device: {device}")
With that, we're ready to go. You can see we've printed out which device we're using. Now for the fun part.
Your First Warp Kernels: Parallelism in Action
The core concept in Warp is the "kernel." A kernel is just a function that you can tell your GPU to run thousands or even millions of times simultaneously. Each one of these runs is called a thread.
Imagine you need to paint a million pixels on a screen. The slow way is to paint them one by one. The parallel way is to hire a million tiny painters and have each one paint a single pixel, all at the exact same time. A Warp kernel is the instruction you give to all those tiny painters.
Let's write two simple kernels to see this in action.
The Classic SAXPY Operation
First up is a classic in parallel computing called SAXPY. It stands for "Single-Precision A*X Plus Y," which is a fancy way of saying we're going to compute a*x + y for huge arrays x and y. It's the "Hello, World!" of GPU programming.
@wp.kernel
def saxpy_kernel(a: wp.float32,
x: wp.array(dtype=wp.float32),
y: wp.array(dtype=wp.float32),
out: wp.array(dtype=wp.float32)):
# Get the unique ID for this thread
i = wp.tid()
# Each thread computes one element of the output array
out[i] = a * x[i] + y[i]
See that @wp.kernel decorator? That's the magic. It tells Warp, "Hey, take this Python function and transform it into super-fast code that can run on the GPU." Inside, wp.tid() gives us the unique ID of the thread, so each thread knows which element of the array it's responsible for.
Generating an Image with Math
Let's try something a bit more visual. We'll write a kernel that generates a procedural image based on a mathematical formula called a signed-distance field (SDF). Each thread will be responsible for calculating the color of a single pixel.
@wp.kernel
def image_sdf_kernel(width: int, height: int, pixels: wp.array(dtype=wp.float32)):
tid = wp.tid()
# Figure out the (x, y) coordinate for this pixel
x = tid % width
y = tid // width
# Normalize coordinates to a [-1, 1] range
fx = 2.0 * (wp.float32(x) / wp.float32(width - 1)) - 1.0
fy = 2.0 * (wp.float32(y) / wp.float32(height - 1)) - 1.0
# Some math to define shapes (circles and a wave)
r1 = wp.sqrt((fx + 0.35)**2 + fy * fy) - 0.28
r2 = wp.sqrt((fx - 0.25)**2 + (fy - 0.15)**2) - 0.18
wave = fy + 0.25 * wp.sin(8.0 * fx)
# Combine the shapes
d = wp.min(r1, r2)
d = wp.max(d, -wave)
# Calculate the final pixel value
value = wp.exp(-18.0 * wp.abs(d))
pixels[tid] = value
Don't worry too much about the math here. The key takeaway is that we're running this calculation for every single pixel at the same time.
Let's Run It!
Okay, we’ve defined our kernels. Now let's launch them.
First, we'll run the SAXPY kernel on a million elements and time it. We'll also check the result against what NumPy gives us to make sure our math is correct.
n = 1_000_000
a = np.float32(2.5)
# Create some data on the CPU with NumPy
x_np = np.linspace(0.0, 1.0, n, dtype=np.float32)
y_np = np.linspace(1.0, 2.0, n, dtype=np.float32)
# Copy the data to our target device (GPU or CPU) using Warp arrays
x_wp = wp.array(x_np, dtype=wp.float32, device=device)
y_wp = wp.array(y_np, dtype=wp.float32, device=device)
out_wp = wp.empty(n, dtype=wp.float32, device=device)
# Launch the kernel!
t0 = time.time()
wp.launch(kernel=saxpy_kernel,
dim=n,
inputs=[a, x_wp, y_wp],
outputs=[out_wp],
device=device)
wp.synchronize() # Wait for the GPU to finish
t1 = time.time()
# Check our work
out_np = out_wp.numpy()
expected = a * x_np + y_np
max_err = np.max(np.abs(out_np - expected))
print(f"SAXPY runtime: {t1 - t0:.4f}s, max error: {max_err:.6e}")
You should see an incredibly fast runtime. This is the power of running things in parallel.
Now, let's run our image kernel and see what we get.
width, height = 512, 512
pixels_wp = wp.empty(width * height, dtype=wp.float32, device=device)
# Launch the kernel across all pixels
wp.launch(kernel=image_sdf_kernel,
dim=width * height,
inputs=[width, height],
outputs=[pixels_wp],
device=device)
wp.synchronize()
# Reshape the 1D pixel array into a 2D image and show it
img = pixels_wp.numpy().reshape(height, width)
plt.figure(figsize=(6, 6))
plt.imshow(img, origin="lower")
plt.title(f"Warp procedural field on {device}")
plt.axis("off")
plt.show()
Boom! A complex image generated almost instantly. Pretty cool, right?
Building a Particle Simulation
Okay, simple vector math is one thing, but what about something more dynamic? Let's build a simple 2D particle simulation. We'll have a bunch of particles bouncing around in a box, affected by gravity.
We'll need two kernels for this:
- An
initkernel: This will set the starting positions and velocities of our particles. - A
simulatekernel: This will move the particles forward one step in time, applying physics like gravity and making them bounce off the walls.
# Kernel to initialize particle state
@wp.kernel
def init_particles_kernel(
n_particles: int,
px0: wp.array(dtype=wp.float32), py0: wp.array(dtype=wp.float32),
vx0: wp.array(dtype=wp.float32), vy0: wp.array(dtype=wp.float32),
px: wp.array(dtype=wp.float32), py: wp.array(dtype=wp.float32),
vx: wp.array(dtype=wp.float32), vy: wp.array(dtype=wp.float32)):
p = wp.tid()
px[p] = px0[p]
py[p] = py0[p]
vx[p] = vx0[p]
vy[p] = vy0[p]
# Kernel to run one step of the simulation
@wp.kernel
def simulate_particles_kernel(
n_particles: int, dt: wp.float32,
gravity: wp.float32, damping: wp.float32,
bounce: wp.float32, radius: wp.float32,
px: wp.array(dtype=wp.float32), py: wp.array(dtype=wp.float32),
vx: wp.array(dtype=wp.float32), vy: wp.array(dtype=wp.float32)):
tid = wp.tid()
s = tid // n_particles # which timestep
p = tid % n_particles # which particle
# Index for the current state and the next state
i0 = s * n_particles + p
i1 = (s + 1) * n_particles + p
# Load current state
x = px[i0]
y = py[i0]
u = vx[i0]
v = vy[i0]
# Apply gravity
v = v + gravity * dt
# Update position
x = x + u * dt
y = y + v * dt
# Boundary collisions (bouncing)
if y < radius:
y = radius
v = -bounce * v
u = damping * u
if x < -1.0 + radius:
x = -1.0 + radius
u = -bounce * u
if x > 1.0 - radius:
x = 1.0 - radius
u = -bounce * u
# Store the new state
px[i1] = x
py[i1] = y
vx[i1] = u
vy[i1] = v
The simulation kernel might look a little complex, but it's just basic physics. The important part is that we're going to run this for every particle and for every time step in parallel.
Let's set up the simulation parameters and run it.
n_particles = 256
steps = 300
dt = np.float32(0.01)
gravity = np.float32(-9.8)
damping = np.float32(0.985)
bounce = np.float32(0.82)
radius = np.float32(0.03)
# ... (code to set up initial positions/velocities px0_wp, py0_wp, etc.) ...
# This part is in the original code, setting up particles in a circle
# Create arrays to store the entire history of the simulation
state_size = (steps + 1) * n_particles
px_wp = wp.empty(state_size, dtype=wp.float32, device=device)
# ... (and so on for py_wp, vx_wp, vy_wp) ...
# Launch the init kernel
wp.launch(
kernel=init_particles_kernel,
dim=n_particles,
# ... inputs and outputs ...
)
# Launch the simulation kernel for all steps and all particles
wp.launch(
kernel=simulate_particles_kernel,
dim=steps * n_particles,
# ... inputs and outputs ...
)
wp.synchronize()
# ... (code to plot the trajectories) ...
When you run the full code and plot the results, you'll see the beautiful trajectories of all the particles bouncing around. And it calculates all 300 steps for all 256 particles incredibly fast.
The Grand Finale: Differentiable Physics
Now for the really mind-bending part. What if our simulation could not only show us what happens but also help us achieve a specific goal? This is called "differentiable physics."
The idea is simple: we define a "loss" function that measures how far we are from our goal. For example, "How far did our projectile land from the target?" Because Warp can automatically calculate gradients (i.e., how the output changes when you tweak the inputs), we can use optimization algorithms to automatically find the inputs that minimize our loss.
In other words, the simulation can teach itself how to hit the target!
We'll set up a simple projectile problem. Our goal is to find the initial launch velocity (vx, vy) that makes a projectile land at a specific target (x, y).
Here are the kernels we'll need:
# Kernel to set the initial state of the projectile
@wp.kernel
def init_projectile_kernel(
x_hist: wp.array(dtype=wp.float32), y_hist: wp.array(dtype=wp.float32),
vx_hist: wp.array(dtype=wp.float32), vy_hist: wp.array(dtype=wp.float32),
init_vx: wp.array(dtype=wp.float32), init_vy: wp.array(dtype=wp.float32)):
x_hist[0] = 0.0
y_hist[0] = 0.0
vx_hist[0] = init_vx[0]
vy_hist[0] = init_vy[0]
# Kernel to advance the projectile one step
@wp.kernel
def projectile_step_kernel(
dt: wp.float32, gravity: wp.float32,
x_hist: wp.array(dtype=wp.float32), y_hist: wp.array(dtype=wp.float32),
vx_hist: wp.array(dtype=wp.float32), vy_hist: wp.array(dtype=wp.float32)):
s = wp.tid()
# ... (basic projectile motion physics) ...
x_hist[s + 1] = x
y_hist[s + 1] = y
vx_hist[s + 1] = vx
vy_hist[s + 1] = vy
# Kernel to calculate how far we missed the target
@wp.kernel
def projectile_loss_kernel(
steps: int, target_x: wp.float32, target_y: wp.float32,
x_hist: wp.array(dtype=wp.float32), y_hist: wp.array(dtype=wp.float32),
loss: wp.array(dtype=wp.float32)):
# Calculate squared distance to target
dx = x_hist[steps] - target_x
dy = y_hist[steps] - target_y
loss[0] = dx * dx + dy * dy
Now, we'll set up an optimization loop. In each iteration, we'll:
- Run the simulation.
- Calculate the loss.
- Use Warp's
Tapeobject to automatically compute the gradients of the loss with respect to our initial velocities. - Update the velocities slightly in the direction that reduces the loss.
# ... (simulation parameters) ...
vx_value = np.float32(2.0) # Our initial guess
vy_value = np.float32(6.5)
lr = 0.08 # Learning rate
iters = 60
for it in range(iters):
# Tell Warp we need gradients for these inputs
init_vx_wp = wp.array(..., requires_grad=True)
init_vy_wp = wp.array(..., requires_grad=True)
# ... (and for the state arrays) ...
loss_wp = wp.zeros(1, ..., requires_grad=True)
tape = wp.Tape()
with tape:
# Launch all our kernels inside the tape
wp.launch(kernel=init_projectile_kernel, ...)
wp.launch(kernel=projectile_step_kernel, ...)
wp.launch(kernel=projectile_loss_kernel, ...)
# This is the magic! Automatically compute gradients.
tape.backward(loss=loss_wp)
# Get the gradients and update our velocities
grad_vx = float(init_vx_wp.grad.numpy()[0])
grad_vy = float(init_vy_wp.grad.numpy()[0])
vx_value -= lr * grad_vx
vy_value -= lr * grad_vy
# ... (print progress) ...
When you run this loop, you'll see the loss value dropping with each iteration as the initial velocity gets closer and closer to the perfect launch speed. After 60 iterations, it will have found an almost perfect solution to hit the target.
So, What Did We Just Do?
Let's take a step back. We went from a simple, one-line math operation to a dynamic particle simulation, and finally to a self-optimizing physics problem that learns the correct parameters to achieve a goal.
And the most amazing part? We did it all from a comfortable Python environment. We didn't have to write a single line of CUDA or C++. Warp handled the heavy lifting of compiling our Python-like code into high-performance kernels that fly on the GPU.
This is what makes tools like Warp so powerful. They bridge the gap between ease of use and raw performance, allowing us to build complex, high-speed simulations and even dip our toes into the world of differentiable physics without leaving our favorite language. Whether you're in scientific computing, robotics, or computer graphics, this is a seriously powerful tool to have in your back pocket.




