NVIDIA's New `cuda-oxide`: Finally, You Can Write CUDA Kernels in Pure Rust

Akram Chauhan

Alright, let's talk. If you've ever tried to write code that runs directly on an NVIDIA GPU, you know the deal. For years, it's mostly been a C++ world, using the CUDA toolkit. It’s powerful, no doubt, but it can also feel a bit… dated. And let's be honest, it lacks the safety and modern ergonomics we've come to love from languages like Rust.

People have tried to bridge this gap. We've seen Python-embedded DSLs like Triton and various Rust projects trying to make it work, but there's always been a catch. Usually, you're dealing with complex bindings, a domain-specific language (DSL) with its own rules, or jumping through hoops to connect your Rust code to some C++ under the hood.

Well, a team of researchers at NVIDIA AI just dropped something that might change the game entirely: a new experimental compiler called cuda-oxide.

And the headline is a big one: you can write CUDA GPU kernels in standard Rust code. No C++, no FFI bindings, no weird DSLs. Just Rust, compiled directly to PTX—the assembly-like language that NVIDIA GPUs actually understand.

This is a pretty big deal, so let's unpack what it is, how it works, and how you can actually try it out yourself.

So, What’s the Big Idea Behind cuda-oxide?

Think of it this way: cuda-oxide isn't trying to hide CUDA from you. It's doing the opposite. The project's philosophy is about "bringing CUDA into Rust."

This means it embraces the core concepts of CUDA—the SIMT execution model (Single Instruction, Multiple Threads), the way you think about thread blocks and grids, and the raw power of device intrinsics—but lets you express all of it using safe, native Rust.
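To make the grid/block/thread vocabulary concrete, here is a small CPU-only sketch of how a 1D launch maps each (block, thread) pair to a global element index, following the usual CUDA formula `blockIdx.x * blockDim.x + threadIdx.x`. The function names are made up for illustration and are not part of cuda-oxide.

```rust
/// Global index of a thread in a 1D launch, mirroring CUDA's
/// `blockIdx.x * blockDim.x + threadIdx.x`. (Hypothetical helper.)
fn global_index_1d(block_idx: u32, block_dim: u32, thread_idx: u32) -> usize {
    block_idx as usize * block_dim as usize + thread_idx as usize
}

/// Number of blocks needed to cover `n` elements (ceiling division).
/// (Hypothetical helper.)
fn blocks_for(n: usize, block_dim: u32) -> u32 {
    n.div_ceil(block_dim as usize) as u32
}

fn main() {
    let n = 1000;
    let block_dim = 256;
    let grid_dim = blocks_for(n, block_dim);

    // Walk every (block, thread) pair and check that no two pairs
    // land on the same global index.
    let mut seen = vec![false; (grid_dim * block_dim) as usize];
    for b in 0..grid_dim {
        for t in 0..block_dim {
            let i = global_index_1d(b, block_dim, t);
            assert!(!seen[i], "index {i} produced twice");
            seen[i] = true;
        }
    }
    println!("{grid_dim} blocks of {block_dim} threads cover {n} elements");
}
```

Note that the grid usually overshoots the data (here 1024 threads for 1000 elements), which is why kernels need a bounds check before touching memory.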

It feels less like a generic Rust function that just happens to run on a GPU, and more like you're writing a traditional __global__ CUDA C++ kernel, but with all the benefits of Rust's type system and safety guarantees.

This makes it different from its closest neighbor, the fantastic rust-cuda project. rust-cuda is more focused on "bringing Rust to NVIDIA GPUs," aiming to get things like async/.await and parts of the standard library working on the device. The two projects are actually complementary, and the teams are even coordinating. It's not a competition; it's about giving developers more options.

Let's Look Under the Hood: The Compilation Pipeline

So how does this magic actually happen? cuda-oxide is a custom backend for the Rust compiler, rustc. When you compile your code, it hijacks the final code generation step and runs it through a special pipeline just for your GPU code.

It looks something like this: Rust Source → Stable MIR → Pliron IR → LLVM IR → PTX

Let's break that down, because the details are pretty cool.

  • Stable MIR: The Rust compiler's internal representation (MIR) can change with every nightly update, which would be a nightmare for a tool like this. cuda-oxide cleverly uses rustc_public (also known as Stable MIR), the Rust project's effort to expose compiler internals through a versioned, more stable API. This means cuda-oxide can read the MIR without breaking every time you update your toolchain. Smart.

  • Pliron: This is maybe the most interesting part. Instead of using the big, complex C++-based MLIR framework, the team built Pliron—a similar "IR framework" written entirely in Rust. What does this mean for you? The entire compiler builds with cargo. No wrestling with CMake, no C++ toolchain headaches. It’s a huge quality-of-life improvement.

  • LLVM & PTX: After Pliron does its thing, it spits out standard LLVM IR in a text file (.ll). From there, it hands off to the one external part of the chain: the standard llc tool from LLVM, which compiles that IR into the final PTX assembly file for the GPU.

The best part is that your host code (the normal CPU stuff) and your device code (the GPU kernel) live in the same .rs file. The build tool, cargo oxide, is smart enough to find functions marked with a #[kernel] attribute, send them down the GPU pipeline, and compile everything else normally. The result is a standard host binary and a .ptx file sitting right next to it, ready to be loaded at runtime.
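If it helps to see the chain as data, the sketch below writes the stages down as a plain Rust enum and walks them in order. It only restates the pipeline described above; the type and method names are invented for this illustration and say nothing about how cuda-oxide is organized internally.

```rust
/// The representations a kernel passes through, in the order listed above.
/// (Illustrative model only, not a cuda-oxide type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    RustSource,
    StableMir,
    PlironIr,
    LlvmIr,
    Ptx,
}

impl Stage {
    /// The representation this stage is lowered into, or `None` at the end.
    fn lowers_to(self) -> Option<Stage> {
        match self {
            Stage::RustSource => Some(Stage::StableMir),
            Stage::StableMir => Some(Stage::PlironIr),
            Stage::PlironIr => Some(Stage::LlvmIr),
            Stage::LlvmIr => Some(Stage::Ptx),
            Stage::Ptx => None,
        }
    }

    /// The component that performs the lowering out of this stage,
    /// paraphrased from the description above.
    fn lowered_by(self) -> Option<&'static str> {
        match self {
            Stage::RustSource => Some("rustc, exposed through rustc_public"),
            Stage::StableMir => Some("cuda-oxide backend"),
            Stage::PlironIr => Some("Pliron-based lowering"),
            Stage::LlvmIr => Some("llc"),
            Stage::Ptx => None,
        }
    }
}

/// Collect the whole chain, starting from Rust source.
fn pipeline() -> Vec<Stage> {
    let mut stages = vec![Stage::RustSource];
    let mut current = Stage::RustSource;
    while let Some(next) = current.lowers_to() {
        stages.push(next);
        current = next;
    }
    stages
}

fn main() {
    for stage in pipeline() {
        match stage.lowered_by() {
            Some(tool) => println!("{stage:?} (lowered by {tool})"),
            None => println!("{stage:?} (final output)"),
        }
    }
}
```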

Getting Your Hands Dirty: A Quick-Start Guide

Okay, theory is great, but let's get to the fun part. The project is still experimental and has some specific requirements, but getting started is surprisingly straightforward.

Step 1: Check Your Setup

This is a Linux-only affair for now (tested on Ubuntu 24.04). Before you start, make sure you have:

  • Linux (Ubuntu 24.04 is the reference)
  • Rust nightly (a specific version, but the repo handles this for you)
  • CUDA Toolkit 12.x or newer
  • LLVM 21+ (this is important, especially for new GPUs)
  • Clang 21

A quick heads-up on LLVM 21: You really do need version 21 or later if you want to target Hopper or Blackwell GPUs and use their advanced features. Don't skimp on this.

Step 2: Install the Tools

First, clone the repo. The project uses a rust-toolchain.toml file, so the correct nightly version will be installed automatically for you when you run a command inside the directory.

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide

Next, you’ll need to install the cargo-oxide subcommand, which is the main driver for building, running, and debugging.

# Install the cargo-oxide helper tool
cargo install \
  --git https://github.com/NVlabs/cuda-oxide.git \
  cargo-oxide

Once that's done, run the built-in health check. This little command is a lifesaver and will tell you if anything in your environment is misconfigured.

cargo oxide doctor

It checks your Rust version, CUDA toolkit, LLVM installation (and critically, whether it has the NVPTX backend), and Clang. If you see any red, fix it before you go any further.

Common Pitfall: A frequent error is a missing stddef.h file. This usually means you installed a minimal libclang runtime package but not the full clang-21 or libclang-common-21-dev package, which includes the necessary header files. The doctor command should catch this for you.

Step 3: Run Your First Kernel!

The project comes with a classic vecadd example (adding two vectors together). Running it is as simple as:

cargo oxide run vecadd

If all goes well, you’ll see a success message. That's it! You just compiled and ran a pure Rust kernel on your NVIDIA GPU.

You can even see the entire compilation journey, from Rust MIR all the way down to the final PTX, with this command:

cargo oxide pipeline vecadd
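If you want to sanity-check what a vector-add kernel should produce, a plain CPU reference is handy to have around. The sketch below is ordinary host-side Rust with no GPU involved. The function names and the tolerance check are my own, not something taken from the vecadd example in the repo.

```rust
/// CPU reference for element-wise vector addition: out[i] = a[i] + b[i].
/// (Hypothetical helper for checking GPU results.)
fn vecadd_reference(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// True if both slices have the same length and every pair of elements
/// differs by at most `tol`.
fn all_close(got: &[f32], want: &[f32], tol: f32) -> bool {
    got.len() == want.len()
        && got.iter().zip(want).all(|(g, w)| (g - w).abs() <= tol)
}

fn main() {
    let a: Vec<f32> = (0..8).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..8).map(|i| 2.0 * i as f32).collect();

    let want = vecadd_reference(&a, &b);

    // In a real check, `got` would be the buffer copied back from the GPU.
    // Here we reuse the reference so the sketch runs anywhere.
    let got = want.clone();
    assert!(all_close(&got, &want, 1e-6));
    println!("reference result: {want:?}");
}
```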

Writing a Kernel: Safety First

So, what does a cuda-oxide kernel actually look like? Here’s a simple example that scales the elements of an array.

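use cuda_device::{kernel, thread, DisjointSlice};

// Marks `scale` as device code to be sent down the GPU pipeline.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    // Typed per-thread index (a ThreadIndex, not a bare integer).
    let idx = thread::index_1d();

    // Only threads whose index falls inside `out` get a reference;
    // `input` is assumed to be at least as long as `out`.
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}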

This looks a lot like normal Rust, with a few key GPU-specific things.

The #[kernel] attribute marks this as GPU code. But the really clever part is DisjointSlice<T> and thread::index_1d().

In GPU programming, one of the biggest dangers is a data race, where multiple threads try to write to the same memory location at the same time. cuda-oxide offers a beautiful solution for this, which they call "Tier 1 Safety."

The thread::index_1d() function returns a unique hardware ID for each thread. This isn't just a number; it's a special ThreadIndex type. The DisjointSlice::get_mut() method will only accept a ThreadIndex as its argument.

Because each thread is guaranteed by the hardware to have a unique index, it's impossible for two threads to get a mutable reference to the same element. Just like that, a whole class of nasty GPU bugs is eliminated by the type system. No unsafe required. It's the Rust philosophy, applied perfectly to the world of parallel computing.
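To see why the typed index does the heavy lifting, here is a CPU-only model of the pattern. It is a sketch under assumptions, not the cuda_device implementation: the `device_model` module, its `launch` function, and the internals of these two types are invented, and the "threads" simply run one after another. The one idea it borrows from the text is that `ThreadIndex` cannot be constructed by user code, so the only way to reach `get_mut` is with the index the launcher handed you.

```rust
mod device_model {
    /// Index handed out by `launch`. The field is private, so code outside
    /// this module cannot forge an index for some other thread.
    #[derive(Clone, Copy)]
    pub struct ThreadIndex(usize);

    impl ThreadIndex {
        pub fn get(&self) -> usize {
            self.0
        }
    }

    /// Output buffer that can only be indexed with a `ThreadIndex`.
    pub struct DisjointSlice<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> DisjointSlice<'a, T> {
        pub fn new(data: &'a mut [T]) -> Self {
            Self { data }
        }

        /// Returns `None` when the thread's index is out of bounds.
        pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
            self.data.get_mut(idx.0)
        }
    }

    /// Sequential stand-in for a kernel launch: calls `body` once per
    /// thread, each time with a distinct index.
    pub fn launch<T>(
        threads: usize,
        out: &mut [T],
        mut body: impl FnMut(ThreadIndex, DisjointSlice<'_, T>),
    ) {
        for i in 0..threads {
            body(ThreadIndex(i), DisjointSlice::new(&mut *out));
        }
    }
}

use device_model::{launch, DisjointSlice, ThreadIndex};

/// Per-thread body with the same shape as the `scale` kernel above.
fn scale_thread(input: &[f32], factor: f32, mut out: DisjointSlice<'_, f32>, idx: ThreadIndex) {
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}

/// Run the model "kernel" over `threads` simulated threads.
fn scale_on_cpu(input: &[f32], factor: f32, threads: usize) -> Vec<f32> {
    let mut out = vec![0.0; input.len()];
    launch(threads, out.as_mut_slice(), |idx, slice| {
        scale_thread(input, factor, slice, idx)
    });
    out
}

fn main() {
    // Eight simulated threads for three elements: the extras get `None`.
    let result = scale_on_cpu(&[1.0, 2.0, 3.0], 2.0, 8);
    println!("{result:?}");
}
```

Outside `device_model`, writing `ThreadIndex(0)` is a compile error because the field is private. In this model, that is what stops one thread from reaching for another thread's element.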

What's Next? From Simple Kernels to Blazing Speed

Once you've got the basics down, there's a whole world of performance to unlock. You can write generic kernels, pass closures from the host, and use a full suite of GPU intrinsics for warp operations, shared memory, and atomics.

The project even includes some seriously impressive examples:

  • An async_mlp example showing how to do asynchronous GPU execution.
  • A gemm_sol example that achieves a staggering 868 TFLOPS on a Blackwell B200 GPU. That's reported as 58% of the speed-of-light (SoL) reference set by the highly optimized cuBLAS library.
  • Examples targeting the latest Blackwell tensor cores, showing this is a forward-looking project.
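A quick bit of arithmetic helps put the gemm_sol numbers in context. The sketch below back-computes the reference throughput implied by "868 TFLOPS at 58%" and shows the standard way GEMM throughput is counted (2·M·N·K floating-point operations). The helper names and the matrix size are my own. The implied reference figure is derived from the two numbers quoted above, not a value published by the project.

```rust
/// Reference throughput implied by an achieved figure and the fraction
/// of the reference it represents. (Hypothetical helper.)
fn implied_reference_tflops(achieved: f64, fraction: f64) -> f64 {
    assert!(fraction > 0.0 && fraction <= 1.0);
    achieved / fraction
}

/// FLOPs in an M x N x K matrix multiply: one multiply and one add
/// per inner-product term.
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Throughput in TFLOPS for a given FLOP count and elapsed time.
fn tflops(flops: u64, seconds: f64) -> f64 {
    flops as f64 / seconds / 1e12
}

fn main() {
    // 868 / 0.58: the reference the 58% figure would be measured against.
    let reference = implied_reference_tflops(868.0, 0.58);
    println!("implied reference: {reference:.1} TFLOPS");

    // Example size chosen for illustration only.
    let flops = gemm_flops(8192, 8192, 8192);
    // Time such a GEMM would take at 868 TFLOPS.
    let seconds = flops as f64 / 868.0e12;
    println!("8192^3 GEMM: {flops} FLOPs, {:.3} ms at 868 TFLOPS", seconds * 1e3);
    println!("round trip: {:.0} TFLOPS", tflops(flops, seconds));
}
```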

A Small Word of Caution

The project is still version 0.1.0, so it's experimental. The team is transparent about this. For example, they've documented that one of the 2D indexing functions, index_2d(stride), is currently unsound under specific conditions. It's a reminder that this is cutting-edge stuff, but the honesty is refreshing.

For anyone who loves Rust and has wanted a more direct, safer, and more ergonomic way to program NVIDIA GPUs, cuda-oxide feels like a massive step in the right direction. It combines the low-level control of CUDA with the high-level safety of Rust, and that’s a very exciting combination indeed.

© 2026 Aicosoft. All rights reserved.