
CUDA-oxide: NVIDIA’s Rust-to-PTX Compiler [Hands-On]

CUDA-oxide is NVIDIA's experimental Rust-to-CUDA compiler. Here's what the v0.1.0 alpha actually does, how to run your first kernel, and where it breaks.

9 min read · Beginner

If you write GPU kernels and like Rust, you’ve probably hit the same wall everyone else has: the kernels themselves still live in C++. You write your host code in Rust, then drop into a .cu file, then bind back across FFI, then pray nothing leaks. CUDA-oxide is NVIDIA’s attempt to delete that whole step. It’s an experimental Rust-to-CUDA compiler that lets you write SIMT GPU kernels in safe(ish), idiomatic Rust and compiles them directly to PTX – no DSLs, no foreign language bindings.

The reason it landed on everyone’s radar in early May 2026: v0.1.0 dropped as the inaugural release, and Hacker News and Phoronix latched on within hours. This guide is for the person who saw the headline and wants to actually run a kernel today – not read another news rewrite. We’ll get a working install, write one kernel, and then talk about the sharp edges nobody’s covering yet.

What CUDA-oxide actually is (and isn’t)

At its core: a custom rustc codegen backend. Your kernel functions get routed through cuda-oxide’s pipeline instead of rustc’s standard LLVM backend – the #[kernel] proc macro marks which functions get that treatment. Everything else stays in normal Rust.

The middle of that pipeline is where things get interesting. Cuda-oxide uses Pliron – a Rust-native MLIR-like IR framework. Why does that choice matter? Because the entire compiler builds with cargo. The full lowering path is Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX. No separate build system to fight.

What it isn’t: a way to magically port existing Rust code to a GPU. It’s the opposite. The stated design center (from the official ecosystem docs) is “bringing CUDA into Rust” – kernel authoring, device intrinsics, the SIMT execution model expressed natively in safe Rust. Closer in spirit to writing a __global__ function in C++ than to writing a generic Rust function that happens to run on a GPU. If that distinction sounds pedantic now, it’ll bite you the first time you try to call String::push in a kernel.

Getting it installed (the bits the README rushes past)

The official quick start is three commands. As of v0.1.0 (May 2026), requirements include Linux, CUDA Toolkit 12.x+, Rust nightly with rust-src and rustc-dev, LLVM 21+ with NVPTX support, and Clang/libclang headers for host CUDA bindings. macOS and Windows aren’t in the matrix.

cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
cargo oxide doctor
cargo oxide run vecadd

That second command is the one people skip. cargo oxide doctor validates your Rust toolchain, CUDA toolkit (including libNVVM / nvJitLink / libdevice for kernels that use math intrinsics), LLVM installation, and codegen backend in one shot. Run it. Skip it, and the next error you see will be an inscrutable linker complaint about stddef.h or a missing llc-21.

About that LLVM version – this is the first real trap. The compiler emits TMA / tcgen05 / WGMMA intrinsics that llc from LLVM 20 and earlier can’t handle. Simple kernels might still work with an older llc. But anything touching Hopper or Blackwell tensor cores needs LLVM 21+. Your vecadd will compile fine on LLVM 20. That success means nothing for real workloads.

Your first kernel, end to end

Here’s the canonical vecadd from the book. Every concept worth knowing shows up in twelve lines – but the code itself isn’t the interesting part. The interesting part is what DisjointSlice is doing:

use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;

    #[kernel]
    fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        let idx = thread::index_1d();
        let i = idx.get();
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = kernels::load(&ctx).unwrap();
    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();
    module.vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c).unwrap();
}

No unsafe block, yet 1,024 threads write to c concurrently. DisjointSlice::get_mut() only accepts a ThreadIndex – a hardware-derived opaque type that guarantees unique writes per thread. The type system encodes "only the thread that owns this index can write here," which makes data races on output buffers structurally impossible at the API level. That's the design, not a coincidence.

A successful build produces two artifacts side by side: a host binary and a .ptx file (e.g., target/debug/vecadd and target/debug/vecadd.ptx). The CUDA driver loads the PTX at runtime.

The pitfalls nobody’s writing about

The README will tell you it’s alpha. What it won’t spell out is which specific sharp edges to avoid first. Three are worth flagging now – two are documented in early community testing, one is buried in the architecture docs:

  • index_2d(stride) is unsound. Documented as such in the 0.x release. It compiles. It runs. It can produce wrong results. Use the 1D index API and compute your 2D coordinates manually until this is fixed.
  • bar.sync across branches will silently break. Manually duplicating bar.sync instructions across branches violates SIMT convergence and breaks hardware barrier semantics. The “safe(ish)” in the tagline is doing real work here – the compiler won’t catch this.
  • Lazy dependency compilation has a hidden failure mode. Device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, only compiling functions a kernel actually calls. Meaning: a kernel that builds fine today can fail tomorrow when you start calling a different function from the same crate. The error won’t point at the dependency. Budget time for this.

Watch out: If your kernel suddenly stops compiling after a refactor that “only changed host code,” check whether you started calling a new function from a dependency. Lazy MIR compilation means the device-side error surfaces at the kernel, not at the import.

None of this is a dealbreaker. It’s an alpha. But you should know what you’re signing up for before you put cuda-oxide on the critical path of anything that matters.

Here’s the honest question this project raises: how much of the “safe” promise can a SIMT compiler actually deliver? Rust’s ownership model was designed for CPUs. The bar.sync gotcha above suggests there are GPU-specific safety properties that ownership alone can’t encode. That’s not a criticism of cuda-oxide specifically – it’s an open problem in the field. Worth watching how the project evolves its safety story over the next few releases.

How it stacks up against the alternatives

Project    | Approach                              | When to pick it
cuda-oxide | rustc backend → PTX, NVIDIA only      | You want safe Rust syntax but the full CUDA programming model on NVIDIA hardware
rust-cuda  | rustc backend, Rust-first semantics   | You want async/.await and parts of std running on-device
CubeCL     | Embedded DSL with JIT runtime         | One kernel needs to run on NVIDIA + AMD + WGPU
cudarc     | Safe host-side bindings only          | You're launching PTX written elsewhere

The clearest way to understand the split – straight from the official ecosystem docs – is the direction of travel. Rust-cuda is about bringing Rust to NVIDIA GPUs: Rust ergonomics like async/.await, parts of the standard library running on-device, a Rust-first model that abstracts over CUDA concepts. Cuda-oxide is about bringing CUDA into Rust: device intrinsics, SIMT execution, the CUDA programming model expressed natively. Two different problems. Both valid.

So which should you pick? If you already know CUDA C++ and just want Rust ergonomics around it, cuda-oxide will feel familiar fast. If you're a Rust developer who wants to stay in Rust idioms and never think in terms of __global__, rust-cuda is closer to what you want. AMD or WebGPU in the mix? Neither of these – CubeCL is your answer.

What community reaction tells us

The first question on the Hacker News thread wasn’t about correctness or safety features. It was about build times – specifically how Rust CUDA crates that call into CMake or nvcc can be painfully slow to iterate on. That’s a tell. The people most excited about this release aren’t GPU researchers. They’re Rust developers who already burned hours on toolchain pain and want this one to be different.

Whether it is different over the long term is unanswered. We don’t have benchmarks yet, and the project is a couple of releases away from anything you’d ship. But the appetite is clearly there.

FAQ

Can I use cuda-oxide in production?

No. v0.1.0 is an early-stage alpha – expect bugs, incomplete features, and API breakage between releases.

Does it work on AMD GPUs or on Windows?

AMD: no – the compiler targets NVIDIA’s PTX directly. Architecturally out of scope. Windows: not listed in supported platforms as of v0.1.0. Linux with CUDA Toolkit 12.x+ is the target. WSL2 might work if your CUDA setup there is solid, but no confirmed walkthrough exists yet.

Does it support tensor cores and modern Hopper/Blackwell features?

Yes, but with a caveat worth spelling out carefully. The compiler emits TMA, tcgen05, and WGMMA intrinsics – that’s exactly why LLVM 21 is mandatory. If your distro defaulted to LLVM 20, simple kernels like vecadd will run fine. Tensor core kernels will fail at the llc stage. The common misconception: “cargo oxide doctor passed, so I’m ready.” Doctor passing means you can build the example. It does not mean your LLVM is correct for real workloads. Test a kernel that actually uses the features you need before assuming the install is complete.

Try this next

Clone the repo, run cargo oxide doctor, then go straight to crates/rustc-codegen-cuda/examples/. The v0.1.0 release ships 46 example kernels covering vector add, generics, closures, atomics, clusters, async execution, MathDx interop, and GEMM. Pick the one closest to what you actually want to build and read its source – it’s the fastest way to learn what subset of Rust actually compiles inside a #[kernel].