Writing fast GPU code is one of the most grueling specializations in machine learning engineering. Researchers from RightNow AI want to automate it entirely.
The RightNow AI research team has released AutoKernel, an open-source framework that applies an autonomous LLM agent loop to GPU kernel optimization for arbitrary PyTorch models. The approach is straightforward: give it any model before you go to bed, and wake up to faster Triton kernels — no GPU expertise required.
https://arxiv.org/pdf/2603.21331
Why GPU Kernels Are So Hard to Optimize
A GPU kernel is a function that runs in parallel across thousands of GPU cores. When you run a transformer model like LLaMA or GPT-2, the bulk of compute time is spent inside kernels for operations like matrix multiplication (matmul), softmax, layer normalization, and attention. These kernels live in libraries like cuBLAS and cuDNN, or get generated automatically by PyTorch’s compilation pipeline.
The problem is that squeezing maximum performance out of these kernels requires reasoning simultaneously about arithmetic intensity, memory coalescing, register pressure, tile sizes, warp-level synchronization, and tensor core instruction selection — a combination of skills that takes years to develop. A single high-performance matmul kernel may involve 200+ lines of CUDA or Triton code with dozens of interdependent parameters. This expertise is scarce, and the manual tuning process scales poorly as model architectures evolve.
The benchmark suite KernelBench, which evaluates frontier LLMs on 250 GPU kernel problems, found that even the best models matched PyTorch baseline performance in fewer than 20% of cases using one-shot generation. AutoKernel was built directly in response to that gap.
The Loop: Edit, Benchmark, Keep or Revert
AutoKernel’s core insight is that an expert kernel engineer’s workflow is itself a simple loop: write a candidate, benchmark it, keep improvements, discard regressions, repeat. The framework mechanizes this loop. An LLM agent modifies a single file — kernel.py — a fixed benchmark harness verifies correctness and measures throughput, and the result determines whether the change persists. Crucially, every experiment maps to a git commit. Kept experiments advance the branch; reverted experiments are erased cleanly with git reset. The entire history is browsable with standard git tools, and experiment results are logged to a plain tab-separated results.tsv file — dependency-free, human-readable, and trivially parseable by the agent.
Each iteration takes approximately 90 seconds — 30 seconds for correctness checking, 30 seconds for performance benchmarking via Triton’s do_bench, and 30 seconds for agent reasoning and code modification. At roughly 40 experiments per hour, an overnight 10-hour run yields 300 to 400 experiments across multiple kernels.
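The loop can be sketched in a few lines of Python. This is an illustrative simulation, not AutoKernel's actual code: `check` stands in for the correctness-plus-benchmark harness, and the log rows mimic the tab-separated `results.tsv` format.

```python
def keep_revert_loop(candidates, baseline_ms, check):
    """Keep a candidate only if it passes correctness AND beats the current
    best latency; otherwise revert (discard) it. Returns the winning
    candidate, the best latency, and a results.tsv-style log."""
    best_ms, kept, log = baseline_ms, None, []
    for i, cand in enumerate(candidates):
        correct, latency_ms = check(cand)
        decision = "keep" if (correct and latency_ms < best_ms) else "revert"
        if decision == "keep":
            best_ms, kept = latency_ms, cand
        log.append(f"{i}\t{latency_ms:.3f}\t{decision}")  # one TSV row per experiment
    return kept, best_ms, log

# Toy harness: each candidate is (name, latency_ms, passes_correctness).
check = lambda c: (c[2], c[1])
cands = [("tile128", 1.20, True),    # slower than baseline -> revert
         ("tile64", 0.90, True),     # faster and correct   -> keep
         ("unrolled", 0.50, False)]  # fastest but wrong    -> revert
kept, best, log = keep_revert_loop(cands, baseline_ms=1.00, check=check)
# kept[0] == "tile64", best == 0.90
```

Note the third candidate: raw speed never overrides correctness, which is exactly why the harness gates the benchmark.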
This design draws directly from Andrej Karpathy’s autoresearch project, which demonstrated that an AI agent running a keep/revert loop on LLM training code could discover 20 optimizations across 700 experiments in two days on a single GPU. AutoKernel transplants this loop to kernel code, with a different search space and a correctness-gated benchmark as the evaluation function instead of validation loss.
The agent reads a 909-line instruction document called program.md, which encodes expert knowledge into a six-tier optimization playbook. The tiers progress from block size tuning (sweeping tile dimensions through powers of 2, adjusting num_warps and num_stages) through memory access patterns (coalesced loads, software prefetching, L2 swizzling), compute optimizations (TF32 accumulation, epilogue fusion), advanced techniques (split-K, persistent kernels, Triton autotune, warp specialization), architecture-specific strategies (TMA on Hopper, cp.async on Ampere, adjusted sizes for L4/RTX), and finally kernel-specific algorithms like online softmax for attention and Welford’s algorithm for normalization. The instruction document is intentionally comprehensive so the agent can run 10+ hours without getting stuck.
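Tier 1 of that playbook — the block-size sweep — amounts to enumerating a small grid of launch parameters. A minimal sketch with illustrative values (the actual ranges in program.md may differ); in practice each dict would feed a `triton.Config` for `@triton.autotune`:

```python
from itertools import product

def tier1_candidates():
    """Tier-1 sweep: tile dimensions stepped through powers of two, plus
    warp and pipeline-stage counts. Values are illustrative, not the
    exact grid from program.md."""
    tiles = [32, 64, 128, 256]
    warps = [2, 4, 8]
    stages = [2, 3, 4]
    return [
        {"BLOCK_M": m, "BLOCK_N": n, "num_warps": w, "num_stages": s}
        for m, n, w, s in product(tiles, tiles, warps, stages)
    ]

configs = tier1_candidates()
# 4 tile_M x 4 tile_N x 3 warps x 3 stages = 144 candidates
```

Even this simplest tier produces a search space large enough that an autonomous loop beats manual trial and error.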
Profiling First, Optimizing Where It Matters
Unlike prior work that treats kernel problems in isolation, AutoKernel starts from a complete PyTorch model. It uses torch.profiler with shape recording to capture per-kernel GPU time, then ranks optimization targets using Amdahl’s law — the mathematical principle that the overall speedup you can achieve is bounded by how much of the total runtime that component represents. A 1.5× speedup on a kernel consuming 60% of total runtime yields a 1.25× end-to-end gain. The same speedup on a kernel consuming 5% of runtime yields only about 1.02×.
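The arithmetic is just Amdahl's law, S = 1 / ((1 − p) + p / s), where p is the kernel's share of total runtime and s its local speedup:

```python
def amdahl(p, s):
    """End-to-end speedup when a fraction p of runtime gets a local speedup s."""
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl(0.60, 1.5), 2))  # 1.25: big kernel, big payoff
print(round(amdahl(0.05, 1.5), 2))  # 1.02: small kernel, negligible payoff
```

This is why the profiler ranks kernels by GPU-time share before the agent touches any code.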
The profiler detects GPU hardware from a database of known specifications covering both NVIDIA (H100, A100, L40S, L4, A10, RTX 4090/4080/3090/3080) and AMD (MI300X, MI325X, MI350X, MI355X) accelerators. For unknown GPUs, it estimates peak FP16 throughput from SM count, clock rate, and compute capability — making the system usable across a wider range of hardware than just the latest NVIDIA offerings.
The orchestrator (orchestrate.py) transitions from one kernel to the next when any of four conditions are met: five consecutive reverts, 90% of GPU peak utilization reached, a two-hour elapsed time budget, or a 2× speedup already achieved on that kernel. This prevents the agent from spending excessive time on kernels with diminishing returns while higher-impact targets wait.
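Those four advance conditions reduce to a single OR. A sketch with illustrative names — the real orchestrate.py logic may be structured differently:

```python
def should_advance(consecutive_reverts, peak_utilization, elapsed_s, speedup):
    """Move to the next kernel when ANY budget or target is hit."""
    return (
        consecutive_reverts >= 5      # stuck: five straight reverts
        or peak_utilization >= 0.90   # near roofline: little headroom left
        or elapsed_s >= 2 * 3600      # two-hour wall-clock budget exhausted
        or speedup >= 2.0             # good enough: 2x already achieved
    )

print(should_advance(0, 0.45, 600, 1.3))  # False: keep iterating
print(should_advance(5, 0.45, 600, 1.3))  # True: stuck, move on
```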
Five-Stage Correctness Harness
Performance without correctness is useless, and AutoKernel is particularly thorough on this front. Every candidate kernel passes through five validation stages before any speedup is recorded. Stage 1 runs a smoke test on a small input to catch compilation errors and shape mismatches in under a second. Stage 2 sweeps across 8 to 10 input configurations and three data types — FP16, BF16, and FP32 — to catch size-dependent bugs like boundary handling and tile remainder logic. Stage 3 tests numerical stability under adversarial inputs: for softmax, rows of large identical values; for matmul, extreme dynamic range; for normalization, near-zero variance. Stage 4 verifies determinism by running the same input three times and requiring bitwise identical outputs, which catches race conditions in parallel reductions and non-deterministic atomics. Stage 5 tests non-power-of-two dimensions like 1023, 4097, and 1537 to expose masking bugs and tile remainder errors.
Tolerances are dtype-specific: FP16 uses atol = 10⁻², BF16 uses 2 × 10⁻², and FP32 uses 10⁻⁴. In the paper’s full evaluation across 34 configurations on an NVIDIA H100, all 34 passed correctness with zero failures across eager, compiled, and custom kernel outputs.
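The dtype-specific tolerances and the stage-4 determinism check can be sketched as follows. This is a toy version: `run_kernel` is a hypothetical stand-in for the candidate under test, and the real harness compares torch tensors on the GPU rather than Python lists:

```python
ATOL = {"fp16": 1e-2, "bf16": 2e-2, "fp32": 1e-4}  # dtype-specific tolerances

def close_enough(ref, out, dtype):
    """Elementwise absolute-tolerance comparison against the reference oracle."""
    return all(abs(r - o) <= ATOL[dtype] for r, o in zip(ref, out))

def deterministic(run_kernel, x, runs=3):
    """Stage 4: the same input must produce bitwise-identical output every run,
    which flushes out race conditions and non-deterministic atomics."""
    first = run_kernel(x)
    return all(run_kernel(x) == first for _ in range(runs - 1))

print(close_enough([1.0, 2.0], [1.004, 1.999], "fp16"))          # True: within 1e-2
print(deterministic(lambda x: [v * v for v in x], [1.0, 2.0]))   # True: pure function
```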
Dual Backend: Triton and CUDA C++
AutoKernel supports both Triton and CUDA C++ backends within the same framework. Triton is a Python-like domain-specific language whose kernels JIT-compile in 1 to 5 seconds, making it ideal for rapid iteration — the agent can modify block sizes, warp counts, pipeline stages, accumulator precision, and loop structure. Triton routinely reaches 80 to 95% of cuBLAS throughput for matmul. CUDA C++ is included for cases requiring direct access to warp-level primitives, WMMA tensor core instructions (using 16×16×16 fragments), vectorized loads via float4 and half2, bank-conflict-free shared memory layouts, and double buffering. Both backends expose the same kernel_fn() interface, so the benchmark infrastructure runs identically regardless of backend.
The system covers nine kernel types spanning the dominant operations in modern transformer architectures: matmul, flash_attention, fused_mlp, softmax, layernorm, rmsnorm, cross_entropy, rotary_embedding, and reduce. Each has a PyTorch reference implementation in reference.py serving as the correctness oracle, and the benchmark computes throughput in TFLOPS or GB/s alongside roofline utilization against detected GPU peak.
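For a memory-bound kernel, roofline utilization is simply achieved bytes per second over the GPU's peak bandwidth. A small helper, using H100's 3,352 GB/s peak as the example:

```python
def bandwidth_utilization(bytes_moved, seconds, peak_gb_s):
    """Achieved GB/s and fraction of peak bandwidth for a memory-bound kernel."""
    achieved = bytes_moved / seconds / 1e9
    return achieved, achieved / peak_gb_s

# E.g. a kernel moving 2788 GB in one second against H100's 3352 GB/s peak:
achieved, frac = bandwidth_utilization(2788e9, 1.0, 3352.0)
print(round(frac, 2))  # 0.83, i.e. 83% of peak
```

The compute-bound analogue divides achieved TFLOPS by the detected GPU's peak throughput instead.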
Benchmark Results on H100
Measured on an NVIDIA H100 80GB HBM3 GPU (132 SMs, compute capability 9.0, CUDA 12.8) against PyTorch eager and torch.compile with max-autotune, the results for memory-bound kernels are significant. RMSNorm achieves 5.29× over eager and 2.83× over torch.compile at the largest tested size, reaching 2,788 GB/s — 83% of H100’s 3,352 GB/s peak bandwidth. Softmax reaches 2,800 GB/s with a 2.82× speedup over eager and 3.44× over torch.compile. Cross-entropy achieves 2.21× over eager and 2.94× over torch.compile, reaching 2,070 GB/s. The gains on these kernels come from fusing multi-operation ATen decompositions into single-pass Triton kernels that minimize HBM (High Bandwidth Memory) traffic.
AutoKernel outperforms torch.compile on 12 of the 16 representative configurations benchmarked in the paper, despite torch.compile with max-autotune running its own Triton autotuning. TorchInductor’s generic fusion and autotuning does not always find the specialized tiling and reduction strategies that kernel-specific implementations exploit.
Matmul is notably harder — PyTorch’s cuBLAS backend is extensively tuned per GPU architecture. The Triton starter reaches 278 TFLOPS, well below cuBLAS. However, at the 2048³ size, AutoKernel beats torch.compile by 1.55×, demonstrating that TorchInductor’s matmul autotuning is not always optimal either. Closing the cuBLAS gap remains the primary target for continued agent iteration.
In community deployment, an AutoKernel-optimized kernel took first place on the vectorsum_v2 B200 leaderboard with a latency of 44.086µs, outperforming the second-place entry at 44.249µs and third place at 46.553µs. A community user also reported that a single AutoKernel prompt — requiring approximately three minutes of agent interaction — produced a Triton FP4 matrix multiplication kernel that outperforms CUTLASS by 1.63× to 2.15× across multiple shapes on H100. CUTLASS represents hand-optimized C++ template code specifically designed for NVIDIA tensor cores, making this result particularly notable.
Key Takeaways
- AutoKernel turns weeks of expert GPU tuning into an overnight autonomous process. By mechanizing the write-benchmark-keep/revert loop that expert kernel engineers already follow, the system runs 300 to 400 experiments per overnight session on a single GPU without any human intervention.
- Correctness is non-negotiable before any speedup is recorded. Every candidate kernel must pass a five-stage harness covering smoke tests, shape sweeps across 10+ configurations, numerical stability under adversarial inputs, determinism verification, and non-power-of-two edge cases — eliminating the risk of the agent “optimizing” its way to incorrect outputs.
- Memory-bound kernels see the biggest gains over both PyTorch eager and torch.compile. On an NVIDIA H100, AutoKernel’s Triton kernels achieve 5.29× over eager on RMSNorm, 2.82× on softmax, and 2.21× on cross-entropy — with the gains coming from fusing multi-operation ATen decompositions into single-pass kernels that minimize HBM traffic.
- Amdahl’s law drives where the agent spends its time. Rather than optimizing kernels in isolation, AutoKernel profiles the entire PyTorch model and allocates effort proportionally to each kernel’s share of total GPU runtime — ensuring that improvements compound at the model level, not just the kernel level.

