OpenAI's Triton compiler and Google's JAX framework represent the two most significant structural threats to CUDA lock-in. Triton lets developers write a GPU kernel once and compile it to NVIDIA, AMD, and Intel hardware with near-parity performance: the vLLM inference engine now uses Triton as its cross-platform attention backend, reporting 100.7% of FlashAttention 3 performance on H100 and a 5.8x speedup on AMD MI300 from the same ~800-line codebase (versus roughly 70,000 lines for FlashAttention 3 in CUDA). Triton has 18.8k GitHub stars, its third Developer Conference was hosted by Microsoft (Oct 2025), and NVIDIA itself responded by building a CUDA Tile IR backend FOR Triton -- effectively validating Triton as the emerging standard.
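To make the "write once" claim concrete, here is a minimal Triton kernel in the style of the official vector-add tutorial. This is an illustrative sketch, not vLLM's attention backend; the BLOCK_SIZE of 1024 and the function names are arbitrary choices. The same source compiles to NVIDIA or AMD GPUs depending on which PyTorch/Triton build is installed.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch one program per tile; Triton lowers this to PTX on NVIDIA
    # or to AMDGCN on ROCm builds, with no source changes.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")  # "cuda" also maps to AMD GPUs on ROCm builds
y = torch.rand(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))
```

The contrast with CUDA is the point: the kernel is expressed over tiles rather than individual threads, which is what lets the compiler retarget it across vendors.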
JAX adoption is growing in research and TPU-centric workflows, though PyTorch remains dominant in industry. The PyTorch Foundation's Accelerator Integration Working Group is making PyTorch itself hardware-agnostic, with first-class ROCm support (PyTorch 2.9), an XLA/TPU backend, and Google's TorchTPU initiative backed by Meta. The key dynamic: CUDA's moat is shifting from 'you must write CUDA' to 'CUDA compiles fastest' -- a narrower advantage that depends on sustained performance leadership rather than ecosystem lock-in.
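A sketch of what hardware-agnostic PyTorch looks like in practice: the same model code runs on whatever accelerator the local build exposes. The pick_device helper is a hypothetical name introduced here for illustration; the TPU path via the separate torch_xla package is omitted to keep the example self-contained.

```python
import torch

def pick_device() -> torch.device:
    # torch.cuda is the entry point on both NVIDIA (CUDA) and AMD (ROCm/HIP)
    # builds of PyTorch; ROCm wheels expose AMD GPUs under the "cuda" name.
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Apple-silicon Metal backend, if present.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(8, 512, device=device)
print(model(x).shape, device)
```

Nothing in the model definition names a vendor, which is the working group's goal: backend selection becomes a deployment detail rather than a rewrite.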
Platform moat narrows at edges but holds at core
CUDA remains the dominant platform for AI development, with millions of developers. Alternatives like JAX and Triton are growing but have not yet achieved production parity for most enterprise workloads.
Will Triton's CUDA Tile IR backend achieve performance parity with hand-tuned CUDA kernels for training workloads, or will the 'incubator' status persist?