ROCm is the software gatekeeper to AMD's GPU ambitions. ROCm 7.0 delivered a 3.5x inference performance improvement over v6, and MI355X hardware benchmarks show competitive or better performance than NVIDIA B200 on specific workloads. But the CUDA gap remains real for training: AMD trails by 10-30% on compute-intensive workloads and requires more manual optimization. The critical question is whether competitive hardware can durably compensate for the software gap.
Open-source compilers are gradually eroding CUDA's lock-in. Triton (OpenAI) enables hardware-agnostic kernel development, PyTorch 2.0's torch.compile reduces the need for hand-written CUDA, and JAX supports AMD GPUs natively. However, NVIDIA is responding: by open-sourcing CUDA Tile IR and building on MLIR/LLVM infrastructure, it may make it harder for AMD to differentiate through open tooling. The ecosystem battle is far from won.
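To make the abstraction point concrete, here is a minimal JAX sketch (not from the source): the user writes no vendor-specific kernel code, and XLA compiles the same function for whichever backend is installed, whether CPU, CUDA, or ROCm. The GELU activation used here is just an illustrative workload.

```python
import jax
import jax.numpy as jnp

# JAX traces this function once and compiles it via XLA for whatever
# backend is present (CPU, NVIDIA CUDA, or AMD ROCm). The Python source
# is identical across vendors; only the installed jaxlib differs.
@jax.jit
def gelu(x):
    # tanh approximation of GELU, a common transformer activation
    return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

x = jnp.linspace(-3.0, 3.0, 8)
y = gelu(x)

# Reports which backend XLA targeted, e.g. "cpu" or "gpu"
print(jax.devices()[0].platform)
```

This is the sense in which abstraction layers erode lock-in: the portability burden shifts from every application developer to a small number of compiler backends that AMD (or NVIDIA) must maintain.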
Open question: what percentage of AMD GPU workloads use ROCm natively versus running through abstraction layers like Triton or JAX?