NVIDIA has announced that it swept all seven tests in the MLPerf Training v5.1 benchmark suite for AI workloads, claiming the number one spot in each.
The results were achieved with the GB300 NVL72 rack-scale system, powered by Blackwell Ultra GPUs, which delivered 4x the pretraining speed on Llama 3.1 405B and nearly 5x the fine-tuning speed on Llama 2 70B using the same number of GPUs as last-generation Hopper-based systems. These gains come from the newer Tensor Cores offering 15 petaflops of NVFP4 AI compute, double the attention-layer performance, 279GB of HBM3e memory, and training methods optimized for the GPU's compute capabilities.
NVIDIA also showcased its Quantum-X800 InfiniBand platform, the first end-to-end 800 Gb/s networking solution, enabling unprecedented scale-out bandwidth when connecting multiple GB300 NVL72 systems.
A major breakthrough this round was the adoption of NVFP4 precision — a low-bit format that allows faster calculations while maintaining accuracy. Blackwell GPUs can perform FP4 calculations at double the rate of FP8, while Blackwell Ultra pushes this to three times, significantly boosting per-GPU AI compute performance. NVIDIA remains the only platform to submit MLPerf results using FP4 while meeting the benchmark’s strict accuracy standards.
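To illustrate the idea behind low-bit formats like NVFP4, here is a minimal Python sketch of block-scaled 4-bit quantization. The details are assumptions for illustration, not NVIDIA's implementation: an FP4 E2M1 magnitude set of {0, 0.5, 1, 1.5, 2, 3, 4, 6} and a shared scale factor per 16-element block, so each value is stored as a 4-bit code plus one scale per block.

```python
# Sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Assumptions (illustrative, not NVIDIA's spec): FP4 E2M1 representable
# magnitudes below, and one shared scale per 16-element block.

E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16

def quantize_block(values):
    """Quantize one block: a shared scale plus a 4-bit code per value."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto +/-6
    codes = []
    for v in values:
        mag = abs(v) / scale
        # snap to the nearest representable FP4 magnitude
        q = min(E2M1_LEVELS, key=lambda level: abs(level - mag))
        codes.append((q, -1.0 if v < 0 else 1.0))
    return scale, codes

def dequantize_block(scale, codes):
    return [sign * q * scale for q, sign in codes]

data = [0.8, -1.9, 3.1, 0.05] + [0.0] * 12  # one 16-element block
scale, codes = quantize_block(data)
restored = dequantize_block(scale, codes)
# restored values sit within one quantization step of the originals,
# e.g. 0.8 -> ~0.775 and 3.1 is recovered exactly (it set the scale)
```

Because only 4 bits are stored per value, twice as many operands fit through the same memory and math pipelines as FP8, which is the source of the per-GPU throughput gains described above; the per-block scale is what keeps accuracy within the benchmark's tolerances.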
Scaling up, NVIDIA trained Llama 3.1 405B in just 10 minutes using over 5,000 Blackwell GPUs, 2.7 times faster than its prior record. Even using 2,560 GPUs, training completed in 18.79 minutes, 45% faster than the last round, demonstrating efficient scaling and NVFP4’s impact on performance.
NVIDIA also set records on the round's new benchmarks. The Llama 3.1 8B model, which replaces BERT-large, trained in 5.2 minutes using 512 Blackwell Ultra GPUs, while the FLUX.1 image-generation model, which replaces Stable Diffusion v2, trained in 12.5 minutes on 1,152 GPUs, with NVIDIA as the sole submitter. NVIDIA also dominated the existing benchmarks for graph neural networks, object detection, and recommender systems.