
A3 Ultra and A4 VMs: MLPerf 5.0 Inference results
For MLPerf™ Inference v5.0, we submitted 15 results, including our first submissions with A3 Ultra (NVIDIA H200) and A4 (NVIDIA HGX B200) VMs. The A3 Ultra VM is powered by eight NVIDIA H200 Tensor Core GPUs and offers 3.2 Tbps of non-blocking GPU-to-GPU network bandwidth and twice the high-bandwidth memory (HBM) of A3 Mega VMs with NVIDIA H100 GPUs. Google Cloud’s A3 Ultra demonstrated highly competitive performance, achieving results comparable to NVIDIA’s peak GPU submissions across LLM, mixture-of-experts (MoE), image, and recommendation models.
Google Cloud was the only cloud provider to submit results on NVIDIA HGX B200 GPUs, demonstrating the A4 VM’s excellent performance for serving LLMs, including Llama 3.1 405B (a new benchmark introduced in MLPerf Inference v5.0). A3 Ultra and A4 VMs both deliver powerful inference performance, a testament to our deep partnership with NVIDIA to provide infrastructure for the most demanding AI workloads.
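To put those memory figures in context, here is a back-of-the-envelope sketch of what serving Llama 3.1 405B demands and why it fits on a single 8-GPU node. The model-shape values (126 layers, 8 KV heads, head dimension 128) come from Llama 3.1’s published configuration, and the per-GPU HBM capacities (141 GB for H200, 192 GB for B200) are NVIDIA’s published specs; the FP8 precision and in-flight token budget are illustrative assumptions, not details of our MLPerf submission.

```python
GIB = 1024**3

def weights_gib(params: float, bytes_per_param: float) -> float:
    """Memory for model weights at a given precision."""
    return params * bytes_per_param / GIB

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: float) -> float:
    """KV cache: one K and one V vector per layer, per cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / GIB

# Published Llama 3.1 405B shape; FP8 (1-byte) weights and KV are assumptions.
w = weights_gib(405e9, 1.0)                            # ~377 GiB of weights
kv = kv_cache_gib(layers=126, kv_heads=8, head_dim=128,
                  tokens=128_000, bytes_per_elem=1.0)  # ~31 GiB for 128k tokens in flight
print(f"weights ~{w:.0f} GiB + KV cache ~{kv:.0f} GiB = ~{w + kv:.0f} GiB")

# Aggregate HBM per 8-GPU node (vendor-published per-GPU capacities, in GB):
print(f"A3 Ultra (8 x H200 @ 141 GB): {8 * 141} GB")   # 1128 GB
print(f"A4       (8 x B200 @ 192 GB): {8 * 192} GB")   # 1536 GB
```

On these rough numbers, either node type holds the full model in HBM with headroom left for batching and KV cache growth.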
Customers like JetBrains are using Google Cloud GPU instances to accelerate their inference workloads:
“We’ve been using A3 Mega VMs with NVIDIA H100 Tensor Core GPUs on Google Cloud to run LLM inference across multiple regions. Now, we’re excited to start using A4 VMs powered by NVIDIA HGX B200 GPUs, which we expect will further reduce latency and enhance the responsiveness of AI in JetBrains IDEs.” – Vladislav Tankov, Director of AI, JetBrains
AI Hypercomputer is powering the age of AI inference
Google’s innovations in AI inference, from hardware advancements in Google Cloud TPUs and NVIDIA GPUs to software such as JetStream, MaxText, and MaxDiffusion, combine integrated software frameworks with hardware accelerators to enable AI breakthroughs. Learn more about using AI Hypercomputer for inference. Then, check out these JetStream and MaxDiffusion recipes to get started today.
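For a flavor of the core technique these inference stacks build on, here is a minimal, self-contained JAX sketch of autoregressive decoding with a preallocated KV cache, so each new token reuses cached attention state instead of recomputing attention over the whole prefix. This illustrates the general idea only; it is not JetStream’s or MaxText’s actual API, and all names and dimensions are toy values.

```python
import jax
import jax.numpy as jnp

# Toy dimensions; real servers take these from the model config.
VOCAB, D_MODEL, N_HEADS, HEAD_DIM, MAX_LEN = 256, 64, 4, 16, 32

def init_params(key):
    """Random weights for a single toy attention block."""
    k1, k2, k3, k4 = jax.random.split(key, 4)
    s = 0.02
    return {
        "embed": jax.random.normal(k1, (VOCAB, D_MODEL)) * s,
        "wqkv": jax.random.normal(k2, (D_MODEL, 3 * N_HEADS * HEAD_DIM)) * s,
        "wo": jax.random.normal(k3, (N_HEADS * HEAD_DIM, D_MODEL)) * s,
        "unembed": jax.random.normal(k4, (D_MODEL, VOCAB)) * s,
    }

@jax.jit
def decode_step(params, token, pos, kv_cache):
    """One greedy decode step: write K/V for the new token into a
    preallocated cache and attend over all cached positions."""
    k_cache, v_cache = kv_cache
    x = params["embed"][token]                      # (D_MODEL,)
    q, k, v = jnp.split(x @ params["wqkv"], 3)
    q = q.reshape(N_HEADS, HEAD_DIM)
    k = k.reshape(N_HEADS, HEAD_DIM)
    v = v.reshape(N_HEADS, HEAD_DIM)
    # Functional in-place update at position `pos`.
    k_cache = k_cache.at[pos].set(k)
    v_cache = v_cache.at[pos].set(v)
    # Score against every cached position; mask out unwritten future slots.
    scores = jnp.einsum("hd,thd->ht", q, k_cache) / jnp.sqrt(HEAD_DIM)
    mask = jnp.arange(MAX_LEN) <= pos
    scores = jnp.where(mask[None, :], scores, -1e30)
    attn = jax.nn.softmax(scores, axis=-1)
    out = jnp.einsum("ht,thd->hd", attn, v_cache).reshape(-1)
    logits = (x + out @ params["wo"]) @ params["unembed"]
    return jnp.argmax(logits), (k_cache, v_cache)

params = init_params(jax.random.PRNGKey(0))
kv = (jnp.zeros((MAX_LEN, N_HEADS, HEAD_DIM)),
      jnp.zeros((MAX_LEN, N_HEADS, HEAD_DIM)))
token = jnp.array(1)
for pos in range(8):                                # greedy decode loop
    token, kv = decode_step(params, token, jnp.array(pos), kv)
    print(int(token), end=" ")
```

The payoff is that each step does a fixed amount of work per new token, which is what makes high-throughput, low-latency serving on the accelerators above practical.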
Source Credit: https://cloud.google.com/blog/products/compute/ai-hypercomputer-inference-updates-for-google-cloud-tpu-and-gpu/