
Boost GPU training efficiency with proven recipes
Beyond setup, maximizing useful compute time (“ML Goodput”) during training is crucial, especially at scale. Wasted cycles due to job failures can significantly inflate costs and delay results. To help, we provide techniques and ready-to-use recipes to tackle these challenges.
Techniques like asynchronous and multi-tier checkpointing increase checkpoint frequency without slowing down training and speed up save/restore operations. AI Hypercomputer can automatically handle interruptions, choosing intelligently among resets, hot-swaps, and scaling actions. Our ML Goodput recipe, created in partnership with NVIDIA, bundles these techniques, integrating NVIDIA NeMo and the NVIDIA Resiliency Extension (NVRx) for a comprehensive solution to boost the efficiency and reliability of your PyTorch training on Google Cloud.
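To make the core idea concrete, here is a minimal, illustrative sketch of asynchronous checkpointing using PyTorch's torch.distributed.checkpoint.async_save. It is not the ML Goodput recipe itself (which builds on NeMo and NVRx), and the checkpoint directory and save interval are placeholders.

```python
# Illustrative sketch of asynchronous checkpointing with PyTorch's
# torch.distributed.checkpoint -- not the ML Goodput recipe itself, which
# builds on NeMo and NVRx. The checkpoint directory and interval are placeholders.
import torch
import torch.nn.functional as F
import torch.distributed.checkpoint as dcp

def train(model, optimizer, dataloader, ckpt_dir="/tmp/ckpts", save_every=500):
    # Assumes torch.distributed has been initialized, as in any multi-GPU job.
    save_future = None
    for step, (inputs, targets) in enumerate(dataloader):
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % save_every == 0:
            if save_future is not None:
                save_future.result()  # make sure the previous save finished
            state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
            # async_save stages the state and returns immediately, so training
            # continues while the checkpoint is written in the background.
            save_future = dcp.async_save(state, checkpoint_id=f"{ckpt_dir}/step_{step}")
```

This is the pattern that lets you raise checkpoint frequency, and therefore cut the work lost to a failure, without stalling the training loop.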
We also added optimized recipes (complete with checkpointing) for benchmarking training performance across different storage options such as Google Cloud Storage and Parallelstore. Lastly, we added recipes for our A4 instances, which are accelerated by NVIDIA Blackwell GPUs. These training recipes cover sparse and dense model training on up to 512 Blackwell GPUs with PyTorch and JAX.
Cutting-edge JAX LLM development with MaxText
For developers who use JAX for LLMs on Google Cloud, MaxText provides advanced training, tuning, and serving on both TPUs and GPUs. Recently, we added support for key fine-tuning techniques like Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO), alongside resilient training capabilities such as suspend-resume and elastic training. MaxText leverages JAX optimizations and pipeline parallelism techniques that we developed in collaboration with NVIDIA to improve training efficiency across tens of thousands of NVIDIA GPUs. We also added support and recipes for the latest open models: Gemma 3, Llama 4 (Scout and Maverick) training and inference, and DeepSeek v3 training and inference.
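As a quick illustration of what DPO optimizes, here is a minimal JAX sketch of the standard DPO objective. It is not MaxText's implementation; the log-probability inputs and the beta value are assumptions for illustration.

```python
# Minimal JAX sketch of the standard DPO objective -- illustrative only,
# not MaxText's implementation. Inputs are per-example sequence log-probs
# under the policy being tuned and under a frozen reference model.
import jax.numpy as jnp
from jax.nn import log_sigmoid

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and rejected responses.
    return -jnp.mean(log_sigmoid(chosen_rewards - rejected_rewards))
```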
To help you get the best performance with Trillium TPU, we added microbenchmarking recipes including matrix multiplication, collective compute, and high-bandwidth memory (HBM) tests scaling up to multiple slices with hundreds of accelerators. These metrics are particularly useful for performance optimization. For production workloads on GKE, be sure to take a look at automatic application monitoring.
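For a sense of what such a microbenchmark measures, the snippet below times a jitted matrix multiplication in JAX and derives a rough TFLOP/s figure. It is not one of the published recipes; the matrix size, dtype, and iteration count are placeholders.

```python
# Rough matmul microbenchmark in JAX -- not one of the published recipes.
# Matrix size, dtype, and iteration count are placeholders.
import time
import jax
import jax.numpy as jnp

def matmul_benchmark(n=8192, dtype=jnp.bfloat16, iters=50):
    key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(key_a, (n, n), dtype)
    b = jax.random.normal(key_b, (n, n), dtype)
    matmul = jax.jit(lambda x, y: x @ y)
    matmul(a, b).block_until_ready()            # compile and warm up once
    start = time.perf_counter()
    for _ in range(iters):
        out = matmul(a, b)
    out.block_until_ready()                     # wait for async dispatch to drain
    elapsed = (time.perf_counter() - start) / iters
    tflops = 2 * n**3 / elapsed / 1e12          # ~2*N^3 FLOPs per N x N matmul
    print(f"avg {elapsed * 1e3:.2f} ms/iter, ~{tflops:.1f} TFLOP/s")

matmul_benchmark()
```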
Harness PyTorch on TPU with PyTorch/XLA 2.7 and torchprime
We’re committed to providing an integrated, high-performance experience for PyTorch users on TPUs. To that end, the recently released PyTorch/XLA 2.7 includes notable performance improvements, particularly benefiting users running vLLM on TPU for inference. This version also adds an important new interoperability capability: you can now call JAX functions directly from within your PyTorch/XLA code.
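Here is a minimal sketch of that interop, assuming the call_jax hook in torch_xla.core.xla_builder; confirm the exact import path and signature against the PyTorch/XLA 2.7 documentation.

```python
# Minimal sketch of calling a JAX function from PyTorch/XLA. The call_jax
# hook and its signature should be confirmed against the PyTorch/XLA 2.7 docs.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.core.xla_builder as xb
import jax.numpy as jnp

def jax_fn(x, y):
    # Any traceable JAX computation can live here.
    return jnp.sin(x) + jnp.cos(y)

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = torch.randn(4, 4, device=device)
# call_jax traces jax_fn and stitches the result into the PyTorch/XLA graph.
out = xb.call_jax(jax_fn, (x, y))
```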
To help you harness the power of PyTorch/XLA on TPUs, we also introduced torchprime, a reference implementation for training PyTorch models on TPUs. Torchprime is designed to showcase best practices for large-scale, high-performance model training, making it a great starting point for your PyTorch/XLA development journey.
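For orientation, a bare-bones PyTorch/XLA training step has the shape sketched below; torchprime's reference implementations are where to look for the production-grade version, and the model, data, and hyperparameters here are placeholders.

```python
# Bare-bones PyTorch/XLA training step, for orientation only -- see torchprime
# for production-grade patterns. Model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                     # the attached TPU device
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(8, 1024, device=device)
    loss = model(x).pow(2).mean()            # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    xm.mark_step()                           # cut the lazy graph and execute on TPU
```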
Build cutting-edge recommenders with RecML
While generative AI often captures the spotlight, highly effective recommender systems remain a cornerstone of many applications, and TPUs offer unique advantages for training them at scale. Deep-learning recommender models frequently rely on massive embedding tables to represent users, items, and their features, and processing these embeddings efficiently is crucial. This is where TPUs shine, particularly with SparseCore, a specialized integrated dataflow processor. SparseCore is purpose-built to accelerate the lookup and processing of the vast, sparse embeddings that are typical in recommenders, dramatically speeding up training compared to alternatives.
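At its core, that workload is a gather of rows from a huge embedding table followed by per-example pooling. The sketch below shows that pattern in plain JAX; it is not the SparseCore or RecML API, and the table size, feature IDs, and batch layout are invented for illustration.

```python
# Plain-JAX sketch of the sparse embedding lookup + pooling pattern that
# SparseCore accelerates. Not the SparseCore/RecML API; the table size,
# feature IDs, and batch layout are invented for illustration.
import jax
import jax.numpy as jnp

vocab_size, embed_dim = 50_000, 128          # real recommender tables are far larger
table = jax.random.normal(jax.random.PRNGKey(0), (vocab_size, embed_dim))

# A flattened batch of sparse feature IDs plus the example each ID belongs to.
feature_ids = jnp.array([12, 49_321, 42, 7, 31_337])
segment_ids = jnp.array([0, 0, 1, 1, 1])     # first two IDs -> example 0, rest -> example 1

@jax.jit
def pooled_embeddings(table, feature_ids, segment_ids):
    rows = jnp.take(table, feature_ids, axis=0)                   # gather embedding rows
    return jax.ops.segment_sum(rows, segment_ids, num_segments=2) # sum-pool per example

print(pooled_embeddings(table, feature_ids, segment_ids).shape)   # (2, 128)
```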
To help you leverage this power, we now offer RecML: an easy-to-use, high-performance, large-scale deep-learning recommender system library optimized for TPUs. It provides reference implementations for training state-of-the-art recommender models such as BERT4Rec, Mamba4Rec, SASRec, and HSTU. RecML uses SparseCore to maximize performance, making it easy for you to efficiently utilize the TPU hardware for faster training and scaling of your recommender models.
Build with us!
Improving the AI developer experience on Google Cloud is an ongoing mission. From scaling your interactive experiments with Pathways, to pinpointing bottlenecks with Xprof, to getting started faster with optimized containers and framework recipes, these AI Hypercomputer improvements help remove friction so you can innovate faster. They also build on the other AI Hypercomputer innovations we announced at Google Cloud Next 25.
Source Credit: https://cloud.google.com/blog/products/compute/ai-hypercomputer-enhancements-for-the-developer/