
Want to save some money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to more than a million dollars in cost savings. Improving ML Goodput is therefore an important goal for model training, both for training efficiency and for model iteration velocity.
However, improving ML Goodput today faces several challenges: frequent interruptions that force restarts from the latest checkpoint, slow inline checkpointing that stalls training, and limited observability that makes failures hard to detect. Together, these issues significantly increase time-to-market (TTM) and cost-to-train. Several industry publications have articulated these issues, e.g., this arXiv paper.
Improving ML Goodput
To improve ML Goodput, you need to minimize the impact of disruptive events on the progress of the training workload. To resume a job quickly, you can automatically scale the job down or replace failed resources with spare capacity; at Google Cloud, we call this elastic training. You can also reduce workload interruptions during checkpointing and, after a failure, load checkpoints faster from the nearest available storage tier. We call these capabilities asynchronous checkpointing and multi-tier checkpointing.
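As a concrete illustration of the checkpointing side, here is a minimal sketch of asynchronous checkpointing using PyTorch's torch.distributed.checkpoint.async_save (available in recent PyTorch releases). The training loop, model setup, checkpoint interval, and local SSD path are placeholder assumptions, and a full multi-tier setup would additionally replicate each local checkpoint to peer nodes and to Cloud Storage in the background:

```python
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# Sketch only: assumes the process group is already initialized by the
# launcher, and that model/optimizer/dataloader are set up elsewhere.
def train_loop(model, optimizer, dataloader,
               ckpt_dir="/local_ssd/ckpts", ckpt_every=100):
    assert dist.is_initialized(), "expects an initialized process group"
    save_future = None

    for step, batch in enumerate(dataloader):
        loss = model(batch).mean()  # placeholder forward/loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % ckpt_every == 0:
            # Wait for the previous async save (if any) so at most one
            # snapshot is in flight at a time.
            if save_future is not None:
                save_future.result()

            state_dict = {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
            }
            # async_save stages the state dict and writes it from a
            # background thread, so training resumes almost immediately
            # instead of blocking on storage I/O.
            save_future = dcp.async_save(
                state_dict, checkpoint_id=f"{ckpt_dir}/step_{step}"
            )
```

The key design point is that the synchronous part of the save is only the in-memory staging; the slow write to durable storage overlaps with subsequent training steps.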
The following figure illustrates how these techniques combine into an end-to-end remediation workflow that improves ML Goodput for training. It depicts an example workload of nine nodes with three-way data parallelism (DP) and three-way pipeline parallelism (PP), along with the remediation actions taken depending on the failures and the available spare capacity.
Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput/
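To make the elastic scale-down in the figure more concrete, here is a minimal sketch of the resume path, assuming an elastic launcher (for example, torchrun with a min:max node range) restarts the job with whatever healthy ranks remain. The fixed pipeline degree, the environment handling, and the helper name build_mesh are illustrative assumptions; the actual remediation decisions are made by the platform, not by this snippet:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

PP_DEGREE = 3  # pipeline-parallel degree kept fixed in this example


def build_mesh():
    """Rebuild the DP x PP layout from whatever world size survives a restart."""
    dist.init_process_group("nccl")
    world_size = dist.get_world_size()

    # After a scale-down (e.g., losing one DP replica), the data-parallel
    # degree is recomputed from the new world size while PP stays fixed.
    assert world_size % PP_DEGREE == 0, "world size must divide the PP degree"
    dp_degree = world_size // PP_DEGREE

    mesh = init_device_mesh(
        "cuda", (dp_degree, PP_DEGREE), mesh_dim_names=("dp", "pp")
    )
    return mesh, dp_degree
```

In this sketch, shrinking from nine to six ranks simply drops one data-parallel replica: the job resumes from the latest checkpoint with dp_degree reduced by one while each remaining replica keeps its full three-stage pipeline.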