The complexity of the infrastructure behind AI training and high performance computing (HPC) workloads can really slow teams down. At Google Cloud, where we work with some of the world’s largest AI research teams, we see it everywhere we go: researchers hampered by complex configuration files, platform teams struggling to manage GPUs with home-grown scripts, and operational leads battling the constant, unpredictable hardware failures that derail multi-week training runs. Access to raw compute isn’t enough. To operate at the cutting edge, you need reliability that survives hardware failures, orchestration that respects topology, and a lifecycle management strategy that adapts to evolving needs.
Today, we are delivering on those requirements with the General Availability (GA) of Cluster Director and the Preview of Cluster Director support for Slurm on Google Kubernetes Engine (GKE).
- Cluster Director (GA) is a managed infrastructure service designed to meet the rigorous demands of modern supercomputing. It replaces fragile DIY tooling with a robust topology-aware control plane that handles the entire lifecycle of Slurm clusters, from the first deployment to the thousandth training run.
- We are expanding Cluster Director to support Slurm on GKE (Preview), designed to give you the best of both worlds: the familiar precision of high-performance scheduling and the automated scale of Kubernetes. It achieves this by treating GKE node pools as a direct compute resource for your Slurm cluster, allowing you to scale your workloads with Kubernetes’ power without changing your existing Slurm workflows.
Cluster Director, now GA
Cluster Director offers advanced capabilities at each phase of the cluster lifecycle, spanning preparation (Day 0), where infrastructure design and capacity are determined; deployment (Day 1), where the cluster is automatically deployed and configured; and monitoring (Day 2), where performance, health, and optimization are continuously tracked.
This holistic approach ensures that you get the benefits of fully configurable infrastructure while automating lower-level operations so your compute resources are always optimized, reliable, and available.
So, what does all this cost? That’s the best part. There’s no extra charge to use Cluster Director. You only pay for the underlying Google Cloud resources — your compute, storage, and networking.
How Cluster Director supports each phase of deployment
Day 0: Preparation
Standing up a cluster typically involves weeks of planning, wrangling Terraform, and debugging the network. Cluster Director changes the ‘Day 0’ experience entirely, with tools for designing infrastructure topology that’s optimized for your workload requirements.
Source Credit: https://cloud.google.com/blog/products/compute/cluster-director-is-now-generally-available/
