
In phase #1, we see a smooth workload creation process in which Pods are created quickly and scheduled with only minor delay. The process takes ~2.5 minutes to create, schedule, and run 65,000 Pods on an empty cluster (with caveats; see the previous section).
In phase #2, we observe a similarly smooth creation process for the training workload, with 50,000 Pods created in under 2 minutes on an empty cluster. We also observe the creation of 15,000 Pods for the inference workload in under a minute on a nearly full cluster, demonstrating fast scheduling even when the cluster is not empty.
During phase #3, we observe the scale-up of the inference workload to 65,000 Pods and the disruption and termination of the training workload. Scheduling the inference Pods suffers some delay compared to phase #2 because they must wait for the training Pods to be evicted from the Nodes. Nonetheless, the entire startup process for the inference workload takes less than four minutes.
After terminating and recreating the training workload, we observe its Pods in a pending state while the higher-priority inference workload occupies all 65,000 Nodes (as seen between 7:20 and 7:25 in the graph, where the dotted blue line representing created training Pods sits at 50,000 and the dotted orange line representing running training Pods sits at 0).
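The blog does not publish the manifests used in the benchmark, but the eviction behavior described above is what Kubernetes PriorityClasses provide. The sketch below (using the Python kubernetes client) shows one way an inference workload could be ranked above training so the scheduler preempts training Pods when inference scales up; the class name, priority value, image, and replica count are illustrative assumptions, not the benchmark's actual configuration.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (e.g. after `gcloud container clusters get-credentials`).
config.load_kube_config()

# A PriorityClass that ranks inference Pods above the training workload, so the
# scheduler evicts training Pods when the inference workload scales up.
# Name and value are assumptions for illustration.
inference_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="inference-high"),
    value=1_000_000,  # higher value = higher scheduling priority
    global_default=False,
    description="Latency-sensitive inference; may preempt training Pods",
)
client.SchedulingV1Api().create_priority_class(body=inference_priority)

# Reference the PriorityClass from the inference Deployment's Pod template.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="inference", namespace="default"),
    spec=client.V1DeploymentSpec(
        replicas=65000,
        selector=client.V1LabelSelector(match_labels={"app": "inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "inference"}),
            spec=client.V1PodSpec(
                priority_class_name="inference-high",
                containers=[
                    # Hypothetical image; substitute your own inference server.
                    client.V1Container(name="server", image="example.com/inference-server:latest"),
                ],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

With this setup, training Pods created without a priority class (or with a lower one) are the first candidates for preemption, which matches the pending training Pods observed in the graph while inference holds the Nodes.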
Cluster performance
We use the metrics collected by Prometheus for information about control-plane performance across the experiment's timeline. For example, the P99 API call latency across various resources shows that all API calls, including write calls, stay under 400 ms, well within the 1s threshold and satisfying the OSS SLO for resource-scoped API calls.
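As one way to reproduce a view like this, the sketch below queries a Prometheus endpoint for P99 apiserver request latency broken down by resource and verb. The endpoint URL and label filter are assumptions, and the PromQL expression is only a rough approximation of the latency SLI, not the exact query behind the benchmark's dashboards.

```python
import requests

# Assumed Prometheus endpoint; adjust for your environment.
PROMETHEUS_URL = "http://localhost:9090"

# P99 API call latency per resource and verb over a 5-minute window, computed
# from the apiserver's request-duration histogram. Filtering out subresource
# calls is an approximation of "resource-scoped" calls.
QUERY = (
    "histogram_quantile(0.99, "
    "sum by (resource, verb, le) ("
    'rate(apiserver_request_duration_seconds_bucket{subresource=""}[5m])))'
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30
)
resp.raise_for_status()

# Print each (verb, resource) pair with its P99 latency in milliseconds.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    p99_seconds = float(series["value"][1])
    print(
        f"{labels.get('verb', '?'):8s} "
        f"{labels.get('resource', '?'):30s} "
        f"p99={p99_seconds * 1000:.0f} ms"
    )
```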
Source Credit: https://cloud.google.com/blog/products/containers-kubernetes/benchmarking-a-65000-node-gke-cluster-with-ai-workloads/