
For more than a decade, TPUs have powered Google's most demanding AI training and serving workloads, and customer demand for Cloud TPUs is just as strong. When running advanced AI workloads, you need to be able to monitor and optimize the efficiency of your training and inference jobs, and to swiftly diagnose performance bottlenecks, node degradation, and host interruptions. Ultimately, you need real-time optimization logic built into your training and inference pipelines so you can maximize the efficiency of your applications, whether you're optimizing for ML Goodput, operational cost, or time-to-market.
Today, we're thrilled to introduce the monitoring library for Google Cloud TPUs: a set of observability and diagnostic tools that provides granular, integrated insights into performance and accelerator utilization so you can continuously assess and improve the efficiency of your Cloud TPU workloads.
Note: If you have shell access to the TPU VM and just need some diagnostic information (e.g., to observe memory usage for a running process), you can use tpu-info, a command-line tool for viewing TPU metrics.
Unlocking dynamic optimization: Key metrics in action
The monitoring library provides snapshot-mode access to a rich set of metrics, such as Tensor core utilization, high-bandwidth memory (HBM) usage, and buffer transfer latency. Metrics are sampled every second (1 Hz) for consistency. See the documentation for a full list of metrics and how you can use them.
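Below is a minimal sketch of taking such a snapshot from Python. It assumes the monitoring library is exposed through LibTPU's tpumonitoring module with list_supported_metrics() and get_metric() entry points; check the documentation for the exact import path and names available in your libtpu build.

```python
# Minimal sketch: discover supported metrics and take a one-off snapshot.
# Assumption: the monitoring library is importable as libtpu.sdk.tpumonitoring
# and exposes list_supported_metrics() / get_metric(); verify against the docs.
from libtpu.sdk import tpumonitoring

# Enumerate the metrics available in this libtpu build.
print(tpumonitoring.list_supported_metrics())

# Snapshot a single metric; values reflect the most recent 1 Hz sample.
duty_cycle = tpumonitoring.get_metric(metric_name="duty_cycle_pct")
print(duty_cycle.description())  # human-readable description of the metric
print(duty_cycle.data())         # per-accelerator utilization values
```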
You can use these metrics directly in your code to dynamically optimize for greater efficiency. For instance, if your duty_cycle_pct (a measure of utilization) is consistently low, you can programmatically adjust your data pipeline or increase the batch size to better saturate the Tensor core. If hbm_capacity_usage approaches its limit, your code could trigger a dynamic reduction in model size or activate memory-saving strategies to avoid out-of-memory errors. Similarly, hlo_exec_timing (how long operations take to execute on the accelerator) and hlo_queue_size (how many operations are waiting to be executed) can inform runtime adjustments to communication patterns or workload distribution based on observed bottlenecks.
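To make that concrete, here is a hedged sketch of a feedback loop that polls two of these metrics during training and reacts to them. The tpumonitoring import path follows the same assumption as above, and train_step, adjust_batch_size, and enable_memory_savings are hypothetical hooks standing in for your own pipeline logic, not part of the library.

```python
# Sketch: feed metric snapshots back into training-loop decisions.
# Assumptions: libtpu.sdk.tpumonitoring exposes get_metric(), and .data()
# returns per-accelerator values; train_step, adjust_batch_size, and
# enable_memory_savings are placeholders for your own pipeline hooks.
from libtpu.sdk import tpumonitoring

DUTY_CYCLE_FLOOR_PCT = 50.0  # below this, the Tensor cores look underfed
HBM_USAGE_CEILING = 0.9      # illustrative; match the metric's actual units/scale


def read_max(metric_name: str) -> float:
    """Snapshot a metric and return the largest per-accelerator value."""
    return max(tpumonitoring.get_metric(metric_name=metric_name).data())


def training_loop(train_step, adjust_batch_size, enable_memory_savings,
                  steps=1000, check_every=50):
    for step in range(steps):
        train_step()
        if step % check_every:
            continue  # poll only periodically; samples refresh at 1 Hz anyway

        # Consistently low utilization: the input pipeline or batch size is
        # likely the bottleneck, so feed the Tensor cores more work.
        if read_max("duty_cycle_pct") < DUTY_CYCLE_FLOOR_PCT:
            adjust_batch_size(factor=1.25)

        # HBM approaching its limit: activate memory-saving strategies
        # (e.g., rematerialization) before the job runs out of memory.
        if read_max("hbm_capacity_usage") > HBM_USAGE_CEILING:
            enable_memory_savings()
```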
Let’s see how to set up the library with a couple of realistic examples.