
Performance tuning
Out of the box, Cloud Storage FUSE is a convenient way to access your models. But to unlock its full potential for read-heavy inference workloads, you need to tune its caching and prefetching capabilities.
- Parallel downloads: For very large model files, you can enable parallel downloads to accelerate the initial read from Cloud Storage into the local file cache. This is enabled by default when file caching is enabled.
- Metadata caching & prefetching: The first time that you access a file, FUSE needs to get its metadata (like size and permissions) from Cloud Storage. To keep the metadata in memory, you can configure a stat cache. For even better performance, you can enable metadata prefetching, which proactively loads the metadata for all files in a directory when the volume is mounted. You can enable metadata prefetching by setting the metadata-cache:stat-cache-max-size-mb and metadata-cache:ttl-secs options in your mountOptions configuration, as in the sketch after this list.
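As a minimal sketch of how these settings fit together, the following Pod manifest assumes the GKE Cloud Storage FUSE CSI driver is enabled on the cluster; the bucket name and container image are placeholders, and a value of -1 requests an unlimited cache size or TTL:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
  annotations:
    # Injects the Cloud Storage FUSE sidecar container into the pod.
    gke-gcsfuse/volumes: "true"
spec:
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/repo/inference:latest  # placeholder image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-bucket  # placeholder bucket
        # Enable the file cache with parallel downloads, an unbounded stat
        # cache, and an infinite metadata TTL for read-only model artifacts.
        mountOptions: "implicit-dirs,file-cache:max-size-mb:-1,file-cache:enable-parallel-downloads:true,metadata-cache:stat-cache-max-size-mb:-1,metadata-cache:ttl-secs:-1"
```

An infinite TTL is appropriate here only because model artifacts are immutable once published; for buckets whose objects change, set a finite TTL instead.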
For more information, see the Performance tuning best practices in the Cloud Storage documentation. For an example of a GKE Deployment
manifest that mounts a Cloud Storage bucket with performance-tuned FUSE settings, see the sample configuration YAML files.
Advanced storage on GKE
Cloud Storage FUSE offers a direct and convenient way to access model artifacts. GKE also provides specialized, high-performance storage solutions designed to eliminate I/O bottlenecks for the most demanding AI/ML workloads. These options, Google Cloud Managed Lustre and Hyperdisk ML, offer alternatives that can provide high performance and stability by leveraging dedicated parallel file and block storage.
Managed Lustre
For the most extreme performance requirements, Google Cloud Managed Lustre provides a fully managed, parallel file system. Managed Lustre is designed for workloads that demand ultra-low, sub-millisecond latency and massive IOPS, such as HPC simulations and AI training and inference jobs. It’s POSIX-compliant, which ensures compatibility with existing applications and workflows.
This service, powered by DDN’s EXAScaler, scales to multiple PBs and streams data at up to 1 TB/s, making it ideal for large-scale AI jobs that need to feed hungry GPUs or TPUs. It’s intended for high-throughput data access rather than long-term archival storage. Although its primary use case is persistent storage for training data and checkpoints, it can handle millions of small files and random reads with extremely low latency and high throughput. It’s therefore a powerful tool for complex inference pipelines that might need to read or write many intermediate files.
To use Managed Lustre with GKE, you first enable the Managed Lustre CSI driver on your GKE cluster. Then, you define a StorageClass resource that references the driver and a PersistentVolumeClaim request to either dynamically provision a new Lustre instance or connect to an existing one. Finally, you mount the PersistentVolumeClaim as a volume in your pods, which lets them access the high-throughput, low-latency parallel file system.
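The manifests below are a rough sketch of that flow. They assume the Managed Lustre CSI driver (lustre.csi.storage.gke.io) is enabled on the cluster; the network path, throughput value, capacity, and resource names are placeholders, and the exact StorageClass parameters should be taken from the driver's documentation:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lustre-class
provisioner: lustre.csi.storage.gke.io
parameters:
  network: projects/my-project/global/networks/default  # placeholder VPC network
  perUnitStorageThroughput: "1000"  # throughput tier; check the driver docs
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lustre-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: lustre-class
  resources:
    requests:
      storage: 18000Gi  # placeholder capacity
---
apiVersion: v1
kind: Pod
metadata:
  name: lustre-client
spec:
  containers:
  - name: app
    image: us-docker.pkg.dev/my-project/repo/inference:latest  # placeholder image
    volumeMounts:
    - name: shared-data
      mountPath: /mnt/lustre
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: lustre-pvc  # triggers dynamic provisioning of a Lustre instance
```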
Hyperdisk ML
Hyperdisk ML is a network block storage option that’s purpose-built for AI/ML workloads, particularly for accelerating the loading of static data like model weights. Unlike Cloud Storage FUSE, which provides a file system interface to an object store, Hyperdisk ML provides a high-performance block device that can be pre-loaded, or hydrated, with model artifacts from Cloud Storage.
Its standout feature for inference serving is its support for READ_ONLY_MANY access, which allows a single Hyperdisk ML volume to be attached as a read-only device to up to 2,500 GKE nodes concurrently. In this architecture, every pod can access the same centralized, high-performance copy of the model artifact without duplication. You can therefore use it to scale out stateless inference services that need high throughput from relatively small, terabyte-scale volumes. Note that because Hyperdisk ML volumes are read-only once attached this way, each model update requires an operational step, such as creating and hydrating a new volume.
To integrate Hyperdisk ML, you first create a Hyperdisk ML volume and populate it with your model artifacts from Cloud Storage. Then you define a StorageClass resource and a PersistentVolumeClaim request in your GKE cluster to make the volume available to your pods. Finally, you mount the PersistentVolumeClaim as a volume in your Deployment manifest, as in the sketch that follows.
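This sketch assumes you have already created and hydrated a Hyperdisk ML disk whose access mode is set to READ_ONLY_MANY; the project, zone, disk name, and capacity are placeholders:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-ml
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-weights-pv
spec:
  storageClassName: hyperdisk-ml
  capacity:
    storage: 300Gi  # placeholder; must match the disk size
  accessModes:
  - ReadOnlyMany
  csi:
    driver: pd.csi.storage.gke.io
    # Placeholder path to the pre-hydrated Hyperdisk ML disk.
    volumeHandle: projects/my-project/zones/us-central1-a/disks/model-weights-disk
    fsType: ext4
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc
spec:
  storageClassName: hyperdisk-ml
  volumeName: model-weights-pv  # binds to the pre-existing volume above
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 300Gi
```

In your Deployment, reference model-weights-pvc under spec.template.spec.volumes with readOnly: true, the same way you would mount any other PersistentVolumeClaim.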
Serving the artifact on Cloud Run
Cloud Run also supports mounting Cloud Storage buckets as volumes, which makes it a viable platform for serving ML models, especially with the addition of GPU support. You can configure a Cloud Storage volume mount directly in your Cloud Run service definition. This implementation provides a simple and effective way to give your serverless application access to the models that are stored in Cloud Storage.
Here is an example of how to mount a Cloud Storage bucket as a volume in a Cloud Run service by using the gcloud command-line tool:
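The following command is a minimal sketch that updates an existing service; the service name, region, volume name, bucket, and mount path are placeholders:

```
gcloud run services update my-inference-service \
  --region=us-central1 \
  --add-volume=name=model-volume,type=cloud-storage,bucket=my-model-bucket,readonly=true \
  --add-volume-mount=volume=model-volume,mount-path=/models
```

After the update rolls out, the service's containers can read the bucket's objects under /models as ordinary files.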
Source Credit: https://cloud.google.com/blog/topics/developers-practitioners/scalable-ai-starts-with-storage-guide-to-model-artifact-strategies/