The chart above shows how quickly the model streamer can fetch a 141GB Llama 3.3 70B model from Cloud Storage compared with the default vLLM model loader (lower is better).
Boost resilience and scalability with fewer cold starts
For an inference server running on Kubernetes, a “cold start” involves several steps: pulling the container image, starting the process, and — most time-consuming of all — loading the model weights into GPU memory. For large models, this loading phase can take many minutes, with painful consequences such as slow auto-scaling and idling GPUs as they wait for the workload to start up.
By streaming the model into GPU memory, the model streamer slashes potentially the most time-consuming part of the startup process. Instead of waiting for an entire model to be downloaded before loading, the streamer fetches model tensors directly from object storage and streams them concurrently to GPU memory. This dramatically reduces model loading times from minutes to seconds.
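The level of concurrency used for this parallel fetch is tunable. As a minimal sketch, assuming a recent vLLM release where the Run:ai streamer accepts extra loader options through --model-loader-extra-config (the model name and concurrency value below are placeholders):

vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency": 16}'

Raising the concurrency lets more tensors be read from object storage in parallel, which helps saturate the available storage and network bandwidth during startup.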
For workloads that rely on model parallelism, where a single model is partitioned and executed across multiple GPUs, the model streamer goes a step further. Its distributed streaming capability is optimized to take full advantage of NVIDIA NVLink, using high-bandwidth GPU-to-GPU communication to coordinate loading across multiple processes. Reading the weights from storage is divided efficiently and evenly across all participating processes, with each one fetching a portion of the model weights from storage and then sharing its segment with the others over NVLink. This allows even multi-GPU deployments to benefit from faster startups and fewer cold-start bottlenecks.
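As a hedged sketch of such a deployment, a tensor-parallel vLLM launch keeps the same load format and simply adds vLLM's --tensor-parallel-size option; the model name and GPU count are placeholders, and whether the NVLink-aware distributed streaming path is used depends on the streamer version running in your cluster:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --load-format runai_streamer \
    --tensor-parallel-size 8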
Performance and simplicity
The latest updates to the Model Streamer introduce first-class support for Cloud Storage, creating an integrated and high-performance experience for Google Cloud users. This integration is designed to be simple, fast, and secure, especially for workloads running on GKE.
For users of popular inference servers like vLLM, enabling the streamer is as simple as adding a single flag to your vLLM command line:
--load-format=runai_streamer
Here’s how easy it is to launch a model stored in a Cloud Storage bucket with vLLM:
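The command below is a representative sketch: the bucket and model paths are hypothetical, and it assumes the GKE pod or VM already has read access to the bucket (for example, through Workload Identity or GOOGLE_APPLICATION_CREDENTIALS):

vllm serve gs://my-models-bucket/llama-3.3-70b-instruct \
    --load-format runai_streamer

From vLLM's perspective this is the same serve command you would use for a local path; the gs:// prefix tells the streamer to fetch the model's safetensors files directly from the Cloud Storage bucket.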
Source Credit: https://cloud.google.com/blog/products/containers-kubernetes/nvidia-runai-model-streamer-supports-cloud-storage/
