
Today, we’re making inference even easier and more cost-effective by making vLLM fully scalable with Kubernetes-native distributed and disaggregated inference. This new project is called llm-d. Google Cloud is a founding contributor alongside Red Hat, IBM Research, NVIDIA, and CoreWeave, joined by other industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Google has a long history of founding and contributing to key open-source projects that have shaped the cloud, such as Kubernetes, JAX, and Istio, and is committed to being the best platform for AI development. We believe that making llm-d open source and community-led is the best way to make it widely available, so you can run it everywhere and know that a strong community supports it.
llm-d builds upon vLLM’s highly efficient inference engine, adding Google’s proven technology and extensive experience in securely and cost-effectively serving AI at billion-user scale. llm-d includes three major innovations, each sketched below:

1. Instead of traditional round-robin load balancing, llm-d includes a vLLM-aware inference scheduler, which routes requests to instances with prefix-cache hits and low load, achieving latency SLOs with fewer hardware resources.
2. To serve longer requests with higher throughput and lower latency, llm-d supports disaggregated serving, which handles the prefill and decode stages of LLM inference on independent instances.
3. llm-d introduces a multi-tier KV cache that keeps intermediate values (prefixes) across different storage tiers, improving response time while reducing storage costs.

llm-d works across frameworks (PyTorch today, JAX later this year) and across both GPU and TPU accelerators, to provide choice and flexibility.
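To make the first idea concrete, here is a minimal Python sketch of cache- and load-aware endpoint selection. The `Endpoint` class, the scoring formula, and the `load_penalty` weight are illustrative assumptions, not llm-d’s actual scheduler interface; the point is simply that routing on prefix-cache overlap and in-flight load can beat round-robin.

```python
"""Minimal sketch of cache- and load-aware request scheduling.

Illustrative only: Endpoint, longest_cached_prefix, and the scoring
weights are hypothetical stand-ins, not the llm-d scheduler API.
"""

from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    active_requests: int = 0
    # Token prefixes this instance already holds KV-cache entries for.
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

    def longest_cached_prefix(self, tokens: tuple[int, ...]) -> int:
        """Length (in tokens) of the longest cached prefix matching this request."""
        best = 0
        for prefix in self.cached_prefixes:
            n = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best


def pick_endpoint(endpoints: list[Endpoint], tokens: tuple[int, ...],
                  load_penalty: float = 1.0) -> Endpoint:
    """Prefer instances with long prefix-cache hits and low in-flight load."""
    def score(ep: Endpoint) -> float:
        return ep.longest_cached_prefix(tokens) - load_penalty * ep.active_requests

    return max(endpoints, key=score)


if __name__ == "__main__":
    pool = [
        Endpoint("vllm-0", active_requests=2, cached_prefixes=[(1, 2, 3, 4, 5)]),
        Endpoint("vllm-1", active_requests=0),
    ]
    chosen = pick_endpoint(pool, tokens=(1, 2, 3, 4, 9))
    print(f"route to {chosen.name}")  # prefix hit on vllm-0 outweighs its higher load here
```

A cache hit means the chosen instance can skip recomputing the shared prompt prefix, which is why this kind of routing can hit latency targets with fewer replicas than round-robin.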
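The prefill/decode split can be illustrated with a toy sketch: one function stands in for a compute-bound prefill replica that produces a KV cache for the prompt, and another for a memory-bandwidth-bound decode replica that consumes it token by token. `FakeKVCache` and the arithmetic "model" are placeholders; a real deployment transfers actual attention state between vLLM instances over a fast interconnect.

```python
"""Toy illustration of disaggregated prefill/decode serving.

Not real model code: FakeKVCache and the token loop stand in for the
KV-cache handoff between separate prefill and decode instances.
"""

from dataclasses import dataclass


@dataclass
class FakeKVCache:
    prompt_tokens: list[int]  # tokens whose attention state has already been computed


def prefill_instance(prompt_tokens: list[int]) -> FakeKVCache:
    """Compute-bound stage: process the whole prompt once and emit its KV cache."""
    return FakeKVCache(prompt_tokens=list(prompt_tokens))


def decode_instance(kv: FakeKVCache, max_new_tokens: int) -> list[int]:
    """Bandwidth-bound stage: generate one token at a time, reusing the
    transferred KV cache instead of recomputing the prompt."""
    generated: list[int] = []
    for step in range(max_new_tokens):
        next_token = (sum(kv.prompt_tokens) + sum(generated) + step) % 1000  # stand-in for a forward pass
        generated.append(next_token)
    return generated


if __name__ == "__main__":
    kv = prefill_instance([101, 57, 42, 7])         # runs on a prefill replica
    tokens = decode_instance(kv, max_new_tokens=4)  # runs on a separate decode replica
    print(tokens)
```

Because the two stages have different bottlenecks, separating them lets each pool be scaled and placed on hardware suited to its own workload, which is where the throughput and latency gains come from.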
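The multi-tier cache idea reduces to "look in the fastest tier first, fall back to slower tiers, promote on hit." The sketch below assumes a hypothetical `TieredPrefixCache` whose three in-memory dicts stand in for accelerator memory, host memory, and remote storage; it is not llm-d’s cache API, and `compute_kv` is a placeholder for the prefill work a cache hit avoids.

```python
"""Sketch of a multi-tier prefix (KV) cache lookup, fastest tier first.

Tier names, the promotion policy, and compute_kv are assumptions for
illustration only.
"""


class TieredPrefixCache:
    def __init__(self) -> None:
        # Ordered fastest -> slowest; real tiers would be accelerator memory,
        # host memory, then remote or networked storage.
        self.tiers: list[dict[tuple[int, ...], bytes]] = [{}, {}, {}]

    def get(self, prefix: tuple[int, ...]) -> bytes | None:
        """Search each tier in order; promote a hit into the fastest tier."""
        for level, tier in enumerate(self.tiers):
            if prefix in tier:
                kv = tier[prefix]
                if level > 0:
                    self.tiers[0][prefix] = kv  # promote so the next hit is cheap
                return kv
        return None

    def put(self, prefix: tuple[int, ...], kv: bytes, level: int = 0) -> None:
        self.tiers[level][prefix] = kv


def compute_kv(prefix: tuple[int, ...]) -> bytes:
    """Stand-in for the expensive prefill that a cache hit avoids."""
    return bytes(t % 256 for t in prefix)


if __name__ == "__main__":
    cache = TieredPrefixCache()
    prompt = (12, 7, 99)
    cache.put(prompt, compute_kv(prompt), level=2)  # e.g. spilled to remote storage
    assert cache.get(prompt) is not None            # hit on the slow tier, then promoted
    assert prompt in cache.tiers[0]                 # later hits are served from the fast tier
```

Keeping colder prefixes in cheaper tiers while promoting hot ones toward fast memory is what lets a deployment trade a little extra latency on cold hits for much lower storage cost overall.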
Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/enhancing-vllm-for-distributed-inference-with-llm-d/