The explosion of large language models (LLMs) has increased demand for high-performance accelerators like GPUs and TPUs. As organizations scale their AI capabilities, the scarcity of compute resources is sometimes the primary bottleneck. Efficiently managing every GPU and TPU cycle is no longer just a recommendation — it’s an operational necessity.
Kubernetes is becoming the de facto platform for running LLMs in the enterprise. This week at KubeCon Europe, NVIDIA donated its Dynamic Resource Allocation (DRA) Driver for GPUs to the Kubernetes community, and Google donated the DRA driver for Tensor Processing Units (TPUs). These donations foster a broader community, accelerate innovation, and help ensure Kubernetes aligns with the modern cloud landscape, improving the portability of AI workloads. DRA is also generally available in Google Kubernetes Engine (GKE). In the rest of this blog, let's take a deeper look at DRA: why it was built, what it accomplishes, and how to use it.
Moving beyond static infrastructure
For years, Kubernetes' Device Plugin framework was the standard way to consume hardware accelerators. However, Device Plugins only allow you to express hardware requirements as simple integers (e.g., gpu: 1), with no fractional GPUs. This is not granular or flexible enough for modern, complex workloads. The Device Plugin framework also requires accelerators to be pre-provisioned in the cluster before pods can be scheduled.
As the new Kubernetes standard for resource management, DRA reached “stable” status in Kubernetes OSS 1.34. DRA represents a paradigm shift in how to handle hardware, moving from static assignments to a flexible, request-based model. This solves several pain points, namely:
- Eliminates manual node pinning: Under the Device Plugin framework, app operators had to manually research which nodes possessed specific hardware and then use nodeSelectors or affinities to ensure their pods landed there. DRA automates this by making the scheduler natively aware of specific hardware capabilities. It finds the right node for the workload based on the request, rather than requiring the user to map out the cluster's topology.
- Offers flexible parameterization: Unlike Device Plugins' "all-or-nothing" approach, DRA allows users to define specific requirements, such as a minimum amount of VRAM, a specific hardware model, or interconnect requirements, through ResourceClaims. This allows for a much more granular and efficient use of expensive hardware.
- Abstracts hardware via DeviceClasses: DRA introduces the DeviceClass, which acts as a "blueprint" for hardware. Platform admins can define classes (e.g., high-memory-gpu or low-latency-fpga) that developers request by name. This decouples the workload's needs from the underlying hardware addresses, allowing the scheduler to match workload requirements to available hardware inventory.
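To make the blueprint idea concrete, here is a minimal sketch of a DeviceClass, assuming the resource.k8s.io/v1 API (stable as of Kubernetes 1.34) and a hypothetical driver named gpu.example.com; the class name, driver name, and memory threshold are all placeholders:

```yaml
# Hypothetical DeviceClass a platform admin might publish.
# "gpu.example.com" is a placeholder, not a real DRA driver.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-memory-gpu
spec:
  # CEL expression evaluated against each advertised device: only
  # devices from this driver with at least 80Gi of memory match.
  selectors:
  - cel:
      expression: >-
        device.driver == "gpu.example.com" &&
        device.capacity["gpu.example.com"].memory.compareTo(quantity("80Gi")) >= 0
```

Developers can then refer to deviceClassName: high-memory-gpu in a ResourceClaim without knowing which nodes host the matching hardware.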
Deep dive: How DRA works
At the heart of DRA are two primary building blocks that separate hardware inventory from workload requirements: ResourceSlice and ResourceClaim. These are the inputs the kube-scheduler uses to make better placement decisions and enable a more flexible resource pool.
ResourceSlice: Describing availability
The ResourceSlice API is how resource drivers publish the capabilities and attributes of the underlying hardware to the cluster. Unlike Device Plugins, which often hide device details behind simple labels, ResourceSlices provide a high-fidelity description of available assets. This allows drivers to report granular details about each device, such as:
- Capacity: Total memory, number of cores, or specialized compute units
- Attributes: Architecture, version, PCIe Root Complex, or NUMA node
ResourceClaim: Defining requirements
The ResourceClaim API allows AI engineers to define exactly what their application needs to run successfully. Because it can reference the details exposed by the ResourceSlice API, developers can move beyond generic requests and specify requirements based on:
- Attribute-based selections: Instead of naming a specific model, a user can request, e.g., "any GPU with at least 40 GB of VRAM."
- Complex constraints: DRA supports inter-device constraints. For example, a high-performance computing job can request a GPU and a NIC with the requirement that both are attached to the same PCIe Root Complex to minimize latency and maximize throughput.
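Putting both ideas together, a sketch of such a claim might look as follows. Assumptions: resource.k8s.io/v1 field names (older beta versions of the API differ slightly), hypothetical class names gpu.example.com and nic.example.com, and a driver that publishes the standardized resource.kubernetes.io/pcieRoot attribute:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-and-nic
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        # Attribute-based selection: any device with >= 40Gi of memory.
        selectors:
        - cel:
            expression: >-
              device.capacity["gpu.example.com"].memory.compareTo(quantity("40Gi")) >= 0
    - name: nic
      exactly:
        deviceClassName: nic.example.com
    # Inter-device constraint: both allocated devices must report the
    # same PCIe Root Complex attribute.
    constraints:
    - requests: ["gpu", "nic"]
      matchAttribute: resource.kubernetes.io/pcieRoot
```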
Smarter scheduling through capabilities
By decoupling the "what" (ResourceClaim) from the "where" (ResourceSlice), DRA shifts the burden of device matching from the user to the kube-scheduler.
Previously, users often had to rely on manual node selectors or taints to land pods on the right hardware. With DRA, the scheduler gains a global view of device attributes and cluster topology. This enables a more “liquid” resource pool: the scheduler can evaluate the specific criteria of a claim against all available slices, optimizing placement based on actual hardware availability rather than static labels.
This capability-based approach ensures that workloads are matched with the most suitable available hardware, improving both resource utilization and application performance.
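To close the loop, a workload consumes a claim by referencing it from the pod spec; the scheduler allocates matching devices before binding the pod. A minimal sketch, where the claim name and container image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  # Reference an existing ResourceClaim by name. Alternatively,
  # resourceClaimTemplateName can be used to stamp out one claim per pod.
  resourceClaims:
  - name: accelerator
    resourceClaimName: my-gpu-claim                # hypothetical claim
  containers:
  - name: server
    image: registry.example.com/llm-server:latest  # placeholder image
    resources:
      claims:
      - name: accelerator   # must match an entry in spec.resourceClaims
```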
Source Credit: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation/
