Kubernetes Dynamic Resource Allocation (DRA) is a new way of sharing host-level resources with pods.
These resources include GPU devices, high-performance networking, and other hardware that a workload may need access to.
DRA is generally available and enabled by default in Kubernetes 1.34, and can also be enabled as a beta feature in Kubernetes 1.33.
DRA replaces the device plugin framework as the way workloads access these resources.
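To confirm which Kubernetes version your cluster is running, and therefore whether DRA is available by default, query the server version:
kubectl version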
Enable DRA in the cluster
Make sure the cluster has DRA enabled (Kubernetes 1.34+), or apply the following patch to enable the feature gates. The patch can be applied to all nodes in the cluster via talosctl or via Omni.
machine:
  kubelet:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
cluster:
  apiServer:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
  controllerManager:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
  scheduler:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
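As a sketch, assuming the patch above is saved as dra-patch.yaml and 10.5.0.2 stands in for one of your node IPs (both names are placeholders), the patch can be applied with talosctl:
talosctl patch machineconfig --nodes 10.5.0.2 --patch @dra-patch.yaml
Repeat for each node in the cluster, or apply the patch cluster-wide via Omni.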
You should have at least one node in the cluster with NVIDIA hardware, configured according to the NVIDIA GPU (OSS drivers) or NVIDIA GPU (Proprietary drivers) guide.
1. Deploy the NVIDIA DRA plugin via Helm
Use Helm to install the DRA plugin.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create ns nvidia-dra-driver-gpu
kubectl label --overwrite ns nvidia-dra-driver-gpu pod-security.kubernetes.io/enforce=privileged
First, disable the device plugin component of the GPU Operator, since the DRA plugin will be used instead:
helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set hostPaths.driverInstallDir=/usr/local \
--set devicePlugin.enabled=false
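Optionally, before installing the DRA driver, confirm the GPU Operator pods come up (this check is an addition, not part of the original steps):
kubectl -n gpu-operator get pods
Then install the DRA driver: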
helm upgrade --install -n nvidia-dra-driver-gpu \
--set resources.gpus.enabled=true \
--set nvidiaDriverRoot=/usr/local \
--set gpuResourcesEnabledOverride=true \
nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu
Note: the DRA validation documentation currently references older versions of the resource claim API; follow the steps below to verify that the DRA plugin is running and managing the NVIDIA GPU resources.
Refer to the DRA validation documentation for background.
kubectl -n nvidia-dra-driver-gpu get pods
Verify ResourceSlice objects
kubectl get resourceslices
There should be a ResourceSlice object with the gpu.nvidia.com driver.
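As a more targeted check (a sketch, assuming the ResourceSlice schema exposes the driver name under spec.driver, which is the case for the resource.k8s.io/v1 API), you can list the driver advertised by each slice:
kubectl get resourceslices -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.driver}{"\n"}{end}'
Each GPU node should contribute at least one slice with the gpu.nvidia.com driver.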
2. Deploy test workload
Save the following manifest as dra-gpu-share-test.yaml; it creates a pod that consumes the ResourceSlice via a ResourceClaimTemplate.
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
  namespace: dra-gpu-share-test
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: dra-gpu-share-test
  labels:
    app: pod
spec:
  containers:
    - name: ctr0
      image: ubuntu:22.04
      command: ["bash", "-c"]
      args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
      resources:
        claims:
          - name: shared-gpu
    - name: ctr1
      image: ubuntu:22.04
      command: ["bash", "-c"]
      args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
      resources:
        claims:
          - name: shared-gpu
  resourceClaims:
    - name: shared-gpu
      resourceClaimTemplateName: single-gpu
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
kubectl apply -f dra-gpu-share-test.yaml
Verify that the pod is running and has access to the GPU resources.
kubectl logs pod -n dra-gpu-share-test --all-containers --prefix
The output should show the same GPU UUID in both containers, confirming that the single claimed GPU is shared between them. Example:
[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
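You can also inspect the ResourceClaim that Kubernetes generated from the template for this pod (the claim name is auto-generated, so the exact output will differ):
kubectl get resourceclaims -n dra-gpu-share-test
The claim should show an allocated state, bound to the GPU node.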
Clean up
kubectl delete ns dra-gpu-share-test
Notes
DRA only recently graduated from beta, and the APIs and driver ecosystem are still evolving, so expect changes in the future.
Make sure you consult your hardware vendor’s documentation for up-to-date configuration and deployment guides.