
Kubernetes Dynamic Resource Allocation (DRA) is a new way of allocating host-level resources, such as GPU devices, high-performance networking, and other hardware a workload may need, to pods. DRA is generally available and enabled by default in Kubernetes 1.34, and can be enabled as a beta feature in Kubernetes 1.33. DRA replaces device plugins as the mechanism for exposing these resources to workloads.

Enable DRA in the cluster

Make sure the cluster has DRA enabled (Kubernetes 1.34+), or apply the following patch to enable the feature gate. The patch can be applied to all nodes in the cluster via talosctl or via Omni. Note that on Kubernetes 1.33 the beta resource.k8s.io API group must additionally be served, which requires enabling it via the API server's runtime-config flag.
machine:
  kubelet:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
cluster:
  apiServer:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
  controllerManager:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
  scheduler:
    extraArgs:
      feature-gates: DynamicResourceAllocation=true
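Assuming the patch above is saved locally (the filename dra-feature-gates.yaml and the node IPs below are examples, not prescribed names), it can be applied with talosctl and the resulting API availability checked with kubectl:

```shell
# Apply the machine config patch to the target nodes (example node IPs).
talosctl --nodes 10.0.0.2,10.0.0.3 patch machineconfig --patch @dra-feature-gates.yaml

# Once the nodes have reconciled, the DRA API group should be served; expect
# deviceclasses, resourceclaims, resourceclaimtemplates, and resourceslices.
kubectl api-resources --api-group=resource.k8s.io
```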
You should have at least one node in the cluster with NVIDIA hardware, configured according to the NVIDIA GPU (OSS drivers) or NVIDIA GPU (Proprietary drivers) guide.
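As a quick sanity check, nodes set up per those guides with GPU Feature Discovery running are typically labelled; the label below is an assumption and may differ in your cluster:

```shell
# List nodes that GPU Feature Discovery has labelled as having an NVIDIA GPU.
kubectl get nodes -l nvidia.com/gpu.present=true
```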

1. Deploy the NVIDIA DRA plugin via Helm

Use Helm to install the DRA plugin.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create ns nvidia-dra-driver-gpu
kubectl label --overwrite ns nvidia-dra-driver-gpu pod-security.kubernetes.io/enforce=privileged
Next, disable the device plugin component of the GPU Operator, since the DRA plugin will manage GPU allocation instead.
helm upgrade --wait --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local \
  --set devicePlugin.enabled=false
Then install the NVIDIA DRA driver:
helm upgrade --install -n nvidia-dra-driver-gpu \
  --set resources.gpus.enabled=true \
  --set nvidiaDriverRoot=/usr/local \
  --set gpuResourcesEnabledOverride=true \
  nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu
Note: the upstream validation steps currently reference older resource claim API versions; follow the steps below instead to verify that the DRA plugin is running and managing the NVIDIA GPU resources.
Refer to the DRA validation documentation.
Verify that the plugin pods are running:
kubectl -n nvidia-dra-driver-gpu get pods
Verify ResourceSlice objects
kubectl get resourceslices
There should be a ResourceSlice object with the gpu.nvidia.com driver.
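To see which driver each ResourceSlice advertises, a jsonpath query can help (a sketch assuming the resource.k8s.io/v1 schema):

```shell
# Print node name and driver for every ResourceSlice; GPU nodes should
# report the gpu.nvidia.com driver.
kubectl get resourceslices \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.driver}{"\n"}{end}'
```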

2. Deploy test workload

Create a pod whose containers share a single GPU via a ResourceClaimTemplate. Save the following manifest as dra-gpu-share-test.yaml:
---
apiVersion: v1
kind: Namespace
metadata:
  name: dra-gpu-share-test
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
  namespace: dra-gpu-share-test
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: dra-gpu-share-test
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
kubectl apply -f dra-gpu-share-test.yaml
Verify that the pod is running and has access to the GPU resources.
kubectl logs pod -n dra-gpu-share-test --all-containers --prefix
The output is expected to show the same GPU UUID in both containers. Example:
[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
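The allocation can also be inspected on the generated ResourceClaim object (Kubernetes derives the claim name from the pod and claim names; the jsonpath below assumes the resource.k8s.io/v1 schema):

```shell
# Show each claim in the namespace and the device(s) allocated to it.
kubectl -n dra-gpu-share-test get resourceclaims \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[*].device}{"\n"}{end}'
```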
Clean up:
kubectl delete ns dra-gpu-share-test

Notes

The core DRA APIs are generally available as of Kubernetes 1.34, but the surrounding ecosystem is still evolving and details may change. Make sure you consult your hardware vendor’s documentation for up-to-date configuration and deployment guides.