Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA.
The NVIDIA OSS drivers published for Talos are bound to a specific Talos release, so the extension versions also need to be updated when upgrading Talos.
We will be using the following NVIDIA OSS system extensions:
- nvidia-open-gpu-kernel-modules
- nvidia-container-toolkit
Create the boot assets which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions.
The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
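For illustration, if you build your boot assets through the Talos Image Factory, a schematic including these extensions might look like the sketch below (the siderolabs/... extension names shown are the commonly published ones; confirm the exact names and versions for your Talos release):

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nvidia-open-gpu-kernel-modules
      - siderolabs/nvidia-container-toolkit

Posting the schematic to the factory (for example with curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics) returns a schematic ID that can then be used for the installer image and boot assets.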
Proprietary vs OSS NVIDIA Driver Support
The NVIDIA Linux GPU Driver contains several kernel modules: nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko, nvidia-drm.ko, and nvidia-peermem.ko.
Two “flavors” of these kernel modules are provided, and both are available for use within Talos: the proprietary modules and the open-source (OSS) modules.
The choice between the proprietary and OSS flavors may be decided after referencing the official NVIDIA announcement.
Enabling the NVIDIA OSS modules
Patch the Talos machine configuration using the patch gpu-worker-patch.yaml:
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
talosctl patch mc --patch @gpu-worker-patch.yaml
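To target only specific nodes, pass their addresses with the --nodes/-n flag (the IPs below are placeholders):
talosctl -n <node-ip-1>,<node-ip-2> patch mc --patch @gpu-worker-patch.yaml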
The NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by inspecting the loaded kernel modules and the installed system extensions on the node.
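On recent Talos releases the loaded kernel modules are exposed as LoadedKernelModule resources and can be listed with (the exact subcommand may vary between releases):
talosctl get modules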
which should produce an output similar to below:
NODE NAMESPACE TYPE ID VERSION STATE
10.5.0.3 runtime LoadedKernelModule nvidia_uvm 1 Live
10.5.0.3 runtime LoadedKernelModule nvidia_drm 1 Live
10.5.0.3 runtime LoadedKernelModule nvidia_modeset 1 Live
10.5.0.3 runtime LoadedKernelModule nvidia 1 Live
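The installed system extensions can similarly be listed with:
talosctl get extensions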
which should produce an output similar to below:
NODE NAMESPACE TYPE ID VERSION NAME VERSION
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0 1 nvidia-container-toolkit 515.65.01-v1.10.0
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0 1 nvidia-open-gpu-kernel-modules 515.65.01-v1.2.0
Deploying NVIDIA device plugin
First, we need to create the RuntimeClass.
Apply the following manifest to create a runtime class that uses the extension:
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
Install the NVIDIA device plugin:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.13.0 --set=runtimeClassName=nvidia
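To confirm the device plugin is advertising GPUs to the scheduler, one quick check (with <node-name> as a placeholder) is to look for the nvidia.com/gpu resource on the node:
kubectl describe node <node-name> | grep nvidia.com/gpu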
(Optional) Setting the default runtime class to nvidia
Do note that this will set the default runtime class to nvidia for all pods scheduled on the node.
Create a patch file nvidia-default-runtimeclass.yaml to update the machine config, similar to the example below:
- op: add
  path: /machine/files
  value:
    - content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
talosctl patch mc --patch @nvidia-default-runtimeclass.yaml
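To verify the file was written, you can read it back from a node (with <node-ip> as a placeholder):
talosctl -n <node-ip> read /etc/cri/conf.d/20-customization.part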
Testing the runtime class
Run the following command to test the runtime class; note that spec.runtimeClassName is explicitly set to nvidia in the pod spec:
kubectl run \
nvidia-test \
--restart=Never \
-ti --rm \
--image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
--overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
nvidia-smi
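Beyond this smoke test, real workloads request GPUs through the device plugin's nvidia.com/gpu extended resource. A minimal illustrative pod spec (the pod name is arbitrary; the image is the same CUDA sample used above) might look like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1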
Collecting NVIDIA GPU debug data
When debugging NVIDIA GPU issues (for example, NVRM: GPU has fallen off the bus messages in the kernel log), NVIDIA support will often ask for the output of nvidia-bug-report.sh.
Talos does not allow direct shell access on the nodes, but you can still generate this report by using kubectl debug. To do this:
- Start a debug pod on the affected node:
# Replace ${USER} with any unique suffix if needed
kubectl run "node-debugger-${USER}" \
--restart=Never \
--namespace kube-system \
--image nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
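Note that the command above does not by itself pin the pod to a particular node; one way to make sure it lands on the affected node (an illustrative sketch, with <affected-node> as a placeholder for the node name) is to add a nodeName override:
kubectl run "node-debugger-${USER}" \
  --restart=Never \
  --namespace kube-system \
  --overrides '{"spec": {"nodeName": "<affected-node>"}}' \
  --image nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04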
- Then attach to it with the sysadmin debug profile:
kubectl debug "node-debugger-${USER}" \
-it \
--namespace kube-system \
--profile sysadmin
This will drop you into a shell inside a container running on the node.
- Inside the debug container, download the NVIDIA driver bundle and extract nvidia-bug-report.sh by running the following commands:
a. Confirm the driver version Talos is using:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
b. Set the driver version and node architecture in variables:
DRIVER_VERSION=<nvidia-driver-version>
ARCH=<node-architecture>
Replace the placeholders <nvidia-driver-version> and <node-architecture> with your actual values:
<nvidia-driver-version>: The NVIDIA driver version running on your Talos nodes, which you found in step a above.
<node-architecture>: The architecture of the node, e.g. x86_64 or aarch64.
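For example (illustrative values only; use the driver version reported by nvidia-smi on your node):
DRIVER_VERSION=550.90.07
ARCH=x86_64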
c. Download the NVIDIA driver bundle and extract the bug report script:
apt-get update && apt-get install -y curl
curl -O "https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${DRIVER_VERSION}/NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run"
sh NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run --extract-only
cp NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}/nvidia-bug-report.sh /usr/bin/
- Run nvidia-bug-report.sh. This will generate nvidia-bug-report.log.gz in the current directory.
- To copy the report off the cluster:
a. First, find the name of the debug container (if needed):
kubectl get pod "node-debugger-${USER}" \
--namespace kube-system \
-o jsonpath='{.spec.containers[*].name}'
b. Then, from your workstation:
kubectl cp \
"kube-system/node-debugger-${USER}:/nvidia-bug-report.log.gz" \
./nvidia-bug-report.log.gz
You can now upload nvidia-bug-report.log.gz to NVIDIA support.