NVIDIA GPU (OSS drivers) - Sidero Documentation

Enabling NVIDIA GPU support on Talos is bound by NVIDIA EULA. The Talos published NVIDIA OSS drivers are bound to a specific Talos release. The extensions versions also needs to be updated when upgrading Talos.

We will be using the following NVIDIA OSS system extensions:

nvidia-open-gpu-kernel-modules
nvidia-container-toolkit

Create the boot assets which includes the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).

Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions. The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.

Proprietary vs OSS Nvidia Driver Support

The NVIDIA Linux GPU Driver contains several kernel modules: nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko, nvidia-drm.ko, and nvidia-peermem.ko. Two “flavors” of these kernel modules are provided, and both are available for use within Talos:

Proprietary, This is the flavor that NVIDIA has historically shipped.
Open, i.e. source-published/OSS, kernel modules that are dual licensed MIT/GPLv2. With every driver release, the source code to the open kernel modules is published on https://github.com/NVIDIA/open-gpu-kernel-modules and a tarball is provided on https://download.nvidia.com/XFree86/.

The choice between Proprietary/OSS may be decided after referencing the Official NVIDIA announcement.

Some hardware may require additional system configuration and will not work with OSS drivers. Grace Blackwell (GB10) devices running with arm64 CPUs like the Nvidia DGX Spark require Nvidia proprietary drivers.

Enabling the NVIDIA OSS modules

Patch Talos machine configuration using the patch gpu-worker-patch.yaml:

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset

Now apply the patch to all Talos nodes in the cluster having NVIDIA GPU’s installed:

talosctl patch mc --patch @gpu-worker-patch.yaml

The NVIDIA modules should be loaded and the system extension should be installed. This can be confirmed by running:

talosctl get modules

which should produce an output similar to below:

NODE       NAMESPACE   TYPE                 ID                     VERSION   STATE
5.0.3   runtime     LoadedKernelModule   nvidia_uvm             1         Live
5.0.3   runtime     LoadedKernelModule   nvidia_drm             1         Live
5.0.3   runtime     LoadedKernelModule   nvidia_modeset         1         Live
5.0.3   runtime     LoadedKernelModule   nvidia                 1         Live

talosctl get extensions

which should produce an output similar to below:

NODE           NAMESPACE   TYPE              ID                                                                           VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0            1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0       1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0

Deploying NVIDIA GPU Operator

Follow the upstream instructions with only passing Helm chart values specific to Talos. Disable the driver and toolkit components of the GPU operator since we have already enabled them as system extensions on Talos. Further custom values may be needed to be passed to the helm chart depending on the cluster configuration.

kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

Note: Make sure to install GPU operator version v26.3.1 or higher

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --wait  --install -n gpu-operator gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set hostPaths.driverInstallDir=/usr/local

Verification

Follow the upstream instructions to verify the GPU operator installation.

Collecting NVIDIA GPU debug data

When debugging NVIDIA GPU issues (for example, NVRM: GPU has fallen off the bus messages in the kernel log), NVIDIA support will often ask for the output of nvidia-bug-report.sh. Talos does not allow direct shell access on the nodes, but you can still generate this report by following the steps below:

Start a debug pod on the affected node:

 kubectl -n kube-system \
    run debug-gpu \
    --rm -it \
    --image=ubuntu \
    --overrides='{
      "spec": {
        "runtimeClassName": "nvidia",
        "hostPID": true,
        "hostNetwork": true,
        "containers": [{
          "name": "debug-gpu",
          "image": "ubuntu",
          "stdin": true,
          "tty": true,
          "securityContext": {
            "privileged": true
          },
          "volumeMounts": [{
            "name": "host-root",
            "mountPath": "/host"
          }]
        }],
        "volumes": [{
          "name": "host-root",
          "hostPath": {
            "path": "/"
          }
        }]
      }
    }' \
    --restart=Never \
    -- /bin/bash

This will drop you into a shell inside a container running on the node. From here, you can install the necessary tools to run nvidia-bug-report.sh and generate the report.

apt update && apt install --no-install-recommends -y \
    dmidecode \
    pciutils \
    usbutils \
    mesa-utils \
    kmod \
    vulkan-tools \
    infiniband-diags \
    acpidump \
    mstflint

Inside the debug container, run nvidia-bug-report.sh:

/host/usr/local/bin/nvidia-bug-report.sh

This will generate nvidia-bug-report.log.gz in the current directory.

To copy the report of the cluster:

From your local machine, run the following command to copy the report from the debug pod to your local machine:

kubectl cp \
"kube-system/debug-gpu:/nvidia-bug-report.log.gz" \
./nvidia-bug-report.log.gz

You can now upload nvidia-bug-report.log.gz to NVIDIA support.

​Proprietary vs OSS Nvidia Driver Support

​Enabling the NVIDIA OSS modules

​Deploying NVIDIA GPU Operator

​Verification

​Collecting NVIDIA GPU debug data

Proprietary vs OSS Nvidia Driver Support

Enabling the NVIDIA OSS modules

Deploying NVIDIA GPU Operator

Verification

Collecting NVIDIA GPU debug data