# NVIDIA GPU (OSS drivers)

> This guide walks through enabling NVIDIA GPU support on Talos using the OSS drivers.

> Enabling NVIDIA GPU support on Talos is bound by the [NVIDIA EULA](https://www.nvidia.com/en-us/drivers/nvidia-license/).
> The NVIDIA OSS drivers published by Talos are bound to a specific Talos release.
> The extension versions must also be updated when upgrading Talos.

We will be using the following NVIDIA OSS system extensions:

* `nvidia-open-gpu-kernel-modules`
* `nvidia-container-toolkit`

Create the [boot assets](../../platform-specific-installations/boot-assets) that include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).

> Make sure the driver version matches for both the `nvidia-open-gpu-kernel-modules` and `nvidia-container-toolkit` extensions.
> The `nvidia-open-gpu-kernel-modules` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.

## Proprietary vs OSS NVIDIA Driver Support

The NVIDIA Linux GPU Driver contains several kernel modules: `nvidia.ko`, `nvidia-modeset.ko`, `nvidia-uvm.ko`, `nvidia-drm.ko`, and `nvidia-peermem.ko`.
Two "flavors" of these kernel modules are provided, and both are available for use within Talos:

* Proprietary: this is the flavor that NVIDIA has historically shipped.
* Open, i.e. source-published/OSS, kernel modules that are dual licensed MIT/GPLv2.
  With every driver release, the source code to the open kernel modules is published on [https://github.com/NVIDIA/open-gpu-kernel-modules](https://github.com/NVIDIA/open-gpu-kernel-modules) and a tarball is provided on [https://download.nvidia.com/XFree86/](https://download.nvidia.com/XFree86/).

The choice between the proprietary and OSS modules can be made after consulting the official [NVIDIA announcement](https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/).

## Enabling the NVIDIA OSS modules

Patch the Talos machine configuration using the patch `gpu-worker-patch.yaml`:

```yaml theme={null}
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```

Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:

```bash theme={null}
talosctl patch mc --patch @gpu-worker-patch.yaml
```
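
When several nodes carry GPUs, the patch can be applied node by node. The following is a minimal sketch under stated assumptions: the helper name `patch_gpu_nodes`, the `DRY_RUN` switch, and the node addresses are hypothetical and not part of Talos. Only `talosctl patch mc`, `--patch`, and the global `--nodes` flag come from `talosctl` itself. By default the helper only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch: apply a machine config patch to each GPU node in turn.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

patch_gpu_nodes() {
  local patch_file="$1"
  shift
  local node
  for node in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "talosctl patch mc --nodes ${node} --patch @${patch_file}"
    else
      talosctl patch mc --nodes "${node}" --patch "@${patch_file}"
    fi
  done
}

# Hypothetical node addresses; replace them with your own GPU nodes.
patch_gpu_nodes gpu-worker-patch.yaml 10.5.0.3 10.5.0.4
```

Set `DRY_RUN=0` once the printed commands look right.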

Once the patch is applied, the NVIDIA modules should be loaded and the system extensions should be installed.

This can be confirmed by running:

```bash theme={null}
talosctl get modules
```

which should produce an output similar to below:

```text theme={null}
NODE       NAMESPACE   TYPE                 ID                     VERSION   STATE
10.5.0.3   runtime     LoadedKernelModule   nvidia_uvm             1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia_drm             1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia_modeset         1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia                 1         Live
```
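
The check against the module list can also be scripted. This is a rough sketch, assuming the table layout shown above (TYPE in the third column, ID in the fourth) and output from a single node; the function name `missing_modules` is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: report which modules from the patch are not listed as loaded.
REQUIRED_MODULES="nvidia nvidia_uvm nvidia_drm nvidia_modeset"

# Reads `talosctl get modules` table output on stdin and prints one line
# for each required module that does not appear in it.
missing_modules() {
  local loaded module
  # Column 3 is TYPE and column 4 is ID in the table output.
  loaded="$(awk '$3 == "LoadedKernelModule" { print $4 }')"
  for module in $REQUIRED_MODULES; do
    printf '%s\n' "$loaded" | grep -qxF "$module" || echo "missing: $module"
  done
}

# Demo with sample output in which nvidia_drm is absent.
missing_modules <<'EOF'
NODE       NAMESPACE   TYPE                 ID               VERSION   STATE
10.5.0.3   runtime     LoadedKernelModule   nvidia_uvm       1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia_modeset   1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia           1         Live
EOF
```

Against a real node, the same function can be fed directly: `talosctl get modules | missing_modules`. No output means all four modules are listed.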

To list the installed extensions, run:

```bash theme={null}
talosctl get extensions
```

which should produce an output similar to below:

```text theme={null}
NODE           NAMESPACE   TYPE              ID                                                                           VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0            1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0       1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0
```
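
Because both extensions are versioned as `<nvidia-driver-version>-<suffix>`, the driver part can be compared mechanically. The sketch below assumes the driver version itself contains no hyphen; the helper names `driver_version` and `drivers_match` are hypothetical, and the two version strings are taken from the sample output above.

```bash
#!/usr/bin/env bash
# Sketch: compare the driver version prefix of the two extension versions.

# Print everything before the first "-" of an extension version string.
driver_version() {
  printf '%s\n' "${1%%-*}"
}

# Succeed when both extension versions share the same driver version.
drivers_match() {
  [ "$(driver_version "$1")" = "$(driver_version "$2")" ]
}

kernel_modules_version="515.65.01-v1.2.0"
toolkit_version="515.65.01-v1.10.0"

if drivers_match "$kernel_modules_version" "$toolkit_version"; then
  echo "driver versions match: $(driver_version "$kernel_modules_version")"
else
  echo "driver version mismatch" >&2
fi
```

With the sample values this prints `driver versions match: 515.65.01`.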

## Deploying NVIDIA device plugin

First, create the `RuntimeClass`.
Apply the following manifest to create a runtime class that uses the extension:

```yaml theme={null}
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

Install the NVIDIA device plugin:

```bash theme={null}
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.13.0 --set=runtimeClassName=nvidia
```
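
Once the device plugin pods are running, GPU nodes should advertise the `nvidia.com/gpu` resource. One way to check this is sketched below; the function name `list_node_gpus` is hypothetical, while the `custom-columns` output format is a standard `kubectl` feature.

```bash
#!/usr/bin/env bash
# Sketch: list how many GPUs each node advertises as allocatable.
# The dot in the resource name nvidia.com/gpu is escaped so that
# kubectl does not treat it as a field path separator.
list_node_gpus() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Calling `list_node_gpus` prints one row per node. Nodes without the resource show `<none>` in the `GPUS` column.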

## (Optional) Setting the default runtime class as `nvidia`

> Do note that this will set the default runtime class to `nvidia` for all pods scheduled on the node.

Create a patch file `nvidia-default-runtimeclass.yaml` that updates the machine config, similar to the following:

```yaml theme={null}
- op: add
  path: /machine/files
  value:
    - content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create
```

Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:

```bash theme={null}
talosctl patch mc --patch @nvidia-default-runtimeclass.yaml
```

### Testing the runtime class

> Note the `spec.runtimeClassName` being explicitly set to `nvidia` in the pod spec.

Run the following command to test the runtime class:

```bash theme={null}
kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  nvidia-smi
```

## Collecting NVIDIA GPU debug data

When debugging NVIDIA GPU issues (for example, `NVRM: GPU has fallen off the bus` messages in the kernel log), NVIDIA support will often ask for the output of `nvidia-bug-report.sh`.

Talos does not allow direct shell access on the nodes, but you can still generate this report by using `kubectl debug`. To do this:

1. Start a debug pod on the affected node:

```bash theme={null}
# Replace ${USER} with any unique suffix if needed
kubectl run "node-debugger-${USER}" \
  --restart=Never \
  --namespace kube-system \
  --image nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
```

2. Then attach to it with the sysadmin debug profile:

```bash theme={null}
kubectl debug "node-debugger-${USER}" \
  -it \
  --namespace kube-system \
  --profile sysadmin
```

This will drop you into a shell inside a container running on the node.

3. Inside the debug container, download the NVIDIA driver bundle and extract `nvidia-bug-report.sh` by running the following commands:

a. Confirm the driver version Talos is using:

```bash theme={null}
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

b. Set the driver version and node architecture in variables:

```bash theme={null}
DRIVER_VERSION=<nvidia-driver-version>
ARCH=<node-architecture>
```

Replace the placeholders `<nvidia-driver-version>` and `<node-architecture>` with your actual values:

* `<nvidia-driver-version>`: The NVIDIA driver version running on your Talos nodes, which you found in step 3a.
* `<node-architecture>`: The architecture of the node, for example `x86_64`.

c. Download the NVIDIA driver bundle and extract the bug report script:

```bash theme={null}
apt-get update && apt-get install -y curl
curl -O "https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${DRIVER_VERSION}/NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run"

sh NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run --extract-only
cp NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}/nvidia-bug-report.sh /usr/bin/
```
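
To see how the two variables combine, the download URL can be assembled and printed before fetching anything. This is only a sketch that mirrors the URL pattern used in the commands above; the function name `nvidia_run_url` is hypothetical and the version shown is just a sample value.

```bash
#!/usr/bin/env bash
# Sketch: build the driver bundle URL from architecture and driver version,
# following the same pattern as the curl command in this guide.
nvidia_run_url() {
  local arch="$1" version="$2"
  printf 'https://us.download.nvidia.com/XFree86/Linux-%s/%s/NVIDIA-Linux-%s-%s.run\n' \
    "$arch" "$version" "$arch" "$version"
}

# Sample values; use the ones determined in the previous steps.
nvidia_run_url x86_64 515.65.01
```

Passing the result to `curl -fLO` instead of `curl -O` makes `curl` exit with an error on an HTTP failure rather than saving the error page as the `.run` file.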

4. Run `nvidia-bug-report.sh`:

```bash theme={null}
nvidia-bug-report.sh
```

This will generate `nvidia-bug-report.log.gz` in the current directory.

5. To copy the report off the cluster:

a. First, find the name of the debug container (if needed):

```bash theme={null}
kubectl get pod "node-debugger-${USER}" \
  --namespace kube-system \
  -o jsonpath='{.spec.containers[*].name}'
```

b. Then, from your workstation:

```bash theme={null}
kubectl cp \
  "kube-system/node-debugger-${USER}:/nvidia-bug-report.log.gz" \
  ./nvidia-bug-report.log.gz
```
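
Before sending the file on, it may be worth confirming that the copy arrived intact. A minimal sketch, assuming `gzip` is available on the workstation; the function name `check_report` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: check that a copied bug report is a non-empty, intact gzip file.
check_report() {
  local report="${1:-./nvidia-bug-report.log.gz}"
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$report" ]; then
    echo "missing or empty: $report" >&2
    return 1
  fi
  # gzip -t tests the compressed data without extracting it.
  if ! gzip -t "$report" 2>/dev/null; then
    echo "not a valid gzip archive: $report" >&2
    return 1
  fi
  echo "ok: $report"
}
```

Run `check_report` with no arguments in the directory where the report was copied, or pass a path explicitly.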

You can now upload `nvidia-bug-report.log.gz` to NVIDIA support.
