Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The Talos-published NVIDIA drivers are bound to a specific Talos release, so the extension versions also need to be updated when upgrading Talos. We will be using the following NVIDIA system extensions:
- `nonfree-kmod-nvidia`
- `nvidia-container-toolkit`
To build an NVIDIA driver version not published by Sidero Labs, follow the instructions here. Create the boot assets which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
Make sure the driver version matches for both the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions. The `nonfree-kmod-nvidia` extension is versioned as `<nvidia-driver-version>-<talos-release-version>` and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
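As a sketch, a boot-asset schematic for the Image Factory that includes both extensions might look like the following (the extension names follow the `siderolabs/` prefix convention; pin the versions as described above):

```yaml
# schematic.yaml — sketch for the Talos Image Factory; verify extension names
# and versions against the published extensions for your Talos release.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```

Submitting this schematic to the Image Factory yields an installer image (and other boot assets) with both extensions baked in.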
Proprietary vs. OSS NVIDIA driver support
The NVIDIA Linux GPU Driver contains several kernel modules: `nvidia.ko`, `nvidia-modeset.ko`, `nvidia-uvm.ko`, `nvidia-drm.ko`, and `nvidia-peermem.ko`.
Two “flavors” of these kernel modules are provided, and both are available for use within Talos:
- Proprietary: the flavor that NVIDIA has historically shipped.
- Open, i.e. source-published/OSS kernel modules that are dual-licensed MIT/GPLv2. With every driver release, the source code of the open kernel modules is published at https://github.com/NVIDIA/open-gpu-kernel-modules and a tarball is provided at https://download.nvidia.com/XFree86/.
Enabling the NVIDIA modules and the system extension
Patch the Talos machine configuration using the patch `gpu-worker-patch.yaml`:
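A minimal sketch of such a patch, assuming the module names shipped by the `nonfree-kmod-nvidia` extension (verify them against the extension's documentation for your driver version):

```yaml
# gpu-worker-patch.yaml — sketch; module names assume the nonfree-kmod-nvidia extension
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```

Apply it to the GPU worker nodes with `talosctl patch mc --patch @gpu-worker-patch.yaml`.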
Deploying NVIDIA GPU Operator
Follow the upstream instructions, passing only the Helm chart values specific to Talos: disable the driver and toolkit components of the GPU operator, since they are already enabled as system extensions on Talos. Further custom values may need to be passed to the Helm chart depending on the cluster configuration. The GPU operator creates a runtime class named `nvidia`, which can be used to schedule GPU workloads.
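As a sketch, the Helm invocation could look like the following, assuming the upstream `gpu-operator` chart from NVIDIA's NGC Helm repository; the `driver.enabled` and `toolkit.enabled` values disable the components Talos already provides:

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Disable the driver and toolkit components: both are already provided
# by the Talos system extensions.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```

Consult the chart's values reference for any additional settings your cluster configuration requires.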
Testing the runtime class
To test the runtime class, apply a pod spec with `spec.runtimeClassName` explicitly set to `nvidia`.
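A minimal test pod, as a sketch; the CUDA sample image tag is illustrative and may need updating:

```yaml
# gpu-test.yaml — sketch; image tag is an assumption, pick a current CUDA sample
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # explicitly select the NVIDIA runtime class
  containers:
    - name: cuda-vector-add
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
```

Apply it with `kubectl apply -f gpu-test.yaml` and check the pod logs for a successful vector-add run.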
Collecting NVIDIA GPU debug data
When debugging NVIDIA GPU issues (for example, `NVRM: GPU has fallen off the bus` messages in the kernel log), NVIDIA support will often ask for the output of `nvidia-bug-report.sh`.
Talos does not allow direct shell access on the nodes, but you can still generate this report by following the steps below:
- Start a debug pod on the affected node:
- This will drop you into a shell inside a container running on the node. From here, you can install the necessary tools to run `nvidia-bug-report.sh` and generate the report.
- Inside the debug container, run `nvidia-bug-report.sh`. The script writes `nvidia-bug-report.log.gz` to the current directory.
- To copy the report off the cluster, use `kubectl cp` from your workstation, then send `nvidia-bug-report.log.gz` to NVIDIA support.
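The steps above can be sketched end-to-end as follows; the node name, debug-pod name, and tool package are placeholders that you will need to adapt:

```shell
# 1. Start a debug pod on the affected node (node name is a placeholder).
kubectl debug node/worker-1 -it --image=ubuntu

# 2. Inside the debug container: install the driver utilities, then run the report.
#    The package name/version is an assumption — match it to your driver version.
apt-get update && apt-get install -y nvidia-utils-535
nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz to the current directory

# 3. From your workstation, copy the report out of the debug pod
#    (the pod name is a placeholder — find it with `kubectl get pods`).
kubectl cp default/node-debugger-worker-1-abcde:/nvidia-bug-report.log.gz \
  ./nvidia-bug-report.log.gz
```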