Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The NVIDIA drivers published for Talos are bound to a specific Talos release, and the extension versions also need to be updated when upgrading Talos.

We will be using the following NVIDIA system extensions:
- `nonfree-kmod-nvidia`
- `nvidia-container-toolkit`
To build an NVIDIA driver version not published by Sidero Labs, follow the instructions here.

Create the boot assets which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
Make sure the driver version matches for both the `nonfree-kmod-nvidia` and `nvidia-container-toolkit` extensions. The `nonfree-kmod-nvidia` extension is versioned as `<nvidia-driver-version>-<talos-release-version>`, and the `nvidia-container-toolkit` extension is versioned as `<nvidia-driver-version>-<nvidia-container-toolkit-version>`.
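One way to produce boot assets with these extensions is the Talos Image Factory, which matches extension versions to the selected Talos release automatically. The schematic below is a minimal sketch; the extension names are taken from the siderolabs/extensions catalog (newer releases also publish `-production`/`-lts` variants), so verify them against the catalog for your release:

```yaml
# schematic.yaml — a sketch of an Image Factory schematic bundling the
# two NVIDIA system extensions into the generated boot assets.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```

Uploading the schematic returns an ID identifying the matching installer image:

```bash
# The response contains a schematic ID, used in image references such as
# factory.talos.dev/installer/<id>:<talos-version>
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
```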
Proprietary vs OSS NVIDIA Driver Support
The NVIDIA Linux GPU Driver contains several kernel modules: `nvidia.ko`, `nvidia-modeset.ko`, `nvidia-uvm.ko`, `nvidia-drm.ko`, and `nvidia-peermem.ko`.
Two “flavors” of these kernel modules are provided, and both are available for use within Talos:
- Proprietary: this is the flavor that NVIDIA has historically shipped.
- Open, i.e. source-published/OSS, kernel modules that are dual licensed MIT/GPLv2 (see the schematic sketch after this list). With every driver release, the source code to the open kernel modules is published on https://github.com/NVIDIA/open-gpu-kernel-modules and a tarball is provided on https://download.nvidia.com/XFree86/.
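If you opt for the open flavor instead, the corresponding official extension in the siderolabs/extensions catalog is `nvidia-open-gpu-kernel-modules`; a sketch of the swap in the schematic shown earlier (names assumed, verify against the catalog for your release):

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nvidia-open-gpu-kernel-modules   # open/OSS kernel modules flavor
      - siderolabs/nvidia-container-toolkit
```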
Enabling the NVIDIA modules and the system extension
Patch the Talos machine configuration using the patch `gpu-worker-patch.yaml`.
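A sketch of such a patch, loading the kernel modules shipped by the `nonfree-kmod-nvidia` extension (the module list and sysctl mirror what the proprietary driver typically needs; confirm them for your driver version):

```yaml
# gpu-worker-patch.yaml — load the NVIDIA kernel modules provided by the
# nonfree-kmod-nvidia system extension.
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```

Apply it to the GPU worker nodes:

```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```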
Deploying NVIDIA device plugin
First, we need to create the `RuntimeClass`. Apply the following manifest to create a runtime class that uses the extension.
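A minimal sketch of the manifest, assuming the `nvidia` runtime handler that the `nvidia-container-toolkit` extension registers with containerd:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia    # referenced by pods via spec.runtimeClassName
handler: nvidia   # containerd runtime handler provided by the toolkit extension
```

The device plugin itself is commonly deployed with NVIDIA's upstream Helm chart; this invocation is a sketch (the `nvdp` repo alias is arbitrary, and you should pin a chart version suitable for your cluster):

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# runtimeClassName=nvidia makes the plugin pods use the runtime class above
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --set runtimeClassName=nvidia
```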
(Optional) Setting the default runtime class as nvidia
Do note that this will set the default runtime class to `nvidia` for all pods scheduled on the node.
Create a patch YAML `nvidia-default-runtimeclass.yaml` to update the machine config, similar to the one below.
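A sketch of such a patch, following the Talos pattern of dropping a containerd configuration fragment under `/etc/cri/conf.d`. The containerd plugin key is an assumption matching containerd 2.x; Talos releases shipping containerd 1.x use `io.containerd.grpc.v1.cri` instead:

```yaml
# nvidia-default-runtimeclass.yaml — make "nvidia" the default containerd runtime
machine:
  files:
    - content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create
```

Apply it with `talosctl patch mc --patch @nvidia-default-runtimeclass.yaml`.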
Testing the runtime class
Run the following command to test the runtime class; note the `spec.runtimeClassName` being explicitly set to `nvidia` in the pod spec.
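A sketch of such a test pod, assuming a CUDA base image from nvcr.io (substitute any tag available there):

```bash
kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  nvidia-smi
```

If the runtime class is wired up correctly, `nvidia-smi` prints the driver version and the GPUs visible to the pod.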