Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The NVIDIA drivers published by Talos are bound to a specific Talos release, so the extension versions also need to be updated when upgrading Talos.
We will be using the following NVIDIA system extensions:
nonfree-kmod-nvidia
nvidia-container-toolkit
To build an NVIDIA driver version not published by SideroLabs, follow the instructions here.
Create the boot assets that include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
Make sure the driver version matches for both the nonfree-kmod-nvidia and nvidia-container-toolkit extensions. The nonfree-kmod-nvidia extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
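As one possible way to request both extensions when generating boot assets, the sketch below uses the Image Factory schematic format. This is an assumption about the workflow, not something this page prescribes: the siderolabs/ prefixed names are assumed to match the extensions listed above, and they may differ between Talos releases, so check them against the extension list for your release.

```yaml
# Hypothetical schematic: asks for both NVIDIA extensions together.
# The extension names are assumptions; verify them for your Talos release.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia        # kernel modules, tied to a Talos release
      - siderolabs/nvidia-container-toolkit   # must ship the same driver version
```

Requesting both extensions through a single schematic leaves the pairing of driver versions to the tooling rather than to manually chosen tags.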
Enabling the NVIDIA modules and the system extension
Patch the Talos machine configuration using the patch gpu-worker-patch.yaml:
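The contents of the patch are not shown here. A minimal sketch, assuming the usual set of NVIDIA kernel modules and assuming the container toolkit needs the BPF JIT hardening sysctl lowered to 1, might look like this:

```yaml
# gpu-worker-patch.yaml (sketch; module list and sysctl value are assumptions)
machine:
  kernel:
    modules:
      # Kernel modules provided by the nonfree-kmod-nvidia extension
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    # Assumed requirement of the NVIDIA container runtime
    net.core.bpf_jit_harden: 1
```

A patch of this shape can be applied to the GPU nodes with talosctl patch mc --patch @gpu-worker-patch.yaml.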
Deploying NVIDIA device plugin
First, we need to create the RuntimeClass.
Apply the following manifest to create a runtime class that uses the extension:
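The manifest is not shown here. As a sketch, assuming the container runtime handler registered by the nvidia-container-toolkit extension is named nvidia, a RuntimeClass could be declared like this:

```yaml
# RuntimeClass sketch; the handler name "nvidia" is an assumption and
# must match the runtime handler configured on the node.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # the name pods reference via spec.runtimeClassName
handler: nvidia     # the container runtime handler on the node
```

Saved to a file, it can be created with kubectl apply -f followed by the file name.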
(Optional) Setting the default runtime class as nvidia
Do note that this will set the default runtime class to nvidia for all pods scheduled on the node.
Create a patch YAML file, nvidia-default-runtimeclass.yaml, to update the machine config, similar to the one below:
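The patch is not shown here. One way to sketch it, assuming Talos merges CRI customizations from /etc/cri/conf.d/20-customization.part and assuming a containerd release that uses the io.containerd.grpc.v1.cri plugin section (newer containerd releases name this section differently), is:

```yaml
# nvidia-default-runtimeclass.yaml (sketch; the file path and the
# containerd plugin section name are assumptions for illustration)
machine:
  files:
    - op: create
      path: /etc/cri/conf.d/20-customization.part
      content: |
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
```

With a default like this in place, pods on the node use the nvidia runtime even when spec.runtimeClassName is not set.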
Testing the runtime class
Note the spec.runtimeClassName being explicitly set to nvidia in the pod spec.
Run the following command to test the runtime class:
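The command is not shown here. As a stand-in, the same check can be sketched as a Pod manifest that runs nvidia-smi once under the nvidia runtime class. The pod name and the CUDA image tag are assumptions chosen for illustration; any image that ships nvidia-smi should serve.

```yaml
# Test pod sketch; the pod name and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia    # explicitly selects the NVIDIA runtime
  containers:
    - name: nvidia-test
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
```

After applying the manifest with kubectl apply -f, kubectl logs nvidia-test should show the nvidia-smi output if the GPU is visible to the container.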