# AMD GPU Support (ROCm)

> Enable AMD GPUs on Talos and expose them to Kubernetes using the ROCm GPU Operator.

Talos supports AMD GPUs by loading the standard Linux `amdgpu` driver at boot.

To make those GPUs available to Kubernetes workloads, you can deploy the ROCm GPU Operator.

This guide shows how to enable AMD GPU support on your Talos nodes, apply any tuning your hardware might need, and install the ROCm GPU Operator in your cluster.

## Before you begin

You’ll need:

* A Talos Linux cluster running v1.11 or later.

* At least one node with an AMD GPU.

* Basic familiarity with editing and applying Talos machine configuration.

* The following Talos system extensions:

  * `siderolabs/amdgpu`

  * `siderolabs/amd-ucode`

Most common AMD GPUs require only the standard `amdgpu` driver included with Talos.

Some newer GPUs may require additional tuning, which we’ll cover later in this guide.

## Enable AMD GPU support

Enable AMD GPU firmware and driver support by adding the following system extensions to your worker node configuration:

```yaml theme={null}
machine:
  type: worker

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/amdgpu
      - siderolabs/amd-ucode
```

Once you apply the configuration and the node reboots, the kernel loads the `amdgpu` module and makes the GPU available to the operating system.

## Optional: GPU tuning

Some hardware may require additional kernel arguments or memory tuning, particularly newly released or high-performance GPUs.

Example configuration for an AMD Ryzen AI Max+ 395 (Strix Halo) system:

```yaml theme={null}
customization:
  extraKernelArgs:
    - amd_iommu=off
    - amdgpu.gttsize=131072
    - ttm.pages_limit=33554432
  systemExtensions:
    officialExtensions:
      - siderolabs/amdgpu
      - siderolabs/amd-ucode
```

What these parameters do:

* `amd_iommu=off`: Disables AMD IOMMU initialization. Useful when passthrough or PCI initialization causes issues.
* `amdgpu.gttsize`: Sets the GPU GTT memory size, in MiB. Raising it helps workloads that allocate large buffers.
* `ttm.pages_limit`: Raises the TTM memory limit, counted in pages, for large model workloads.
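The two memory values in the example are easier to adapt once the units are explicit. The sketch below is illustrative and assumes `amdgpu.gttsize` is given in MiB and `ttm.pages_limit` counts 4 KiB pages (the x86-64 default). The helper name `gtt_kernel_args` is hypothetical, not part of Talos or the driver.

```bash
# Convert a target GTT size in GiB into the two kernel argument values.
# Assumptions: amdgpu.gttsize is in MiB, ttm.pages_limit counts 4 KiB pages.
gtt_kernel_args() {
  gib="$1"
  mib=$(( gib * 1024 ))                 # GiB -> MiB
  pages=$(( gib * 1024 * 1024 / 4 ))    # GiB -> KiB, then 4 KiB per page
  printf 'amdgpu.gttsize=%d\n' "$mib"
  printf 'ttm.pages_limit=%d\n' "$pages"
}

# 128 GiB reproduces the values used in the Strix Halo example above.
gtt_kernel_args 128
```

Under these assumptions, both example values correspond to the same 128 GiB budget, so it is reasonable to scale them together when targeting a different amount of memory.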

## Deploy the ROCm GPU operator

With GPU support enabled at the OS level, you can deploy the ROCm GPU Operator to surface GPU resources to Kubernetes workloads.

Add the ROCm Helm repository:

```bash theme={null}
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
```

Install the operator:

```bash theme={null}
helm install rocm-gpu-operator rocm/gpu-operator -n kube-system
```

Check operator status:

```bash theme={null}
kubectl get pods -n kube-system | grep gpu
```

Inspect your node to confirm GPU resources. Replace the `<node-name>` placeholder with the name of your node:

```bash theme={null}
kubectl describe node <node-name> | grep -i gpu
```

## Troubleshooting

Issues can show up at different layers depending on your hardware, kernel version, or virtualization platform. The following sections outline the most common problems and how to diagnose them.

### GPU not detected

If the GPU does not appear:

1. Confirm that system extensions are installed:

```bash theme={null}
talosctl get extensions --nodes <node-ip>
```

2. Review kernel logs:

```bash theme={null}
talosctl dmesg --nodes <node-ip>
```

3. Check PCI visibility:

```bash theme={null}
talosctl get devices.pci --nodes <node-ip>
```
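The checks above can be run in one pass with a small wrapper. This is a minimal sketch: the function name `diagnose_amd_gpu` is hypothetical, kernel messages are read with `talosctl dmesg` and filtered for `amdgpu`, and the node IP is passed as the first argument.

```bash
# Run the GPU detection checks against a single node, in order.
diagnose_amd_gpu() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: diagnose_amd_gpu <node-ip>" >&2
    return 1
  fi

  echo "== Installed system extensions =="
  talosctl get extensions --nodes "$node"

  echo "== Kernel messages mentioning amdgpu =="
  # grep exits non-zero when nothing matches, so report that explicitly.
  talosctl dmesg --nodes "$node" | grep -i amdgpu \
    || echo "no amdgpu messages found"

  echo "== PCI devices =="
  talosctl get devices.pci --nodes "$node"
}
```

Call it as `diagnose_amd_gpu <node-ip>`. If the kernel section reports no `amdgpu` messages, the driver most likely never loaded, which points back to the extension list in the first section of the output.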

### ROCm operator issues

If the operator fails to initialize, inspect logs:

```bash theme={null}
kubectl logs -n kube-system deployment/rocm-gpu-operator
```

Confirm that:

* System extensions are active.
* The GPU is visible in Talos.
* GPU firmware matches ROCm expectations.
