Karpenter is a high-performance Kubernetes autoscaler that launches and terminates nodes in response to real-time workload demand.
This guide walks you through integrating Karpenter with Talos Linux clusters running on AWS and managed through Omni.
Prerequisites
Before you begin, you must have:
- AWS CLI configured
- kubectl, talosctl, helm, and omnictl installed
Step 1: Create your Talos AMIs
You will need two AMIs: one for your control plane machines and one for your worker machines. To create each AMI:
- Click Download Installation Media in the Omni dashboard.
- Select the AWS AMI for your architecture.
- Add a Machine User Label that identifies the machine’s role. For example, for a control plane machine: role:karpenter-controlplane-machine.
- Click Download to generate the AMI file.
- Upload the AMI file to your AWS account. Follow the instructions in the Create Your Own AMIs guide.
Repeat the process to create the worker AMI, but use a worker-specific label (e.g. role:karpenter-worker-machine).
Note: The AMI’s final name in AWS will match the filename you upload to S3. You can use naming conventions to clearly distinguish between your control plane and worker AMIs.
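For example, you can confirm that both AMIs were registered under the expected names by listing the images owned by your account (the talos-* filter below is only an assumption about your chosen naming convention):
aws ec2 describe-images \
--owners self \
--filters "Name=name,Values=talos-*" \
--query 'Images[].{Name:Name,ImageId:ImageId}' \
--output table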
Step 2: Create your machines using the AMIs
Use the control plane AMI to create your control plane machines.
Using an odd number of control plane nodes helps etcd maintain quorum reliably. For high availability, we recommend creating three control plane nodes.
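If you launch these instances with the AWS CLI, a minimal sketch looks like this (the instance type, subnet, and security group are placeholders, not recommendations; you can just as well launch the instances from the EC2 console):
aws ec2 run-instances \
--image-id <control-plane-ami-id> \
--count 3 \
--instance-type m5.large \
--subnet-id <subnet-id> \
--security-group-ids <security-group-id> \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=karpenter-controlplane}]'
Once the instances boot, they connect to your Omni instance (the connection details are embedded in the installation media) and appear in the Machines list with the label you set in Step 1.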
Step 3: Create Machine Classes
Machine Classes group machines by labels and act as the pool description for clusters.
You’ll create two Machine Classes:
- one for control plane machines
- one for worker machines
3.1: Control plane Machine Class
To create a machine class for your control plane machines:
- Create a file named controlplane-machine-class.yaml:
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: karpenter-controlplane
spec:
  matchlabels:
    # Match the label you applied to your control plane AMI
    - role = karpenter-controlplane-machine
This creates a karpenter-controlplane Machine Class that matches machines using the control plane label you added to the AMI (e.g., role:karpenter-controlplane-machine).
- Apply it:
omnictl apply -f controlplane-machine-class.yaml
- Verify it exists:
omnictl get machineclasses
3.2: Worker Machine Class
Repeat the process for the worker machines:
- Create a file named worker-machine-class.yaml:
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: karpenter-worker
spec:
  matchlabels:
    # Match the label you applied to your worker AMI
    - role = karpenter-worker-machine
- Apply it:
omnictl apply -f worker-machine-class.yaml
- Verify:
omnictl get machineclasses
Step 4: Create a cluster
Next, create a cluster using cluster templates:
- Create a file named cluster-template.yaml and paste the following YAML, updating the placeholders as needed:
# cluster-template.yaml
kind: Cluster
name: <cluster-name> # Replace with your cluster name
kubernetes:
  version: v1.34.1 # Replace with your preferred Kubernetes version
talos:
  version: v1.11.3 # Replace with your preferred Talos Linux version
---
kind: ControlPlane
machineClass:
  name: <name of your control plane machine class>
  size: <number of control plane machines>
---
kind: Workers
machineClass:
  name: <name of your worker machine class>
  size: Unlimited
This template creates a cluster and assigns machines to the appropriate Machine Classes.
Replace the following placeholders:
- <cluster-name> — the name you want to give your cluster.
- <name of your control plane machine class> — the Machine Class for your control plane nodes.
- <number of control plane machines> — how many control plane nodes your cluster should have.
- <name of your worker machine class> — the Machine Class for your worker nodes.
- Apply the cluster-template.yaml to create the cluster:
omnictl cluster template sync -f cluster-template.yaml
- Once your cluster is running, download and export its kubeconfig so you can interact with it. Replace <cluster-name> with the name of your cluster:
omnictl kubeconfig -c <cluster-name>
- Verify that all machines have joined the cluster and the nodes are Ready:
kubectl get nodes
Step 5: Define all variables
Set the environment variables you will use throughout this guide:
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export CLUSTER_NAME="<cluster-name>" # Your cluster name
export AWS_REGION="<aws-region>" # Example: eu-west-1
export TALOS_AMI_ID="<worker-ami-id>" # Example: ami-0123456789abcdef0
# Subnets where Karpenter is allowed to launch worker nodes
export SUBNET_IDS="subnet-xxxx"
# Security group that will be attached to all worker nodes launched by Karpenter
export SECURITY_GROUP_IDS="sg-xxxx"
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="1.8.1"
export CLUSTER_ENDPOINT="https://<your-endpoint>.omni.siderolabs.io" # Your Omni API endpoint
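If you do not have the subnet and security group IDs at hand, you can look them up with the AWS CLI (the VPC ID below is a placeholder; adjust the filters to your environment):
aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=<vpc-id>" \
--query 'Subnets[].{Id:SubnetId,AZ:AvailabilityZone,Cidr:CidrBlock}' \
--output table
aws ec2 describe-security-groups \
--filters "Name=vpc-id,Values=<vpc-id>" \
--query 'SecurityGroups[].{Id:GroupId,Name:GroupName}' \
--output table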
Step 6: Tag subnets and security groups for Karpenter discovery
Karpenter can only launch nodes into subnets and security groups that you explicitly mark for discovery.
You do this by adding a karpenter.sh/discovery tag to each resource.
Run the following commands to tag your subnets and security groups:
aws ec2 create-tags \
--resources $SUBNET_IDS \
--tags Key=karpenter.sh/discovery,Value=$CLUSTER_NAME
aws ec2 create-tags \
--resources $SECURITY_GROUP_IDS \
--tags Key=karpenter.sh/discovery,Value=$CLUSTER_NAME
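Optionally, confirm that the tags were applied:
aws ec2 describe-tags \
--filters "Name=key,Values=karpenter.sh/discovery" "Name=value,Values=$CLUSTER_NAME" \
--query 'Tags[].{Resource:ResourceId,Type:ResourceType}' \
--output table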
Step 7: Fix IAM permissions on the Karpenter node
Karpenter uses the IAM role of the node where the Karpenter controller pod runs.
If the node running the Karpenter pod has no IAM instance profile (which is common when using Omni), Karpenter will fail with AccessDenied errors.
This step ensures your Karpenter nodes have the correct AWS permissions.
7.1: Identify the Karpenter node
First, retrieve your control plane nodes and their IP addresses:
kubectl get nodes -o wide
Pick the node where Karpenter will run, then set its name and IP address as environment variables:
export KARPENTER_NODE_NAME="<replace-with-node-name>"
export KARPENTER_NODE_IP="<replace-with-node-ip>"
Next, find the EC2 instance ID that corresponds to that IP:
export KARPENTER_INSTANCE_ID=$(aws ec2 describe-instances \
--filters "Name=private-ip-address,Values=$KARPENTER_NODE_IP" \
--query 'Reservations[0].Instances[0].InstanceId' \
--output text)
7.2: Create an instance profile and IAM role
Create a new instance profile and IAM role for the Karpenter node:
aws iam create-instance-profile \
--instance-profile-name talos-karpenter-profile
aws iam create-role \
--role-name talos-karpenter-role \
--assume-role-policy-document file://<(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
)
Next, attach the role to the instance profile:
aws iam add-role-to-instance-profile \
--instance-profile-name talos-karpenter-profile \
--role-name talos-karpenter-role
Then, associate the instance profile with the Karpenter node’s EC2 instance:
aws ec2 associate-iam-instance-profile \
--instance-id "$KARPENTER_INSTANCE_ID" \
--iam-instance-profile Name="talos-karpenter-profile"
Export the values for later steps:
export INSTANCE_PROFILE_NAME="talos-karpenter-profile"
export INSTANCE_ROLE_NAME="talos-karpenter-role"
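Optionally, confirm that the profile is now associated with the instance:
aws ec2 describe-iam-instance-profile-associations \
--filters "Name=instance-id,Values=$KARPENTER_INSTANCE_ID" \
--query 'IamInstanceProfileAssociations[].{State:State,Profile:IamInstanceProfile.Arn}' \
--output table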
7.3: Attach required IAM permissions
Next, create and attach an IAM policy that grants Karpenter the permissions to launch, tag, and terminate EC2 instances for this cluster. First, render the policy document from the template:
curl -fsSL https://docs.siderolabs.com/scripts/karpenter-iam.template \
| envsubst > karpenter-policy.json
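The rendered karpenter-policy.json is defined by the template, but conceptually it grants permissions along these lines (illustrative sketch only, not the authoritative document):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KarpenterController",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:RunInstances",
        "ec2:CreateFleet",
        "ec2:CreateTags",
        "ec2:TerminateInstances",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
Create the IAM policy from the rendered file and capture its ARN: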
export KARPENTER_POLICY_ARN=$(aws iam create-policy \
--policy-name talos-karpenter-controller \
--policy-document file://karpenter-policy.json \
--query 'Policy.Arn' \
--output text)
Then attach the policy to the Karpenter role:
aws iam attach-role-policy \
--role-name "$INSTANCE_ROLE_NAME" \
--policy-arn "$KARPENTER_POLICY_ARN"
Step 8: Create Karpenter interruption queue
Karpenter can respond to AWS node interruption events (such as maintenance, spot interruptions, or scheduled shutdowns).
To enable this, create a simple SQS queue that Karpenter can watch. When AWS publishes an interruption event, Karpenter drains and replaces the node before it terminates.
Note: When Karpenter (or AWS) terminates an EC2 instance, the matching node is removed from Kubernetes, but the machine is not deleted automatically in Omni. You may still see terminated machines listed in the Omni UI and need to clean them up manually.
aws sqs create-queue \
--queue-name "$CLUSTER_NAME" >/dev/null 2>&1 || true
export KARPENTER_QUEUE_NAME="$CLUSTER_NAME"
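You can verify the queue and fetch its URL (the same lookup is used again during cleanup):
aws sqs get-queue-url --queue-name "$KARPENTER_QUEUE_NAME"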
Step 9: Install Karpenter via Helm
At this point Karpenter has everything it needs to provision new machines: the cluster endpoint and the required IAM permissions. Install it with Helm:
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
--namespace "$KARPENTER_NAMESPACE" \
--create-namespace \
--version "$KARPENTER_VERSION" \
--set settings.clusterName="$CLUSTER_NAME" \
--set settings.clusterEndpoint="$CLUSTER_ENDPOINT" \
--set settings.aws.defaultInstanceProfile="$INSTANCE_PROFILE_NAME" \
--set settings.aws.interruptionQueueName="$KARPENTER_QUEUE_NAME" \
--set-json tolerations='[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}]'
Add the topology spread label required by Karpenter:
kubectl label node $KARPENTER_NODE_NAME topology.kubernetes.io/zone=zone-1
Confirm that the controller is running:
kubectl -n kube-system get pods -l app.kubernetes.io/name=karpenter
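If the pod does not become Ready, the controller logs usually point to the cause (for example, missing IAM permissions):
kubectl -n "$KARPENTER_NAMESPACE" logs deployment/karpenter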
Step 10: Create an EC2NodeClass
The EC2NodeClass tells Karpenter how to launch Talos worker machines: which AMI to use, which subnets and security groups to join, and which IAM instance profile to attach.
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: talos-workers
spec:
  amiFamily: Custom
  amiSelectorTerms:
    - id: $TALOS_AMI_ID
  instanceProfile: $INSTANCE_PROFILE_NAME
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: $CLUSTER_NAME
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: $CLUSTER_NAME
  tags:
    karpenter.sh/discovery: $CLUSTER_NAME
EOF
kubectl get ec2nodeclass
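You can also inspect the resource status to confirm that Karpenter resolved the AMI, subnets, and security groups from the selector terms:
kubectl describe ec2nodeclass talos-workers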
Step 11: Create a NodePool
The NodePool defines what Karpenter is allowed to provision.
This includes limits, disruption behavior, labels, and instance-type requirements.
cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: talos-default
spec:
  limits:
    cpu: "12" # replace with your limits
    memory: "24Gi" # replace with your limits
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  template:
    metadata:
      labels:
        node.kubernetes.io/role: worker
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: talos-workers
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF
kubectl get nodepool
Step 12: Deploy a workload that triggers autoscaling
Now deploy a simple workload that Karpenter can scale against:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-demo
spec:
  replicas: 0
  selector:
    matchLabels:
      app: karpenter-demo
  template:
    metadata:
      labels:
        app: karpenter-demo
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
EOF
Confirm the deployment exists:
kubectl get deploy karpenter-demo
Step 13: Trigger autoscaling
Increase the number of replicas to create pending pods and trigger Karpenter to provision new nodes:
kubectl scale deploy karpenter-demo --replicas=10
Watch the pending pods, the NodeClaims that Karpenter creates, and the new nodes as they join:
kubectl get pods -o wide
kubectl get nodeclaims
kubectl get nodes -o wide
You can also watch the new machines appear in the Omni dashboard as Karpenter provisions them and they join the cluster.
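Each new node corresponds to a NodeClaim; describing it shows the instance type Karpenter selected and the launch progress:
kubectl describe nodeclaim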
Cleanup
Delete the resources created in this guide:
- First, remove the demo deployment and the Karpenter custom resources:
kubectl scale deploy karpenter-demo --replicas=0
kubectl delete deploy karpenter-demo
kubectl delete nodepool talos-default
kubectl delete ec2nodeclass talos-workers
- Uninstall Karpenter:
helm uninstall karpenter -n "$KARPENTER_NAMESPACE"
- Delete the interruption queue:
aws sqs delete-queue \
--queue-url "$(aws sqs get-queue-url \
--queue-name "$KARPENTER_QUEUE_NAME" \
--query 'QueueUrl' \
--output text)"
- Detach and delete the IAM policy:
aws iam detach-role-policy \
--role-name "$INSTANCE_ROLE_NAME" \
--policy-arn "$KARPENTER_POLICY_ARN"
aws iam delete-policy \
--policy-arn "$KARPENTER_POLICY_ARN"
- Remove and delete the instance profile and role:
aws iam remove-role-from-instance-profile \
--instance-profile-name "$INSTANCE_PROFILE_NAME" \
--role-name "$INSTANCE_ROLE_NAME"
aws iam delete-role \
--role-name "$INSTANCE_ROLE_NAME"
aws iam delete-instance-profile \
--instance-profile-name "$INSTANCE_PROFILE_NAME"
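- Optionally, if you also want to remove the cluster itself, delete it with the same template you used to create it (this removes the cluster from Omni; the control plane EC2 instances you created manually in Step 2 still need to be terminated in AWS):
omnictl cluster template delete -f cluster-template.yaml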