# Autoscale Your Cluster with Cluster Autoscaler on AWS

> Configure Cluster Autoscaler for Talos Linux clusters running on AWS using Omni

export const version = 'v1.13';

This guide shows you how to enable automatic scaling for your Talos Linux cluster on AWS using Cluster Autoscaler and Omni.

## Prerequisites

Before you begin, you must have:

* AWS CLI configured
* `kubectl`, `talosctl`, `helm`, `omnictl`, and `jq` installed

## Step 1: Create IAM role for Cluster Autoscaler

Cluster Autoscaler uses the IAM role attached to the EC2 instances where it runs.

In this guide, Cluster Autoscaler is configured to run on the control plane nodes, so the IAM role must be attached to the control plane machines once it's created.

To create the IAM role and attach it to your control plane machines, you need:

* An IAM policy that defines the permissions required by Cluster Autoscaler
* An IAM role that uses the policy
* An instance profile that allows EC2 instances to assume the IAM role

### 1.1: Define environment variables

First, define the variables used throughout the IAM setup:

```bash theme={null}
CLUSTER_NAME=cluster-autoscaler-aws
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

AUTOSCALER_ROLE_NAME="${CLUSTER_NAME}-autoscaler-role"
AUTOSCALER_POLICY_NAME="${CLUSTER_NAME}-ClusterAutoscalerPolicy"
AUTOSCALER_INSTANCE_PROFILE_NAME="${CLUSTER_NAME}-autoscaler-instance-profile"
```
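Every later command depends on these variables, and an empty `ACCOUNT_ID` (for example, when AWS credentials are missing) would produce malformed ARNs. A minimal sketch of an early guard, using a hypothetical helper named `require_set` that is not part of any tool in this guide:

```bash
# Hypothetical helper: takes NAME VALUE pairs and fails if any value is empty.
require_set() {
  local rc=0
  while [ "$#" -ge 2 ]; do
    if [ -z "$2" ]; then
      echo "Missing value for $1" >&2
      rc=1
    fi
    shift 2
  done
  return "$rc"
}

# Usage, after defining the variables above:
# require_set CLUSTER_NAME "$CLUSTER_NAME" ACCOUNT_ID "$ACCOUNT_ID" \
#   AUTOSCALER_ROLE_NAME "$AUTOSCALER_ROLE_NAME"
```

Passing the name and the value as a pair keeps the helper free of `eval` or indirect expansion, at the cost of a slightly longer call.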

### 1.2: Create IAM policy

Next, create an IAM policy that grants Cluster Autoscaler permission to:

* Adjust Auto Scaling Group capacity
* Discover tagged node groups
* Describe EC2 and ASG resources

The policy is scoped using AWS resource tags so it only manages Auto Scaling Groups associated with this cluster.

```bash theme={null}
cat <<EOF | aws iam create-policy \
  --policy-name $AUTOSCALER_POLICY_NAME \
  --policy-document file:///dev/stdin
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/k8s.io/cluster-autoscaler/enabled": "true",
          "aws:ResourceTag/k8s.io/cluster-autoscaler/$CLUSTER_NAME": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstances",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeImages"
      ],
      "Resource": "*"
    }
  ]
}
EOF
```

### 1.3: Create IAM role and instance profile

First, create a trust policy that allows EC2 instances to assume the role:

```bash theme={null}
cat <<EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```

Create the IAM role using the trust policy:

```bash theme={null}
aws iam create-role \
  --role-name $AUTOSCALER_ROLE_NAME \
  --assume-role-policy-document file://trust-policy.json
```

Attach the Cluster Autoscaler policy to the IAM role:

```bash theme={null}
aws iam attach-role-policy \
  --role-name $AUTOSCALER_ROLE_NAME \
  --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME
```

Now create an instance profile so the role can be associated with EC2 instances:

```bash theme={null}
aws iam create-instance-profile \
  --instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME

aws iam add-role-to-instance-profile \
  --instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME \
  --role-name $AUTOSCALER_ROLE_NAME

echo "Waiting for IAM instance profile propagation..."
sleep 20
```
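A fixed `sleep` is only a guess at how long IAM propagation takes. One alternative is to retry a dependent command until it succeeds. The following is a minimal sketch with a hypothetical `retry` helper; the attempt count and delay are illustrative values, not recommendations from this guide:

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (assumes the variables from section 1.1):
# retry 10 3 aws iam get-instance-profile \
#   --instance-profile-name "$AUTOSCALER_INSTANCE_PROFILE_NAME"
```

A successful `get-instance-profile` call only confirms that IAM returns the profile. EC2 may still need a moment before it accepts it, so wrapping the later `run-instances` call in the same helper may be the more direct check.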

## Step 2: Launch control plane

With IAM configured, you can now launch the control plane machines.

These control plane instances are not managed by an Auto Scaling Group. They are created manually and will run Cluster Autoscaler.

### 2.1: Define environment variables

Start by defining the AWS region, Talos version, architecture, instance type, and the number of control plane machines to create.

<Note>For high availability, we recommend creating three control plane machines.</Note>

```bash theme={null}
AWS_REGION=$(aws configure get region)
TALOS_VERSION=v1.12.4
ARCH=amd64
INSTANCE_TYPE=t3.small
CONTROL_PLANE_NO=3
```

### 2.2: Retrieve the official Talos AMI

Fetch the Talos AWS AMI for your region and architecture from the official Talos release metadata.

If you need to customize your AMI (for example, by adding custom labels or extensions), you must create your own AMI and bake those customizations into it. For more information, refer to the [Register AWS Machines in Omni](../../omni-cluster-setup/registering-machines/how-to-register-an-aws-ec2-instance) documentation.

```bash theme={null}
AMI=$(curl -sL https://github.com/siderolabs/talos/releases/download/${TALOS_VERSION}/cloud-images.json \
  | jq -r '.[] | select(.region == "'"$AWS_REGION"'") | select(.arch == "'"$ARCH"'") | .id')

echo "Using AMI: $AMI"
```
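If the release metadata has no entry for your region and architecture, or the download fails, `AMI` ends up empty and later commands fail with less obvious errors. A minimal sketch of a sanity check, using a hypothetical helper named `is_ami_id`:

```bash
# Hypothetical check: succeed only if the value looks like an EC2 AMI ID.
is_ami_id() {
  case "$1" in
    ami-?*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage, after the lookup above:
# is_ami_id "$AMI" || echo "No Talos AMI found for $AWS_REGION/$ARCH in $TALOS_VERSION" >&2
```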

### 2.3: Generate control plane join configuration

Generate the join configuration that registers the Talos nodes with Omni on boot. Encode it for use as EC2 user data:

```bash theme={null}
USER_DATA=$(omnictl jointoken machine-config)
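# Strip line breaks: GNU base64 wraps its output, which would break the JSON used in Step 6
USER_DATA_B64=$(echo "$USER_DATA" | base64 | tr -d '\n')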
```

### 2.4: Launch three control plane instances

Launch the control plane EC2 instances using:

* The Talos AMI
* The IAM instance profile created in **Step 1**
* The join configuration as user data

```bash theme={null}
aws ec2 run-instances \
  --region $AWS_REGION \
  --image-id $AMI \
  --instance-type $INSTANCE_TYPE \
  --count $CONTROL_PLANE_NO \
  --iam-instance-profile Name=$AUTOSCALER_INSTANCE_PROFILE_NAME \
  --user-data "$USER_DATA" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=role,Value=autoscaler-controlplane-machine}]'
```

After the instances are launched, they will appear under Machines in the Omni dashboard. From there, you can assign them to a cluster.

<Note>We do not recommend horizontally autoscaling control plane machines. If your control plane needs more capacity, scale vertically instead.</Note>

## Step 3: Create Machine Classes

A Machine Class defines a pool of infrastructure that Omni can use when creating cluster nodes. In this step, you’ll create separate Machine Classes for the control plane and worker nodes.

### 3.1: Create the control plane Machine Class

To define a Machine Class for your control plane nodes:

1. Create the control plane machine class definition:

```bash theme={null}
cat <<EOF > controlplane-machine-class.yaml
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-controlplane
spec:
  matchlabels:
    - omni.sidero.dev/platform = aws  # Change the label to match your machine
EOF
```

This definition describes a Machine Class named `cluster-autoscaler-controlplane` that matches machines labeled `omni.sidero.dev/platform = aws`. The Machine Class is created in Omni when you apply the file.

<Note> If you are using custom labels, or prefer to create a Machine Class based on a different machine label, replace `omni.sidero.dev/platform = aws` with your preferred label. The label you specify must already exist on the machines you want this Machine Class to match. </Note>

In this example, the label corresponds to the default platform label automatically applied to machines created in AWS.

2. Apply the definition:

```bash theme={null}
omnictl apply -f controlplane-machine-class.yaml
```

3. Verify that it was created:

```bash theme={null}
omnictl get machineclasses
```

### 3.2: Create the worker Machine Class

Next, repeat the process for the worker nodes:

1. Create the worker machine class definition:

```bash theme={null}
cat <<EOF > worker-machine-class.yaml
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-worker
spec:
  matchlabels:
    - omni.sidero.dev/platform = aws # Change the label to match your machine
EOF
```

2. Apply the definition:

```bash theme={null}
omnictl apply -f worker-machine-class.yaml
```

3. Verify:

```bash theme={null}
omnictl get machineclasses
```

## Step 4: Create the cluster

Next, create a cluster that uses the Machine Classes you defined in Step 3.

To create a cluster:

1. Run this command to create a cluster template:

```bash theme={null}
cat <<EOF > cluster-template.yaml
kind: Cluster
name: $CLUSTER_NAME
kubernetes:
  version: v1.34.1
talos:
  version: ${TALOS_VERSION}

---
kind: ControlPlane
machineClass:
  name: cluster-autoscaler-controlplane
  size: 3

---
kind: Workers
machineClass:
  name: cluster-autoscaler-worker
  size: unlimited
EOF
```

2. Apply the template:

```bash theme={null}
omnictl cluster template sync -f cluster-template.yaml
```

3. Download the cluster's `kubeconfig` once the cluster becomes healthy:

```bash theme={null}
omnictl kubeconfig -c $CLUSTER_NAME
```

4. Monitor your cluster status from your Omni dashboard or by running:

```bash theme={null}
kubectl get nodes --watch
```
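`--watch` streams updates until you interrupt it, which is awkward in scripts. A minimal sketch of a non-interactive alternative based on `kubectl wait`; the wrapper name `wait_for_nodes` and the 10-minute default timeout are illustrative choices, not part of this guide:

```bash
# Hypothetical wrapper: block until every node reports Ready, or time out.
# Argument: a timeout understood by kubectl, for example 5m (default here: 10m).
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-10m}"
}

# Usage:
# wait_for_nodes 15m && echo "All nodes are Ready"
```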

## Step 5: Enable KubeSpan (required for hybrid or on-prem autoscaling)

If your autoscaled worker nodes are not launched in the same private AWS network as your control plane nodes (for example, in hybrid cloud or on-prem environments), you must enable KubeSpan.

KubeSpan creates an encrypted WireGuard mesh between cluster nodes. This allows nodes running in different networks to securely discover and communicate with each other.

To enable KubeSpan, add the following patch to the `Cluster` document section of your cluster template:

```yaml theme={null}
patches:
  - name: kubespan-enabled
    inline:
      machine:
        network:
          kubespan:
            enabled: true
      cluster:
        discovery:
          enabled: true
```

Your cluster template should now look similar to this:

```yaml theme={null}
kind: Cluster
name: $CLUSTER_NAME
kubernetes:
  version: v1.34.1
talos:
  version: ${TALOS_VERSION}
patches:
  - name: kubespan-enabled
    inline:
      machine:
        network:
          kubespan:
            enabled: true
      cluster:
        discovery:
          enabled: true

---
kind: ControlPlane
machineClass:
  name: cluster-autoscaler-controlplane
  size: 3

---
kind: Workers
machineClass:
  name: cluster-autoscaler-worker
  size: unlimited
```

Re-apply the template:

```bash theme={null}
omnictl cluster template sync -f cluster-template.yaml
```

## Step 6: Create Launch Template and Auto Scaling Group (workers)

Cluster Autoscaler scales worker machines by adjusting the size of an AWS Auto Scaling Group (ASG).

To enable this, you need to create:

* A Launch Template, which defines how worker nodes are configured and launched
* An Auto Scaling Group, which uses the Launch Template to create and terminate worker nodes
* Tags, which allow Cluster Autoscaler to automatically discover and manage the Auto Scaling Group

The commands in this section will use your Talos worker AMI and AWS networking configuration to create these resources.

### 6.1: Create Launch Template

The Launch Template defines which AMI and instance type your worker machines will use:

```bash theme={null}
LAUNCH_TEMPLATE_NAME="talos-ca-launch-template"
AUTO_SCALING_GROUP_NAME="talos-ca-asg"

aws ec2 create-launch-template \
  --launch-template-name $LAUNCH_TEMPLATE_NAME \
  --launch-template-data "{
    \"ImageId\":\"$AMI\",
    \"InstanceType\":\"$INSTANCE_TYPE\",
    \"IamInstanceProfile\": {
      \"Name\": \"$AUTOSCALER_INSTANCE_PROFILE_NAME\"
    },
    \"UserData\":\"$USER_DATA_B64\"
  }"
```

### 6.2: Create Auto Scaling Group

Run these commands to look up the control plane's VPC and subnets and create an Auto Scaling Group:

```bash theme={null}
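# All control plane instances share one VPC, so read it from the first live match
VPC_ID=$(aws ec2 describe-instances \
  --filters Name=tag:role,Values=autoscaler-controlplane-machine \
            Name=instance-state-name,Values=pending,running \
  --query "Reservations[0].Instances[0].VpcId" \
  --output text)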

SUBNET_IDS=$(aws ec2 describe-subnets \
  --filters Name=vpc-id,Values=$VPC_ID \
  --query 'Subnets[*].SubnetId' \
  --output text | tr '\t' ',')

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
  --launch-template LaunchTemplateName=$LAUNCH_TEMPLATE_NAME \
  --min-size 1 \
  --max-size 5 \
  --desired-capacity 1 \
  --vpc-zone-identifier "$SUBNET_IDS"
```
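`--vpc-zone-identifier` expects a comma-separated list, while `--output text` separates values with whitespace. The sketch below shows that conversion on made-up sample data; `to_csv` is a hypothetical helper that also tolerates newline-separated input:

```bash
# Hypothetical helper: turn whitespace-separated IDs on stdin into a comma-separated list.
to_csv() {
  tr -s '[:space:]' ',' | sed 's/^,//; s/,$//'
}

# Sample input only; these subnet IDs are made up.
printf 'subnet-aaa\tsubnet-bbb\tsubnet-ccc\n' | to_csv
# → subnet-aaa,subnet-bbb,subnet-ccc
```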

### 6.3: Tag the Auto Scaling Group for Cluster Autoscaler

These tags allow Cluster Autoscaler to discover and manage the node group:

```bash theme={null}
aws autoscaling create-or-update-tags \
  --tags \
  ResourceId=$AUTO_SCALING_GROUP_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
  ResourceId=$AUTO_SCALING_GROUP_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/$CLUSTER_NAME,Value=true,PropagateAtLaunch=true
```

### 6.4: Verify the Auto Scaling Group created a worker node

Once the Auto Scaling Group is created, it automatically launches one worker machine to match its desired capacity.

To confirm AWS created an instance:

```bash theme={null}
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $AUTO_SCALING_GROUP_NAME \
  --query 'AutoScalingGroups[0].Instances[*].InstanceId' \
  --output table
```

Then verify that the node joins your Kubernetes cluster:

```bash theme={null}
kubectl get nodes --watch
```

## Step 7: Install Cluster Autoscaler

Cluster Autoscaler runs as a Kubernetes Deployment inside your cluster. It continuously watches for pods that cannot be scheduled and adjusts your Auto Scaling Group capacity when additional nodes are required.

Run the following commands to install Cluster Autoscaler with Helm and configure it to automatically discover and manage your AWS Auto Scaling Groups:

```bash theme={null}
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  -n kube-system \
  --set cloudProvider=aws \
  --set awsRegion=$AWS_REGION \
  --set autoDiscovery.clusterName=$CLUSTER_NAME \
  --set rbac.create=true \
  --set nodeSelector."node-role\.kubernetes\.io/control-plane"="" \
  --set "tolerations[0].key=node-role.kubernetes.io/control-plane" \
  --set "tolerations[0].operator=Exists" \
  --set "tolerations[0].effect=NoSchedule"
```
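The escaped `--set` flags can be hard to read and edit. As a sketch of an alternative, the same settings can be written to a values file and passed with `-f`. The helper name `write_autoscaler_values` and the file name `values.yaml` are hypothetical; the keys mirror the flags above:

```bash
# Hypothetical helper: write a Helm values file equivalent to the --set flags above.
# Arguments: <output file> <aws region> <cluster name>
write_autoscaler_values() {
  cat > "$1" <<EOF
cloudProvider: aws
awsRegion: $2
autoDiscovery:
  clusterName: $3
rbac:
  create: true
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
EOF
}

# Usage:
# write_autoscaler_values values.yaml "$AWS_REGION" "$CLUSTER_NAME"
# helm install cluster-autoscaler autoscaler/cluster-autoscaler -n kube-system -f values.yaml
```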

## Step 8: Verify Cluster Autoscaler is working

Confirm that the Cluster Autoscaler pod is running:

```bash theme={null}
kubectl -n kube-system get pods \
  -l "app.kubernetes.io/instance=cluster-autoscaler"
```

## Step 9: Test automatic scaling

Deploy a workload that requires additional capacity:

```bash theme={null}
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaler-demo
spec:
  replicas: 0
  selector:
    matchLabels:
      app: autoscaler-demo
  template:
    metadata:
      labels:
        app: autoscaler-demo
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
EOF
```

Scale the deployment to trigger node provisioning:

```bash theme={null}
kubectl scale deployment autoscaler-demo --replicas=10
```

Watch scaling activity. Each command blocks until you interrupt it, so run them in separate terminals:

```bash theme={null}
kubectl get pods -w
kubectl get nodes -w
```

You should observe:

* Pods entering the `Pending` state
* Cluster Autoscaler increasing Auto Scaling Group capacity
* New worker nodes joining the cluster
* Pods transitioning to `Running`
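Whether every replica reaches `Running` depends on how many pods fit on a worker and on the Auto Scaling Group's maximum size. The following is a rough back-of-the-envelope sketch. A `t3.small` has 2 vCPUs and 2 GiB of memory, but the allocatable memory figure below is an assumption, and the variable names are illustrative:

```bash
# Rough capacity estimate; the allocatable figures are assumptions, not measured values.
NODE_CPU_M=2000     # t3.small has 2 vCPUs; ignores CPU requested by system pods
NODE_MEM_MI=1536    # assumed allocatable memory out of 2 GiB
POD_CPU_M=500       # cpu request from the demo Deployment
POD_MEM_MI=1024     # memory request from the demo Deployment
REPLICAS=10         # replica count used in the scale command
ASG_MAX=5           # --max-size from section 6.2

by_cpu=$((NODE_CPU_M / POD_CPU_M))      # pods that fit by CPU
by_mem=$((NODE_MEM_MI / POD_MEM_MI))    # pods that fit by memory
pods_per_node=$(( by_cpu < by_mem ? by_cpu : by_mem ))
nodes_needed=$(( (REPLICAS + pods_per_node - 1) / pods_per_node ))   # ceiling division
nodes_possible=$(( nodes_needed < ASG_MAX ? nodes_needed : ASG_MAX ))

echo "pods per node: $pods_per_node, nodes needed: $nodes_needed, ASG allows: $nodes_possible"
# → pods per node: 1, nodes needed: 10, ASG allows: 5
```

Under these assumptions, memory is the limiting resource and the maximum size of 5 would leave some replicas in `Pending`. Raising `--max-size`, lowering the requests, or choosing a larger instance type would change the result.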

## Cleanup

Clean up the resources created in this guide:

**Delete the test workload:**

```bash theme={null}
kubectl delete deployment autoscaler-demo
```

**Uninstall Cluster Autoscaler:**

```bash theme={null}
helm uninstall cluster-autoscaler -n kube-system
```

**Delete the Auto Scaling Group (workers):**

Set the minimum size and desired capacity to 0, then delete the ASG:

```bash theme={null}
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
  --min-size 0 \
  --desired-capacity 0

aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
  --force-delete
```

**Delete the Launch Template:**

```bash theme={null}
aws ec2 delete-launch-template \
  --launch-template-name $LAUNCH_TEMPLATE_NAME
```

**Delete the Omni cluster:**

```bash theme={null}
omnictl cluster delete $CLUSTER_NAME
```

Wait until the cluster and machines are removed from the Omni dashboard.

**Terminate control plane instances:**

```bash theme={null}
aws ec2 terminate-instances --instance-ids $(aws ec2 describe-instances \
  --filters Name=tag:role,Values=autoscaler-controlplane-machine \
  --query 'Reservations[*].Instances[*].InstanceId' \
  --output text)
```

**Delete Machine Classes:**

```bash theme={null}
omnictl delete machineclass cluster-autoscaler-controlplane
omnictl delete machineclass cluster-autoscaler-worker
```

**Detach the policy from the role:**

```bash theme={null}
aws iam detach-role-policy \
  --role-name $AUTOSCALER_ROLE_NAME \
  --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME
```

**Remove the role from the instance profile:**

```bash theme={null}
aws iam remove-role-from-instance-profile \
  --instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME \
  --role-name $AUTOSCALER_ROLE_NAME
```

**Delete the instance profile:**

```bash theme={null}
aws iam delete-instance-profile \
  --instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME
```

**Delete the IAM role:**

```bash theme={null}
aws iam delete-role \
  --role-name $AUTOSCALER_ROLE_NAME
```

**Delete the IAM policy:**

```bash theme={null}
aws iam delete-policy \
  --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME
```
