This guide shows you how to enable automatic scaling for your Talos Linux cluster on AWS using Cluster Autoscaler and Omni.
Prerequisites
Before you begin, you must have:
- AWS CLI configured
- kubectl, talosctl, omnictl, and helm installed
Step 1: Create IAM role for Cluster Autoscaler
Cluster Autoscaler uses the IAM role attached to the EC2 instances where it runs.
In this guide, Cluster Autoscaler runs on the control plane nodes, so the IAM role must be attached to the control plane machines once it's created.
To create the IAM role and attach it to your control plane machines, you need:
- An IAM policy that defines the permissions required by Cluster Autoscaler
- An IAM role that uses the policy
- An instance profile that allows EC2 instances to assume the IAM role
1.1: Define environment variables
First, define the variables used throughout the IAM setup:
CLUSTER_NAME=cluster-autoscaler-aws
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AUTOSCALER_ROLE_NAME="${CLUSTER_NAME}-autoscaler-role"
AUTOSCALER_POLICY_NAME="${CLUSTER_NAME}-ClusterAutoscalerPolicy"
AUTOSCALER_INSTANCE_PROFILE_NAME="${CLUSTER_NAME}-autoscaler-instance-profile"
1.2: Create IAM policy
Next, create an IAM policy that grants Cluster Autoscaler permission to:
- Adjust Auto Scaling Group capacity
- Discover tagged node groups
- Describe EC2 and ASG resources
The policy is scoped using AWS resource tags so it only manages Auto Scaling Groups associated with this cluster.
cat <<EOF | aws iam create-policy \
--policy-name $AUTOSCALER_POLICY_NAME \
--policy-document file:///dev/stdin
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/k8s.io/cluster-autoscaler/enabled": "true",
          "aws:ResourceTag/k8s.io/cluster-autoscaler/$CLUSTER_NAME": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstances",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeImages"
      ],
      "Resource": "*"
    }
  ]
}
EOF
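Optionally, you can sanity-check a policy document of this shape locally before sending it to AWS. This sketch (requires jq) renders a trimmed-down single-statement document with the same tag condition and asserts the scoping keys are present; it is illustrative, not the full policy:

```shell
# Optional local sanity check (requires jq): render a trimmed-down statement
# with the same tag condition and assert the scoping keys are present.
# The single-statement document below is illustrative, not the full policy.
CLUSTER_NAME=cluster-autoscaler-aws   # same value as in step 1.1

POLICY_JSON=$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["autoscaling:SetDesiredCapacity"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/k8s.io/cluster-autoscaler/enabled": "true",
          "aws:ResourceTag/k8s.io/cluster-autoscaler/$CLUSTER_NAME": "true"
        }
      }
    }
  ]
}
EOF
)

# jq -e exits non-zero on invalid JSON or a false expression.
echo "$POLICY_JSON" | jq -e \
  '.Statement[0].Condition.StringEquals["aws:ResourceTag/k8s.io/cluster-autoscaler/enabled"] == "true"' \
  > /dev/null && echo "policy JSON OK"
```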
1.3: Create IAM role and instance profile
First, create a trust policy that allows EC2 instances to assume the role:
cat <<EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
Create the IAM role using the trust policy:
aws iam create-role \
--role-name $AUTOSCALER_ROLE_NAME \
--assume-role-policy-document file://trust-policy.json
Attach the Cluster Autoscaler policy to the IAM role:
aws iam attach-role-policy \
--role-name $AUTOSCALER_ROLE_NAME \
--policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME
Now create an instance profile so the role can be associated with EC2 instances:
aws iam create-instance-profile \
--instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME
aws iam add-role-to-instance-profile \
--instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME \
--role-name $AUTOSCALER_ROLE_NAME
echo "Waiting for IAM instance profile propagation..."
sleep 20
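If the fixed sleep proves flaky, one option is to poll until the profile is visible. This is a hedged sketch: `aws iam get-instance-profile` is a standard CLI call, but the retry count and interval below are arbitrary choices.

```shell
# Hedged alternative to the fixed sleep: poll until the instance profile is
# visible to IAM. The retry count (12) and interval (5s) are arbitrary.
wait_for_profile() {
  profile=$1
  for _ in $(seq 1 12); do
    if aws iam get-instance-profile --instance-profile-name "$profile" >/dev/null 2>&1; then
      echo "instance profile ready"
      return 0
    fi
    sleep 5
  done
  echo "timed out waiting for instance profile" >&2
  return 1
}

# Usage, with the variable from step 1.1:
# wait_for_profile "$AUTOSCALER_INSTANCE_PROFILE_NAME"
```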
Step 2: Launch control plane
With IAM configured, you can now launch the control plane machines.
These control plane instances are not managed by an Auto Scaling Group. They are created manually and will run Cluster Autoscaler.
2.1: Define environment variables
Start by defining the AWS region, Talos version, architecture, instance type, and the number of control plane machines to create.
For high availability, we recommend creating three control plane machines.
AWS_REGION=$(aws configure get region)
TALOS_VERSION=v1.12.4
ARCH=amd64
INSTANCE_TYPE=t3.small
CONTROL_PLANE_NO=3
2.2: Retrieve the official Talos AMI
Fetch the Talos AWS AMI for your region and architecture from the official Talos release metadata.
If you need to customize your AMI (for example, to add custom labels or extensions), you must create your own AMI and bake those customizations into it. For more information, refer to the Register AWS Machines in Omni documentation.
AMI=$(curl -sL https://github.com/siderolabs/talos/releases/download/${TALOS_VERSION}/cloud-images.json \
| jq -r '.[] | select(.region == "'"$AWS_REGION"'") | select(.arch == "'"$ARCH"'") | .id')
echo "Using AMI: $AMI"
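Because a wrong region or architecture yields an empty result, a small guard can fail fast before any instances are launched. A sketch, where `check_ami` and the sample AMI ID are illustrative:

```shell
# Guard against an empty AMI lookup (wrong region, wrong arch, or a release
# with no image for them) before launching any instances.
# check_ami and the sample AMI ID below are illustrative.
check_ami() {
  if [ -z "$1" ] || [ "$1" = "null" ]; then
    echo "no Talos AMI found for this region/arch" >&2
    return 1
  fi
  echo "AMI OK: $1"
}

check_ami "ami-0123456789abcdef0"   # prints: AMI OK: ami-0123456789abcdef0
check_ami "" || echo "lookup failed, check AWS_REGION and ARCH"
```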
2.3: Generate control plane join configuration
Generate the join configuration that registers the Talos nodes with Omni on boot. Encode it for use as EC2 user data:
USER_DATA=$(omnictl jointoken machine-config)
USER_DATA_B64=$(echo "$USER_DATA" | base64 | tr -d '\n')
The newlines are stripped because the Launch Template created in Step 6 embeds this value in a JSON field, which must be a single-line base64 string (GNU base64 wraps long output by default).
2.4: Launch three control plane instances
Launch the control plane EC2 instances using:
- The Talos AMI
- The IAM instance profile created in Step 1
- The join configuration as user data
aws ec2 run-instances \
--region $AWS_REGION \
--image-id $AMI \
--instance-type $INSTANCE_TYPE \
--count $CONTROL_PLANE_NO \
--iam-instance-profile Name=$AUTOSCALER_INSTANCE_PROFILE_NAME \
--user-data "$USER_DATA" \
--tag-specifications 'ResourceType=instance,Tags=[{Key=role,Value=autoscaler-controlplane-machine}]'
After the instances are launched, they will appear under Machines in the Omni dashboard. From there, you can assign them to a cluster.
We do not recommend horizontally autoscaling control plane machines. If your control plane needs more capacity, scale vertically instead.
Step 3: Create Machine Classes
A Machine Class defines a pool of infrastructure that Omni can use when creating cluster nodes. In this step, you’ll create separate Machine Classes for the control plane and worker nodes.
3.1: Create the control plane Machine Class
To define a Machine Class for your control plane nodes:
- Create the control plane machine class definition:
cat <<EOF > controlplane-machine-class.yaml
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-controlplane
spec:
  matchlabels:
    - omni.sidero.dev/platform = aws # Change the label to match your machine
EOF
This command creates a Machine Class named cluster-autoscaler-controlplane that matches machines labeled omni.sidero.dev/platform = aws.
If you are using custom labels, or prefer to create a Machine Class based on a different machine label, replace omni.sidero.dev/platform = aws with your preferred label. The label you specify must already exist on the machines you want this Machine Class to match.
In this example, the label corresponds to the default platform label automatically applied to machines created in AWS.
- Apply the definition:
omnictl apply -f controlplane-machine-class.yaml
- Verify that it was created:
omnictl get machineclasses
3.2: Create the worker Machine Class
Next, repeat the process for the worker nodes:
- Create the worker machine class definition:
cat <<EOF > worker-machine-class.yaml
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-worker
spec:
  matchlabels:
    - omni.sidero.dev/platform = aws # Change the label to match your machine
EOF
- Apply the definition:
omnictl apply -f worker-machine-class.yaml
- Verify:
omnictl get machineclasses
Step 4: Create the cluster
Next, create a cluster that uses the Machine Classes you defined in Step 3.
To create a cluster:
- Run this command to create a cluster template:
cat <<EOF > cluster-template.yaml
kind: Cluster
name: $CLUSTER_NAME
kubernetes:
  version: v1.34.1
talos:
  version: ${TALOS_VERSION}
---
kind: ControlPlane
machineClass:
  name: cluster-autoscaler-controlplane
  size: 3
---
kind: Workers
machineClass:
  name: cluster-autoscaler-worker
  size: unlimited
EOF
- Apply the template:
omnictl cluster template sync -f cluster-template.yaml
- Download the cluster’s kubeconfig once the cluster becomes healthy:
omnictl kubeconfig -c $CLUSTER_NAME
- Monitor your cluster status from your Omni dashboard or by running:
kubectl get nodes --watch
Step 5: Enable KubeSpan (required for hybrid or on-prem autoscaling)
If your autoscaled worker nodes are not launched in the same private AWS network as your control plane nodes (for example, in hybrid cloud or on-prem environments), you must enable KubeSpan.
KubeSpan creates an encrypted WireGuard mesh between cluster nodes. This allows nodes running in different networks to securely discover and communicate with each other.
To enable KubeSpan, add the following patch to the Cluster document section of your cluster template:
patches:
  - name: kubespan-enabled
    inline:
      machine:
        network:
          kubespan:
            enabled: true
      cluster:
        discovery:
          enabled: true
Your cluster template should now look similar to this:
kind: Cluster
name: $CLUSTER_NAME
kubernetes:
  version: v1.34.1
talos:
  version: ${TALOS_VERSION}
patches:
  - name: kubespan-enabled
    inline:
      machine:
        network:
          kubespan:
            enabled: true
      cluster:
        discovery:
          enabled: true
---
kind: ControlPlane
machineClass:
  name: cluster-autoscaler-controlplane
  size: 3
---
kind: Workers
machineClass:
  name: cluster-autoscaler-worker
  size: unlimited
Re-apply the template:
omnictl cluster template sync -f cluster-template.yaml
Step 6: Create Launch Template and Auto Scaling Group (workers)
Cluster Autoscaler scales worker machines by adjusting the size of an AWS Auto Scaling Group (ASG).
To enable this, you need to create:
- A Launch Template, which defines how worker nodes are configured and launched
- An Auto Scaling Group, which uses the Launch Template to create and terminate worker nodes
- Tags, which allow Cluster Autoscaler to automatically discover and manage the Auto Scaling Group
The commands in this section will use your Talos worker AMI and AWS networking configuration to create these resources.
6.1: Create Launch Template
The Launch Template defines which AMI and instance type your worker machines will use:
LAUNCH_TEMPLATE_NAME="talos-ca-launch-template"
AUTO_SCALING_GROUP_NAME="talos-ca-asg"
aws ec2 create-launch-template \
--launch-template-name $LAUNCH_TEMPLATE_NAME \
--launch-template-data "{
  \"ImageId\":\"$AMI\",
  \"InstanceType\":\"$INSTANCE_TYPE\",
  \"IamInstanceProfile\": {
    \"Name\": \"$AUTOSCALER_INSTANCE_PROFILE_NAME\"
  },
  \"UserData\":\"$USER_DATA_B64\"
}"
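If the call fails with a JSON parse error, a common cause is an unescaped quote or a base64 UserData string that contains newlines. You can render the payload with stand-in values and validate it locally first (all values below are illustrative):

```shell
# Render the launch template payload with stand-in values and validate it
# with jq before calling AWS. All values below are illustrative.
AMI="ami-0123456789abcdef0"
INSTANCE_TYPE="t3.small"
AUTOSCALER_INSTANCE_PROFILE_NAME="example-instance-profile"
USER_DATA_B64=$(echo "test" | base64 | tr -d '\n')

LT_DATA="{
  \"ImageId\":\"$AMI\",
  \"InstanceType\":\"$INSTANCE_TYPE\",
  \"IamInstanceProfile\": { \"Name\": \"$AUTOSCALER_INSTANCE_PROFILE_NAME\" },
  \"UserData\":\"$USER_DATA_B64\"
}"

# jq -e fails on malformed JSON, e.g. an unescaped quote or wrapped base64.
echo "$LT_DATA" | jq -e '.ImageId and .UserData' > /dev/null && echo "launch template data OK"
```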
6.2: Create Auto Scaling Group
Run the following commands to look up your VPC and subnets, then create the Auto Scaling Group:
VPC_ID=$(aws ec2 describe-instances \
--filters Name=tag:role,Values=autoscaler-controlplane-machine \
--query "Reservations[*].Instances[*].VpcId" \
--output text)
SUBNET_IDS=$(aws ec2 describe-subnets \
--filters Name=vpc-id,Values=$VPC_ID \
--query 'Subnets[*].SubnetId' \
--output text | tr '\t' ',')
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
--launch-template LaunchTemplateName=$LAUNCH_TEMPLATE_NAME \
--min-size 1 \
--max-size 5 \
--desired-capacity 1 \
--vpc-zone-identifier "$SUBNET_IDS"
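Note the `tr '\t' ','` step: describe-subnets with `--output text` emits tab-separated IDs, while `--vpc-zone-identifier` expects a comma-separated list. A quick local illustration with made-up subnet IDs:

```shell
# The AWS CLI's text output is tab-separated; --vpc-zone-identifier wants a
# comma-separated list, so tr '\t' ',' converts between the two.
# Subnet IDs below are made up for illustration.
RAW_OUTPUT=$(printf 'subnet-aaa\tsubnet-bbb\tsubnet-ccc')
SUBNET_IDS=$(echo "$RAW_OUTPUT" | tr '\t' ',')
echo "$SUBNET_IDS"   # prints: subnet-aaa,subnet-bbb,subnet-ccc
```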
6.3: Tag the Auto Scaling Group for Cluster Autoscaler
These tags allow Cluster Autoscaler to discover and manage the node group:
aws autoscaling create-or-update-tags \
--tags \
ResourceId=$AUTO_SCALING_GROUP_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
ResourceId=$AUTO_SCALING_GROUP_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/$CLUSTER_NAME,Value=true,PropagateAtLaunch=true
6.4: Verify the Auto Scaling Group created a worker node
Once the Auto Scaling Group is created, it will automatically launch one worker machine to match its desired capacity.
To confirm AWS created an instance:
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names $AUTO_SCALING_GROUP_NAME \
--query 'AutoScalingGroups[0].Instances[*].InstanceId' \
--output table
Then verify that the node joins your Kubernetes cluster:
kubectl get nodes --watch
Step 7: Install Cluster Autoscaler
Cluster Autoscaler runs as a Kubernetes Deployment inside your cluster. It continuously monitors unscheduled pods and adjusts your Auto Scaling Group capacity when additional nodes are required.
Run this to install Cluster Autoscaler using Helm and configure it to automatically discover and manage your AWS Auto Scaling Groups.
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
-n kube-system \
--set cloudProvider=aws \
--set awsRegion=$AWS_REGION \
--set autoDiscovery.clusterName=$CLUSTER_NAME \
--set rbac.create=true \
--set nodeSelector."node-role\.kubernetes\.io/control-plane"="" \
--set "tolerations[0].key=node-role.kubernetes.io/control-plane" \
--set "tolerations[0].operator=Exists" \
--set "tolerations[0].effect=NoSchedule"
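If you prefer a values file over `--set` flags, the equivalent configuration looks roughly like this. This is a sketch: the file name and the sample region and cluster values are placeholders for your own.

```yaml
# values-autoscaler.yaml — mirrors the --set flags above (sample values)
cloudProvider: aws
awsRegion: us-east-1                      # substitute your $AWS_REGION
autoDiscovery:
  clusterName: cluster-autoscaler-aws     # substitute your $CLUSTER_NAME
rbac:
  create: true
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```

Install with `helm install cluster-autoscaler autoscaler/cluster-autoscaler -n kube-system -f values-autoscaler.yaml`.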
Step 8: Verify Cluster Autoscaler is working
Confirm that the Cluster Autoscaler pod is running:
kubectl -n kube-system get pods \
-l "app.kubernetes.io/instance=cluster-autoscaler"
Step 9: Test automatic scaling
Deploy a workload that requires additional capacity:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaler-demo
spec:
  replicas: 0
  selector:
    matchLabels:
      app: autoscaler-demo
  template:
    metadata:
      labels:
        app: autoscaler-demo
    spec:
      containers:
        - name: app
          image: nginx
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
EOF
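To see why scaling this deployment out forces new nodes: assuming the t3.small instance type from step 2.1 (2 vCPU, 2 GiB memory), a single worker can fit very few 500m CPU / 1Gi memory replicas once system overhead is subtracted, so most pods will sit Pending until the Auto Scaling Group grows. Back-of-the-envelope totals for ten replicas:

```shell
# Total requests for the scaled deployment (10 replicas of 500m CPU / 1Gi).
# Assumes t3.small workers (2 vCPU, 2 GiB), making memory the binding
# constraint per node.
REPLICAS=10
CPU_REQUEST_MILLI=500
MEM_REQUEST_MI=1024

echo "total CPU requested: $(( REPLICAS * CPU_REQUEST_MILLI ))m"    # 5000m
echo "total memory requested: $(( REPLICAS * MEM_REQUEST_MI ))Mi"   # 10240Mi
```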
Scale the deployment to trigger node provisioning:
kubectl scale deployment autoscaler-demo --replicas=10
Watch scaling activity:
kubectl get pods -w
kubectl get nodes -w
You should observe:
- Pods entering the Pending state
- Cluster Autoscaler increasing Auto Scaling Group capacity
- New worker nodes joining the cluster
- Pods transitioning to Running
Cleanup
Clean up the resources created in this guide:
Delete the test workload:
kubectl delete deployment autoscaler-demo
Uninstall Cluster Autoscaler:
helm uninstall cluster-autoscaler -n kube-system
Delete the Auto Scaling Group (workers):
Set desired capacity to 0 and delete the ASG:
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
--min-size 0 \
--desired-capacity 0
aws autoscaling delete-auto-scaling-group \
--auto-scaling-group-name $AUTO_SCALING_GROUP_NAME \
--force-delete
Delete the Launch Template:
aws ec2 delete-launch-template \
--launch-template-name $LAUNCH_TEMPLATE_NAME
Delete the Omni cluster:
omnictl cluster delete $CLUSTER_NAME
Wait until the cluster and machines are removed from the Omni dashboard.
Terminate control plane instances:
aws ec2 terminate-instances --instance-ids $(aws ec2 describe-instances \
--filters Name=tag:role,Values=autoscaler-controlplane-machine \
--query 'Reservations[*].Instances[*].InstanceId' \
--output text)
Delete Machine Classes:
omnictl delete machineclass cluster-autoscaler-controlplane
omnictl delete machineclass cluster-autoscaler-worker
Detach the policy from the role:
aws iam detach-role-policy \
--role-name $AUTOSCALER_ROLE_NAME \
--policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME
Remove the role from the instance profile:
aws iam remove-role-from-instance-profile \
--instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME \
--role-name $AUTOSCALER_ROLE_NAME
Delete the instance profile:
aws iam delete-instance-profile \
--instance-profile-name $AUTOSCALER_INSTANCE_PROFILE_NAME
Delete the IAM role:
aws iam delete-role \
--role-name $AUTOSCALER_ROLE_NAME
Delete the IAM policy:
aws iam delete-policy \
--policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$AUTOSCALER_POLICY_NAME