Prerequisites
Before you begin, you must have:
- kubectl and talosctl installed
- A cloud provider supported by Cluster Autoscaler and configured to create new worker nodes
Step 1: Download your Talos installation media
You’ll need two Talos images: one for your control plane machines and one for your worker machines. To create each image:
- Click Download Installation Media in the Omni dashboard.
- Choose the platform that matches your environment (for example, AWS AMI, Azure, or a generic image).
- Add a Machine User Label that identifies the machine’s role. For example, for a control plane machine: role:autoscaler-controlplane-machine.
- Click Download to generate the image.
- Repeat these steps for the worker image, using a worker label (for example, role:autoscaler-worker-machine).
Note: After downloading, some platforms require you to upload the generated file to create a bootable image, for example, an AMI in AWS, a Managed Image in Azure, or a Custom Image in GCP.
Step 2: Create control plane machines from your images
Use the control plane image to create your control plane machines. For high availability, we recommend creating at least three control plane machines. We do not recommend horizontally autoscaling control plane machines.
If your control plane needs more capacity, scale vertically by using larger machine types rather than adding more control plane nodes. Adding additional nodes increases the number of etcd replicas, which can slow down write performance.
Step 3: Create Machine Classes
Machine Classes group machines by labels and act as the pool description for clusters. You’ll create two Machine Classes:
- one for control plane machines
- one for worker machines
3.1: Control plane Machine Class
To create a machine class for your control plane machines:
- Create a file named controlplane-machine-class.yaml defining the cluster-autoscaler-controlplane machine class, matched to the label you added to your control plane image (e.g., role:autoscaler-controlplane-machine). A sketch follows this list.
- Apply it with omnictl apply.
- Verify it exists with omnictl get.
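The exact schema depends on your Omni version; the following is a minimal sketch assuming Omni’s MachineClasses.omni.sidero.dev resource type and the label from Step 1 (check the Omni reference documentation for the authoritative field names):

```yaml
# controlplane-machine-class.yaml -- a sketch, not authoritative.
# The matchlabels selector assumes the Machine User Label
# role:autoscaler-controlplane-machine set in Step 1.
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-controlplane
spec:
  matchlabels:
    - role = autoscaler-controlplane-machine
```

```bash
# Apply the machine class and confirm it was created
# (the resource type name may vary by Omni version):
omnictl apply -f controlplane-machine-class.yaml
omnictl get machineclasses
```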
3.2: Worker Machine Class
Repeat the process for the worker machines:
- Create a file named worker-machine-class.yaml, this time matching the worker label (see the sketch after this list).
- Apply it.
- Verify it.
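Again, a minimal sketch under the same assumptions as the control plane class; the class id cluster-autoscaler-workers is an illustrative choice:

```yaml
# worker-machine-class.yaml -- a sketch, not authoritative.
metadata:
  namespace: default
  type: MachineClasses.omni.sidero.dev
  id: cluster-autoscaler-workers
spec:
  matchlabels:
    - role = autoscaler-worker-machine
```

```bash
omnictl apply -f worker-machine-class.yaml
omnictl get machineclasses
```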
Step 4: Create a cluster using a template
Next, create a cluster that uses the Machine Classes you defined in Step 3.
- Create a file named cluster-template.yaml (a sketch follows this list), updating each placeholder:
  - <cluster-name> — your cluster name
  - <controlplane-class-name> — the Machine Class from Step 3.1
  - <controlplane-count> — number of control plane nodes (for example, 3)
  - <worker-class-name> — the Machine Class from Step 3.2
- Apply the template.
- In Omni, wait for the cluster to become healthy, then download and export its kubeconfig, replacing <cluster-name> with the name of your cluster.
- Verify that all control plane nodes registered successfully.
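A sketch of cluster-template.yaml, assuming Omni’s cluster template format (Cluster, ControlPlane, and Workers documents); the Kubernetes and Talos versions shown are placeholders:

```yaml
# cluster-template.yaml -- a sketch; adjust versions to your environment.
kind: Cluster
name: <cluster-name>
kubernetes:
  version: v1.30.0   # placeholder
talos:
  version: v1.7.0    # placeholder
---
kind: ControlPlane
machineClass:
  name: <controlplane-class-name>
  size: <controlplane-count>
---
kind: Workers
machineClass:
  name: <worker-class-name>
  size: unlimited   # let the node group, not Omni, bound the worker count
```

Then apply the template, export the kubeconfig, and check the nodes. The flags below assume current omnictl syntax; adjust the kubeconfig path to taste:

```bash
# Sync the template to Omni:
omnictl cluster template sync --file cluster-template.yaml

# Once the cluster is healthy, download and export its kubeconfig:
omnictl kubeconfig --cluster <cluster-name> ./kubeconfig
export KUBECONFIG=./kubeconfig

# Verify the control plane nodes registered:
kubectl get nodes
```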
Step 5: Create a node group that Cluster Autoscaler can scale
Cluster Autoscaler does not create nodes directly. Instead, it talks to a node group managed by your infrastructure provider and asks it to increase the number of worker nodes. A node group is whatever your platform uses to define scalable worker nodes:
- AWS - Auto Scaling Group (ASG)
- GCP - Managed Instance Group (MIG)
- Azure - VM Scale Set (VMSS)
- Cluster API - MachineDeployment
5.1: Create a worker node group using your Talos worker image
Using the worker image you downloaded in Step 1, create a node group with a minimum size (your baseline workers) and a maximum size (how far Cluster Autoscaler is allowed to scale). Note: Because the image was generated in Omni with your cluster’s join metadata and worker labels, any machine launched with this image (and therefore from this node group) will automatically join your cluster as a worker node.
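As one illustration, on AWS you might create an ASG from a launch template that references the Omni-generated worker AMI. The group and template names here are hypothetical; the tags match the autodiscovery tags the Cluster Autoscaler AWS provider looks for by default:

```bash
# Hypothetical AWS example: "talos-workers" and "talos-worker-template"
# are placeholder names; the launch template must reference the
# Omni-generated worker AMI from Step 1.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name talos-workers \
  --launch-template LaunchTemplateName=talos-worker-template \
  --min-size 1 \
  --max-size 10 \
  --vpc-zone-identifier "subnet-aaaa,subnet-bbbb" \
  --tags "Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
         "Key=k8s.io/cluster-autoscaler/<cluster-name>,Value=owned,PropagateAtLaunch=true"
```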
5.2: Verify nodes join before installing the Autoscaler
Before installing Cluster Autoscaler, manually scale your node group up by one instance to confirm that the new machine automatically registers as a worker node (see the commands after this list). If the new node does not join, verify that:
- You used the Omni-generated worker image
- The correct worker label was set during image generation
- The image was not modified after download
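For example, on AWS (reusing the hypothetical group name from 5.1):

```bash
# Bump the desired capacity by one (from 1 to 2 in this sketch):
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name talos-workers \
  --desired-capacity 2

# Watch for the new node to register with the cluster:
kubectl get nodes -w
```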
5.3: Ensure the node group exposes a scalable API
Cluster Autoscaler must be able to modify your node group. It needs permission to:
- Read the group’s current size
- Set the desired size
- List/describe instances
- Inspect group metadata (labels/tags)
- Clouds (AWS/GCP/Azure): Attach the appropriate IAM role or credentials to your nodes or to the Cluster Autoscaler service account (IRSA, workload identity, etc.).
- Other environments: Use a provider-specific controller that exposes a node-group interface supported by Cluster Autoscaler.
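On AWS, for example, the minimal IAM policy commonly used for the Cluster Autoscaler AWS provider looks roughly like this; treat it as a sketch and scope the resources down for production:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*"
    }
  ]
}
```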
Step 6: Install the Cluster Autoscaler via Helm
You’ll use Helm to add the Cluster Autoscaler repository and install it into your cluster.
6.1: Add the Cluster Autoscaler Helm repository
Add the repository to your local Helm configuration, then update your repo index:
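These are the standard commands for the upstream chart repository:

```bash
# Add the Kubernetes autoscaler chart repository and refresh the index:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
```

6.2: Install the Cluster Autoscaler chart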
Different infrastructure providers require different configuration values when installing the Cluster Autoscaler chart. These values tell the Cluster Autoscaler:
- Which node groups it should manage
- How to discover and scale those node groups
- What credentials or identity mechanism to use
- Any other provider-specific settings
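As an illustration, an AWS install using tag-based autodiscovery might look like this; the release name and region are placeholders, and you should consult the chart’s values reference for your provider:

```bash
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set cloudProvider=aws \
  --set awsRegion=us-west-2 \
  --set autoDiscovery.clusterName=<cluster-name>
```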
6.3: Allow Cluster Autoscaler to run on your control plane nodes
Control plane nodes are tainted by default, which prevents general workloads from running on them. Add a toleration so that the Cluster Autoscaler can run on the control plane:
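A sketch of the Helm values to add, assuming Talos’s default node-role.kubernetes.io/control-plane:NoSchedule taint; the cluster-autoscaler chart accepts a standard tolerations list:

```yaml
# values.yaml snippet -- a sketch.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
# Optionally also pin the autoscaler to the control plane:
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
```

```bash
# Apply the new values to the existing release:
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system --reuse-values -f values.yaml
```

Step 7: Verify Cluster Autoscaler is healthy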
Check that the Cluster Autoscaler deployment is running, then inspect its logs (see the commands after this list) to confirm that:
- Your cloud provider was initialized
- Node groups were discovered
- There are no recurring errors about credentials or API access
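For example (the deployment name depends on your Helm release name, so list deployments first):

```bash
# Find the autoscaler deployment and check that it is available:
kubectl -n kube-system get deployments | grep -i autoscaler

# Tail the logs and look for provider initialization, discovered
# node groups, and any credential or API errors:
kubectl -n kube-system logs deployment/<autoscaler-deployment> --tail=100
```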
Step 8: Deploy a workload that triggers scale-up
With the Cluster Autoscaler running, you can test it by deploying a workload that requires more resources than your current worker nodes can supply. Create a file named autoscaler-demo-deploy.yaml:
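A sketch of a deployment that over-requests CPU; the name, replica count, and requests are illustrative, so tune them until the total demand exceeds your current worker capacity:

```yaml
# autoscaler-demo-deploy.yaml -- a sketch; adjust replicas/requests
# so the total demand exceeds your current worker capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaler-demo
spec:
  replicas: 10
  selector:
    matchLabels:
      app: autoscaler-demo
  template:
    metadata:
      labels:
        app: autoscaler-demo
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 512Mi
```

```bash
kubectl apply -f autoscaler-demo-deploy.yaml
```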
After applying the manifest, you should observe:
- Some pods starting in Pending due to insufficient resources
- The Cluster Autoscaler detecting those Pending pods and requesting additional nodes
- New worker nodes joining the cluster, followed by the pods transitioning to Running
Cleanup
Delete the resources created in this guide (example commands follow this list):
- Remove the demo Deployment you used to trigger scale-up.
- Uninstall Cluster Autoscaler (optional).
- Scale down or delete your node group. The exact steps depend on your provider; an AWS example is shown below.
- (If desired) Remove control plane machines. Only do this if you’re cleaning up the entire cluster, not just stopping autoscaling.
- Remove provider-side IAM / identity permissions (if you created them).
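A sketch of the cleanup commands, reusing the hypothetical names from earlier steps:

```bash
# Remove the demo workload:
kubectl delete -f autoscaler-demo-deploy.yaml

# Uninstall Cluster Autoscaler (optional):
helm uninstall cluster-autoscaler --namespace kube-system

# AWS example: scale the node group to zero, or delete it entirely:
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name talos-workers \
  --min-size 0 --desired-capacity 0
aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name talos-workers --force-delete
```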