etcd stores all Kubernetes cluster state. If etcd becomes corrupted or unavailable, restoring from a backup is the primary recovery path. This guide walks through restoring an etcd snapshot to an Omni cluster managed by cluster templates. Before following this guide, ensure you have created etcd backups for your cluster; if you have not, see Create Etcd Backups.

Prerequisites

  • omnictl must be installed and configured.
  • talosctl must be installed and talosconfig must be configured for the cluster you want to restore. You can download talosconfig from the Omni UI or by running omnictl talosconfig -c <cluster-name>; see the sketch after this list.
  • The cluster must still exist in Omni; Omni cannot restore a cluster whose resource has already been deleted.
  • The cluster must be managed using cluster templates. If your cluster was created via the UI, see Export a Cluster Template to convert it first.
  • At least one etcd backup must be available for the cluster.
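For example, a minimal setup sketch, assuming omnictl talosconfig accepts an output path as its first argument and that you want to merge the credentials into your default talosctl configuration (verify the flags with omnictl talosconfig --help):

omnictl talosconfig -c <cluster-name> ./talosconfig   # download talosconfig for the cluster
talosctl config merge ./talosconfig                   # merge into ~/.talos/config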

Step 1: Find the cluster UUID

Start by exporting your cluster name as an environment variable. Replace <cluster-name> with the name of your cluster:
export CLUSTER_NAME=<cluster-name>
Then retrieve the cluster UUID:
omnictl get clusteruuid $CLUSTER_NAME
The output will look similar to this:
NAMESPACE   TYPE          ID              VERSION   UUID
default     ClusterUUID   my-cluster      1         bb874758-ee54-4d3b-bac3-4c8349737298
Note the value in the UUID column; you will need it in a later step.
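To avoid copy-paste errors later, you can capture the UUID in an environment variable. A minimal sketch, assuming omnictl get supports -o jsonpath output the way talosctl get does (verify with omnictl get --help):

export CLUSTER_UUID=$(omnictl get clusteruuid $CLUSTER_NAME -o jsonpath='{.spec.uuid}')
echo $CLUSTER_UUID   # should print the UUID shown in the table above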

Step 2: Find the snapshot to restore

List the available snapshots for the cluster:
omnictl get etcdbackup -l omni.sidero.dev/cluster=$CLUSTER_NAME
The output will look similar to this:
NAMESPACE   TYPE         ID                      VERSION     CREATED AT                         SNAPSHOT
external    EtcdBackup   my-cluster-1701184522   undefined   {"nanos":0,"seconds":1701184522}   FFFFFFFF9A99FBF6.snapshot
external    EtcdBackup   my-cluster-1701184515   undefined   {"nanos":0,"seconds":1701184515}   FFFFFFFF9A99FBFD.snapshot
external    EtcdBackup   my-cluster-1701184500   undefined   {"nanos":0,"seconds":1701184500}   FFFFFFFF9A99FC0C.snapshot
Use the CREATED AT timestamp to identify the snapshot you want to restore, then note its value in the SNAPSHOT column; you will need it in a later step.
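As with the UUID, exporting the chosen snapshot name keeps the remaining steps copy-paste friendly. The value below is taken from the sample output; replace it with your own:

export SNAPSHOT_NAME=FFFFFFFF9A99FBF6.snapshot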

Step 3: Delete the existing control plane

To restore etcd, the existing control plane must be deleted first. This puts the cluster into a non-bootstrapped state so that a new control plane can be created from the restored etcd snapshot. Run the following command to delete the control plane machine set:
omnictl delete machineset $CLUSTER_NAME-control-planes
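Deletion is not instantaneous. To confirm the control plane machine set is gone, list the cluster's machine sets; assuming machine sets carry the same omni.sidero.dev/cluster label used elsewhere in this guide, only the workers should remain:

omnictl get machineset -l omni.sidero.dev/cluster=$CLUSTER_NAME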

Step 4: Create the restore template

Retrieve your existing cluster template file. If you do not have it locally, you can export it by running:
omnictl cluster template export -c $CLUSTER_NAME > restore-template.yaml
Open the file and add a bootstrapSpec block to the ControlPlane section, substituting the cluster UUID from Step 1 and the snapshot name from Step 2:
kind: Cluster
name: <cluster-name>
kubernetes:
  version: v1.28.2
talos:
  version: v1.5.5
---
kind: ControlPlane
machines:
  - <controlplane-machine-uuid-1>
  - <controlplane-machine-uuid-2>
  - <controlplane-machine-uuid-3>
bootstrapSpec:
  clusterUUID: <cluster-uuid>       # UUID from Step 1
  snapshot: <snapshot-name>         # snapshot name from Step 2
---
kind: Workers
machines:
  - <worker-machine-uuid-1>
  - <worker-machine-uuid-2>
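With the sample values from Steps 1 and 2 filled in, the ControlPlane document would look like this:

kind: ControlPlane
machines:
  - <controlplane-machine-uuid-1>
  - <controlplane-machine-uuid-2>
  - <controlplane-machine-uuid-3>
bootstrapSpec:
  clusterUUID: bb874758-ee54-4d3b-bac3-4c8349737298
  snapshot: FFFFFFFF9A99FBF6.snapshot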

Step 5: Sync the restore template

Apply the template to trigger the restore:
omnictl cluster template sync -f restore-template.yaml
Monitor the status until the cluster is fully restored:
omnictl cluster template status -f restore-template.yaml
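The status command prints a point-in-time view; to poll it until the cluster reports healthy, you can wrap it in the standard watch utility (a shell convenience, not an omnictl feature):

watch -n 5 omnictl cluster template status -f restore-template.yaml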

Step 6: Restart kubelet on worker nodes

After the restore completes, kubelet must be restarted on all worker nodes so that they rejoin the restored control plane. First, retrieve the IDs of the worker nodes:
omnictl get clustermachine -l omni.sidero.dev/role-worker,omni.sidero.dev/cluster=$CLUSTER_NAME
The output will look similar to this:
NAMESPACE   TYPE             ID                                     VERSION
default     ClusterMachine   26b87860-38b4-400f-af72-bc8d26ab6cd6   3
default     ClusterMachine   2f6af2ad-bebb-42a5-b6b0-2b9397acafbc   3
default     ClusterMachine   5f93376a-95f6-496c-b4b7-630a0607ac7f   3
default     ClusterMachine   c863ccdf-cdb7-4519-878e-5484a1be119a   3
Restart kubelet on each worker node, substituting the IDs from your own output:
talosctl -n <worker-node-id-1> service kubelet restart
talosctl -n <worker-node-id-2> service kubelet restart
talosctl -n <worker-node-id-3> service kubelet restart
talosctl -n <worker-node-id-4> service kubelet restart
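With many workers, the lookup and the restarts can be combined into one pass. A sketch, assuming the node ID is the third whitespace-separated column as in the sample output above:

for node in $(omnictl get clustermachine \
    -l omni.sidero.dev/role-worker,omni.sidero.dev/cluster=$CLUSTER_NAME \
    | awk 'NR>1 {print $3}'); do
  talosctl -n "$node" service kubelet restart
done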

Step 7: Verify the restore

Confirm that all nodes are ready and the cluster is healthy:
kubectl get nodes
All nodes should show a status of Ready. If any nodes remain NotReady after a few minutes, check the kubelet status on the affected node:
talosctl -n <node-id> service kubelet
You can also verify etcd membership is healthy by running:
talosctl -n <controlplane-node-id> etcd members
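To wait for node readiness in a single command instead of re-running kubectl get nodes, kubectl's built-in wait works as well (the 5-minute timeout is an arbitrary choice):

kubectl wait --for=condition=Ready node --all --timeout=5m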