Procedure for snapshotting etcd database and recovering from catastrophic control plane failure.
The etcd database backs Kubernetes control plane state, so if the etcd service is unavailable, the Kubernetes control plane goes down, and the cluster is not recoverable until etcd is recovered.
etcd is built on the Raft consensus protocol, so highly available control plane clusters can tolerate the loss of nodes as long as more than half of the members are running and reachable.
For a Talos cluster with three control plane nodes, this means that the cluster tolerates the failure of any single node, but losing more than one node at the same time breaks quorum and leads to complete loss of service.
Because of that, it is important to take routine backups of etcd state so that a snapshot is available to recover the cluster from in case of catastrophic failure.
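The majority rule described above reduces to simple integer arithmetic. The following is a minimal shell sketch; the function names `quorum` and `fault_tolerance` are illustrative and not part of Talos or etcd.

```shell
# quorum N: smallest number of members that forms a majority of N.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many members can fail while a majority remains.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for common control plane sizes.
for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For three members this prints a quorum of 2 and one tolerated failure, which matches the three-node case discussed above.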
Snapshotting the etcd Database

Create a snapshot of the etcd database with the talosctl etcd snapshot command:
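A minimal sketch of the snapshot command wrapped in a shell function. The function name `take_etcd_snapshot` and its default file name are assumptions made for illustration; `talosctl etcd snapshot` is the command named in the text.

```shell
# One-off form, with <IP> standing for a healthy control plane node:
#   talosctl -n <IP> etcd snapshot db.snapshot

# take_etcd_snapshot NODE_IP [FILE]: save an etcd snapshot to a local file.
take_etcd_snapshot() {
  node_ip="$1"
  snapshot_file="${2:-db.snapshot}"   # any file name works
  talosctl -n "$node_ip" etcd snapshot "$snapshot_file"
}
```

Called as `take_etcd_snapshot 172.20.0.2`, where the address is a placeholder, this writes `db.snapshot` to the current directory.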
Note: the file name db.snapshot is arbitrary.
This database snapshot can be taken on any healthy control plane node (its IP address is passed as <IP> to talosctl -n), as all etcd instances contain exactly the same data.
It is recommended to configure etcd snapshots to be created on a regular schedule to allow point-in-time recovery using the latest snapshot.
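Scheduled snapshots could be sketched as a small function run from cron or a systemd timer. Everything here except the `talosctl etcd snapshot` call is an assumption: the function name `backup_etcd`, the timestamped file naming, and the retention count are illustrative choices.

```shell
# backup_etcd NODE_IP BACKUP_DIR [KEEP]
# Take a timestamped snapshot and keep only the newest KEEP files (default 7).
backup_etcd() {
  node_ip="$1"
  backup_dir="$2"
  keep="${3:-7}"
  mkdir -p "$backup_dir" || return 1
  snapshot="$backup_dir/etcd-$(date -u +%Y%m%dT%H%M%SZ).snapshot"

  # Prune old snapshots only after a new one was written successfully.
  talosctl -n "$node_ip" etcd snapshot "$snapshot" || return 1

  # Timestamped names sort chronologically, so a reverse sort lists newest first;
  # everything past the first $keep entries is removed.
  ls -1 "$backup_dir"/etcd-*.snapshot | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do rm -f -- "$old"; done
}
```

Pruning after the snapshot succeeds means a failing cluster never causes the last good snapshots to be deleted.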
If the etcd cluster is not healthy (for example, if quorum has already been lost), the talosctl etcd snapshot command might fail.
In that case, copy the database snapshot directly from the control plane node:
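A minimal sketch of the direct copy, assuming the etcd data directory is at its usual location `/var/lib/etcd` so that the database file is `/var/lib/etcd/member/snap/db`. The function name `copy_etcd_db` is illustrative.

```shell
# copy_etcd_db NODE_IP [DEST_DIR]: copy the raw etcd database file off a node.
copy_etcd_db() {
  node_ip="$1"
  dest_dir="${2:-.}"   # defaults to the current directory
  talosctl -n "$node_ip" cp /var/lib/etcd/member/snap/db "$dest_dir"
}
```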
A database file copied this way might not be fully consistent (if the etcd process is running), but it allows for disaster recovery when the latest regular snapshot is not available.
Before starting disaster recovery, make sure that the etcd cluster can't be recovered:
- Get the etcd cluster member list on all healthy control plane nodes with the talosctl -n IP etcd members command and compare it across all members.
- Check etcd health across control plane nodes with talosctl -n IP service etcd.

Get hold of the latest etcd database snapshot.
If a snapshot is not fresh enough, create a database snapshot (see above), even if the etcd cluster is unhealthy.
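The member list and health checks described above can be run across several nodes in one pass. This is a sketch under the assumption that the node addresses are passed as arguments; the function name `check_etcd` is illustrative.

```shell
# check_etcd NODE_IP...: print the etcd member list and service status per node.
check_etcd() {
  for node_ip in "$@"; do
    echo "=== $node_ip: etcd members ==="
    talosctl -n "$node_ip" etcd members || echo "member list unavailable on $node_ip"
    echo "=== $node_ip: etcd service ==="
    talosctl -n "$node_ip" service etcd || echo "service status unavailable on $node_ip"
  done
}
```

The loop keeps going when a node does not respond, so the member lists of the remaining nodes can still be compared.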
Make sure that no control plane node has machine type init:
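One way to inspect machine types is to query the machine type resource on every control plane node. The sketch below assumes the resource is available as `machinetype` in `talosctl get`; the function name `list_machine_types` is illustrative.

```shell
# list_machine_types NODE_IP...: show the machine type each node reports.
list_machine_types() {
  # Join the arguments with commas; the IFS change stays inside the subshell.
  nodes=$(IFS=,; echo "$*")
  talosctl -n "$nodes" get machinetype
}
```

Any node that reports `init` rather than `controlplane` needs attention before recovery continues.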
Nodes with machine type init are not compatible with the etcd recovery procedure.
An init node can be converted to the controlplane type with the talosctl edit mc --mode=staged command, followed by a node reboot with the talosctl reboot command.
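The two-step conversion can be sketched as follows. The function name `convert_init_node` is illustrative, and the edit itself is interactive: the sketch assumes the machine type is set under `machine.type` in the machine configuration.

```shell
# convert_init_node NODE_IP: stage a machine config edit, then reboot to apply it.
convert_init_node() {
  node_ip="$1"
  # Opens the machine config in an editor; change machine.type from init to controlplane.
  talosctl -n "$node_ip" edit mc --mode=staged || return 1
  # A staged change takes effect on the next boot.
  talosctl -n "$node_ip" reboot
}
```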
If a control plane node is up but etcd isn't, wipe the node's EPHEMERAL partition to remove the etcd data directory (make sure a database snapshot is taken before doing this):
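A sketch of the wipe using `talosctl reset` restricted to the EPHEMERAL partition. The function name `wipe_etcd_data` is illustrative. The operation is destructive, so it should run only after a database snapshot has been secured.

```shell
# wipe_etcd_data NODE_IP: wipe only the EPHEMERAL partition, then reboot the node.
wipe_etcd_data() {
  node_ip="$1"
  # --graceful=false skips the attempt to leave etcd cleanly, which cannot
  # succeed when etcd is not running.
  talosctl -n "$node_ip" reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
}
```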
At this point, the etcd service on every control plane node should be in the Preparing state.
If the node addresses changed, the Kubernetes control plane endpoint should be pointed to the new control plane nodes.
Make sure all etcd service instances are in the Preparing state:
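The state check can be scripted. The sketch below assumes that `talosctl service etcd` prints a line whose first field is `STATE` followed by the state name; that output layout and the function names `etcd_state` and `all_preparing` are assumptions.

```shell
# etcd_state NODE_IP: print the STATE field of the etcd service on one node.
etcd_state() {
  talosctl -n "$1" service etcd | awk '$1 == "STATE" { print $2; exit }'
}

# all_preparing NODE_IP...: succeed only if every node reports Preparing.
all_preparing() {
  for node_ip in "$@"; do
    state=$(etcd_state "$node_ip")
    if [ "$state" != "Preparing" ]; then
      echo "$node_ip: etcd state is '${state:-unknown}', expected Preparing" >&2
      return 1
    fi
  done
}
```

A non-zero exit status from `all_preparing` indicates that at least one node is not yet ready for the bootstrap step.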
Run the bootstrap command against one control plane node, passing the path to the etcd database snapshot:
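A minimal sketch of the restore step, assuming the snapshot path is given to `talosctl bootstrap` through its `--recover-from` flag. The function name `recover_etcd` and its optional third argument `raw` are illustrative.

```shell
# recover_etcd NODE_IP SNAPSHOT [raw]
# Bootstrap etcd on one node from a snapshot file.
recover_etcd() {
  node_ip="$1"
  snapshot="$2"
  if [ "${3:-}" = "raw" ]; then
    # The file was copied out of the data directory with talosctl cp,
    # so skip the integrity hash check on restore.
    talosctl -n "$node_ip" bootstrap --recover-from="$snapshot" --recover-skip-hash-check
  else
    talosctl -n "$node_ip" bootstrap --recover-from="$snapshot"
  fi
}
```

Bootstrap is run against a single node only, for example `recover_etcd 172.20.0.2 ./db.snapshot` with a placeholder address.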
Note: if the database snapshot was copied out directly from the etcd data directory using talosctl cp, add the flag --recover-skip-hash-check to skip the integrity check on restore.
The Talos node should print matching information in the kernel log.
The etcd service should now become healthy on the bootstrap node, the Kubernetes control plane components should start, and the control plane endpoint should become available.
The remaining control plane nodes join the etcd cluster once the control plane endpoint is up.
Regular snapshots of the etcd database are even more important in the single control plane node case, as loss of the control plane node might render the whole cluster irrecoverable without a backup.