The etcd database backs Kubernetes control plane state, so etcd health is critical for Kubernetes availability.
Note: Commands from the `talosctl etcd` namespace work only on Talos control plane nodes. Each time you see `<IPx>` on this page, it refers to the IP address of a control plane node.
Space Quota
The etcd database space quota is set to 2 GiB by default.
If the database size exceeds the quota, etcd raises a NOSPACE alarm and stops accepting writes until the issue is resolved.
This condition can be checked with the `talosctl etcd alarm list` command:
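For example, to list active alarms on a control plane node:

```bash
# List active etcd alarms; a NOSPACE entry means the space quota was exceeded.
talosctl -n <IP1> etcd alarm list
```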
The space quota can be increased to match actual usage via the etcd section in the machine configuration:
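A minimal sketch of such a change, assuming the quota is raised through etcd's `quota-backend-bytes` flag via `cluster.etcd.extraArgs` (the 4 GiB value is illustrative):

```yaml
cluster:
  etcd:
    extraArgs:
      # Raise the space quota from the 2 GiB default to 4 GiB (value in bytes).
      quota-backend-bytes: 4294967296
```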
Once the new configuration is applied, use `talosctl etcd alarm disarm` to clear the NOSPACE alarm.
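For example:

```bash
# Clear the NOSPACE alarm once the quota issue has been addressed.
talosctl -n <IP1> etcd alarm disarm
```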
Defragmentation
The etcd database can become fragmented over time under write- and delete-heavy workloads.
The Kubernetes API server performs automatic compaction of the etcd database, which marks deleted space as free and ready to be reused.
However, the space is not actually freed until the database is defragmented.
If the database is heavily fragmented (the in use/db size ratio is less than 0.5), defragmentation might improve performance.
If the database runs over the space quota (see above) but the actual in use size is small, defragmentation is required to bring the on-disk database size below the limit.
The current database size can be checked with the `talosctl etcd status` command:
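A sketch of the check; the exact output columns may vary by Talos version:

```bash
# Reports per-member DB SIZE, IN USE size, leader, and any alarms in ERRORS.
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```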
If the database is over the space quota, the NOSPACE alarm is reported in the ERRORS column.
To defragment the database, run the `talosctl etcd defrag` command:
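For example, defragmenting one member:

```bash
# Run against one node at a time; the member blocks reads and writes
# while it rebuilds its state.
talosctl -n <IP1> etcd defrag
```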
Note: Defragmentation is a resource-intensive operation, so it is recommended to run it on a single node at a time. Defragmenting a live member blocks reads and writes while it rebuilds its state.

Once the defragmentation is complete, the database size will closely match the in use size:
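For example, re-checking the sizes afterwards:

```bash
# DB SIZE should now closely match the IN USE size.
talosctl -n <IP1> etcd status
```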
Snapshotting
Regular backups of the etcd database should be performed to ensure that the cluster can be restored in case of a failure.
This procedure is described in the disaster recovery guide.
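As a sketch, a consistent snapshot can be taken with the following command (the `db.snapshot` output path is illustrative):

```bash
# Stream a consistent snapshot of the etcd database to a local file.
talosctl -n <IP1> etcd snapshot db.snapshot
```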
Downgrade v3.6 to v3.5
Before beginning, check etcd health and download a snapshot, as described in the disaster recovery guide.
Should something go wrong with the downgrade, this backup can be used to roll back to the existing etcd version.
This example shows how to downgrade etcd in a Talos cluster.
Step 1: Check Downgrade Requirements
Is the cluster healthy and running v3.6.x?
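A sketch of the check, assuming recent talosctl versions report member versions in the status output:

```bash
# Every member should be healthy and report version 3.6.x.
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```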
Step 2: Download Snapshot
Download the snapshot backup to provide a downgrade path should any problems occur (see the Snapshotting section above).
Step 3: Validate Downgrade
Validate the downgrade target version before enabling the downgrade; see the sketch after this list.
- Only downgrading one minor version at a time is supported, e.g. downgrading from v3.6 to v3.4 isn't allowed.
- Do not move on to the next step until the validation is successful.
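A sketch of the validation, assuming the `talosctl etcd downgrade validate` subcommand available in recent Talos releases:

```bash
# Check that a downgrade from v3.6 to v3.5 is feasible for this cluster.
talosctl -n <IP1> etcd downgrade validate 3.5
```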
Step 4: Enable Downgrade
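A sketch of enabling the downgrade, assuming the same `talosctl etcd downgrade` command family:

```bash
# Enable the cluster-wide downgrade to the v3.5 storage schema.
talosctl -n <IP1> etcd downgrade enable 3.5
```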
etcd will automatically migrate the schema to the downgrade target version, which usually happens very quickly.
Confirm that the storage version of all servers has been migrated to v3.5 by checking the endpoint status before moving on to the next step.
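A sketch of the check, assuming the status output includes a storage version column in recent Talos releases:

```bash
# The storage version should report 3.5 on every member before proceeding.
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```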
Note: Once the downgrade is enabled, the cluster will keep operating with the v3.5 protocol even if all the servers are still running the v3.6 binary, unless the downgrade is canceled with `talosctl -n <IP1> etcd downgrade cancel`.
Step 5: Patch Machine Config
Before patching the node, check whether it is the etcd leader. We recommend downgrading the leader last. If the server to be downgraded is the leader, you can avoid some downtime by transferring leadership to another server with `talosctl etcd forfeit-leadership` before stopping this server.
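For example:

```bash
# Transfer etcd leadership away from this node before downgrading it.
talosctl -n <IP1> etcd forfeit-leadership
```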
Patch the machine configuration to use the v3.5 etcd image:
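A minimal sketch of such a patch, assuming the default `gcr.io/etcd-development/etcd` image repository; replace the tag with the desired v3.5.x release:

```yaml
cluster:
  etcd:
    # Pin etcd to a v3.5 image (the tag is a placeholder).
    image: gcr.io/etcd-development/etcd:v3.5.x
```

It could be applied with something like `talosctl -n <IP1> patch mc --patch @etcd-image.yaml` (the patch file name is illustrative).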
After the patch is applied, confirm that the node is running the downgraded etcd version.
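For example:

```bash
# The patched member should now report a 3.5.x version.
talosctl -n <IP1> etcd status
```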
You can also follow the service logs of etcd:
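A sketch, assuming the standard Talos service logs command:

```bash
# Follow the etcd service logs while the member restarts on the v3.5 image.
talosctl -n <IP1> logs etcd -f
```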
Step 6: Continue on the Remaining Control Plane Nodes
Repeat Step 5 on the remaining control plane nodes. When all members are downgraded, check the health and status of the cluster, and confirm that the minor version of all members is v3.5 and the storage version is empty:
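A final check might look like this:

```bash
# All members should report version 3.5.x and an empty storage version.
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```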