[24/24] E is for Etcd: Understanding the Brain of Kubernetes
This is Post #24 in the Kubernetes A-to-Z Series
Reading Order: Previous: Best Practices | Series Complete!
Series Progress: 24/24 complete | Difficulty: Advanced | Time: 25 min | Part 6/6: Security & Production
We’ve covered the API Server, Controller Manager, and Scheduler in our architecture post. But where does all that state actually live? Enter etcd.
In this post, we’ll explore the “E” of Kubernetes: Etcd.
What is Etcd?
Etcd is a strongly consistent, distributed key-value store. It is the “source of truth” for Kubernetes. Every pod, service, and secret you create is ultimately stored in etcd.
Why not SQL?
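A relational database could hold the same data, but Kubernetes does not need joins or ad hoc queries. It needs a small, replicated store that prioritizes consistency and partition tolerance (CP in the CAP theorem) and lets clients watch keys for changes. When you ask the API server “how many pods are running?”, you need the exact, most up-to-date answer, not “maybe 3, maybe 4”.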
How Etcd Works
Key-Value Store
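Data is stored under path-like keys. etcd v3 keeps a flat keyspace, but Kubernetes names its keys like directories under /registry, so related objects can be listed by prefix: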
/registry/pods/default/nginx-pod
/registry/services/specs/default/my-service
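To make the key layout concrete, here is a minimal sketch that assembles keys of the same shape as the ones above. The `registry_key` helper is hypothetical (it is not part of etcdctl or kubectl), and it assumes namespaced resources; cluster-scoped objects have no namespace segment in their keys.

```shell
# registry_key RESOURCE NAMESPACE NAME
# Hypothetical helper: prints the etcd key for a namespaced object.
registry_key() {
  printf '/registry/%s/%s/%s\n' "$1" "$2" "$3"
}

# Rebuild the two example keys from the text.
registry_key pods default nginx-pod
registry_key services/specs default my-service
```

Because every key for a resource type and namespace shares a prefix, a prefix query such as `/registry/pods/default/` is enough to list all pods in that namespace.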
Consensus (Raft)
Etcd uses the Raft consensus algorithm to ensure that all nodes in the cluster agree on the state.
- Leader: One node is elected leader. All writes go through the leader; a follower that receives a write forwards it to the leader.
- Followers: Replicate the data from the leader.
- Quorum: A write is only considered “committed” once a majority of nodes, floor(N/2) + 1, have acknowledged it.
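Fun Fact: This is why etcd clusters are run with an odd number of nodes (3, 5, 7). A 3-node cluster can survive 1 failure and a 5-node cluster can survive 2, but a 4-node cluster still survives only 1, so an even-sized cluster adds cost without adding fault tolerance.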
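The quorum arithmetic is easy to check by hand. The sketch below is plain shell arithmetic, not an etcd command, and the function names `quorum` and `tolerated_failures` are made up for illustration.

```shell
# quorum N: smallest majority of an N-member cluster, floor(N/2) + 1.
# Shell arithmetic uses integer division, so $1 / 2 already rounds down.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: members that can fail while a majority remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the table for cluster sizes 1 through 7.
for n in 1 2 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Sizes 3 and 4 both tolerate a single failure, and sizes 5 and 6 both tolerate two, which is the reason for preferring odd cluster sizes.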
Interacting with Etcd
You typically interact with etcd via kubectl (which talks to the API server). But you can talk to it directly using etcdctl for debugging or backups.
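# Get a value (on a TLS-secured cluster, also pass the --endpoints,
# --cacert, --cert, and --key flags shown under Snapshot Backup)
etcdctl get /registry/pods/default/nginx-pod
# Watch for changes
etcdctl watch /registry/events --prefix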
Backup and Restore
Because etcd holds all your cluster state, backing it up is critical. If you lose etcd, you lose your cluster.
Snapshot Backup
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/etcd-backup.db
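A fixed filename like /tmp/etcd-backup.db is overwritten on every run. As a rough sketch, the same command can be wrapped so that each snapshot gets a timestamped name. The `BACKUP_DIR` location and the function names `backup_path` and `take_snapshot` are assumptions for illustration; the endpoint and certificate paths are the ones used above.

```shell
# Hypothetical backup directory; override by exporting BACKUP_DIR.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"

# backup_path: print a timestamped snapshot filename under BACKUP_DIR.
backup_path() {
  printf '%s/etcd-%s.db\n' "$BACKUP_DIR" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# take_snapshot: save a snapshot and print its path on success.
take_snapshot() {
  target="$(backup_path)"
  mkdir -p "$BACKUP_DIR" || return 1
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$target" || return 1
  # Do not report success for a missing or empty file.
  [ -s "$target" ] || return 1
  echo "$target"
}
```

Calling `take_snapshot` from a scheduled job on a control plane node would then leave one file per run. Copying the files off the node is still needed, since a backup stored only on the machine it protects is lost along with that machine.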
Restore
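Restoring is dangerous: it replaces the entire cluster state with the contents of the snapshot, so anything created after the backup is lost. It usually requires stopping the API server and etcd, restoring into a fresh data directory, and then pointing etcd at that directory. Note that etcd 3.5 deprecates etcdctl snapshot restore in favor of the equivalent etcdutl snapshot restore.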
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir /var/lib/etcd-new
High Availability
For production, you should run etcd as a cluster of 3 or 5 nodes.
- Stacked Control Plane: Etcd runs on the same nodes as the API server (easier to manage).
- External Etcd: Etcd runs on dedicated nodes (better performance and isolation).
Performance Tuning
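Etcd is sensitive to disk latency because every write is synced to disk before it is acknowledged.
- Use SSDs: Do not run a production etcd on spinning disks.
- Network latency: Keep nodes in the same region/zone if possible to minimize replication time between the leader and followers.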
Summary
Etcd is the unsung hero of Kubernetes. It’s simple in concept but critical in function.
- Consistency: It guarantees the cluster state is accurate.
- Resilience: It survives node failures via Raft.
- Criticality: Back it up regularly!
Next Steps
That covers the brain of the cluster, and with it the series.
Series Complete!
Congratulations on completing the Kubernetes A-to-Z Series! You now have comprehensive knowledge covering:
- Foundation: Kubernetes Architecture, Containers, Pods
- Workloads: Deployments, ReplicaSets, Jobs, CronJobs
- Networking: Services, Ingress, Network Policies
- Storage: Volumes, ConfigMaps, Secrets
- Operations: Helm, Operators, Logging, Monitoring, Troubleshooting, Upgrades
- Security: RBAC, Authentication, Best Practices
- Advanced: Federation, Zero-Downtime Deployments, GitOps, Etcd