[24/24] E is for Etcd: Understanding the Brain of Kubernetes

This is Post #24 in the Kubernetes A-to-Z Series

Reading Order: Previous: Best Practices | Series Complete!

Series Progress: 24/24 complete | Difficulty: Advanced | Time: 25 min | Part 6/6: Security & Production

We’ve covered the API Server, Controller Manager, and Scheduler in our architecture post. But where does all that state actually live? Enter etcd.

In this post, we’ll explore the “E” of Kubernetes: Etcd.

What is Etcd?

Etcd is a strongly consistent, distributed key-value store. It is the “source of truth” for Kubernetes. Every pod, service, and secret you create is ultimately stored in etcd.

Why not SQL?

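Relational databases can be strongly consistent too, so the choice is less about SQL itself and more about what Kubernetes needs from its store. It needs a system that prioritizes consistency and partition tolerance (CP in the CAP theorem), with replication and leader election built in, plus a watch API so controllers can react to changes instead of polling. When you ask the API server “how many pods are running?”, you need the exact, most up-to-date answer, not “maybe 3, maybe 4”.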

How Etcd Works

Key-Value Store

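Etcd v3 stores data in a flat keyspace, but Kubernetes writes every object under a path-like key, so the data reads like a directory tree. Keys start with the /registry prefix, followed by the resource type, the namespace, and the object name: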

/registry/pods/default/nginx-pod
/registry/services/specs/default/my-service
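
To make the key layout concrete, here is a minimal Python sketch that builds and splits keys of the shape shown above. The helper names `registry_key` and `split_key` are hypothetical, and the layout is simplified: it covers only namespaced objects, and the actual resource path for each type is decided by the API server (services, for example, live under `services/specs`).

```python
def registry_key(resource: str, namespace: str, name: str, prefix: str = "/registry") -> str:
    """Build a path-like etcd key for a namespaced object (illustrative only)."""
    parts = [prefix.rstrip("/"), resource.strip("/"), namespace, name]
    return "/".join(parts)


def split_key(key: str) -> tuple[str, str, str]:
    """Split a key back into (resource, namespace, name)."""
    # The last two segments are the namespace and the object name. Everything
    # between the prefix and those two is the resource path, which may itself
    # contain a slash (as in "services/specs").
    segments = key.strip("/").split("/")
    return "/".join(segments[1:-2]), segments[-2], segments[-1]


print(registry_key("pods", "default", "nginx-pod"))
print(split_key("/registry/services/specs/default/my-service"))
```

Because the keyspace is flat, "listing a directory" is really a range query over every key that shares a prefix.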

Consensus (Raft)

Etcd uses the Raft consensus algorithm to ensure that all nodes in the cluster agree on the state.

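Leader: One node is elected leader. All writes go through the leader; a follower that receives a write forwards it to the leader.
Followers: Replicate the leader's log and apply entries once they are committed.
Quorum: A write is only considered “committed” once a majority of nodes, floor(N/2) + 1, have acknowledged it.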
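
The commit rule can be sketched in a few lines of Python. This is a toy model, not etcd's Raft implementation: the `ToyCluster` class and its fields are hypothetical, and it ignores terms, log indexes, and elections. It shows only that the leader counts itself plus every reachable follower, and commits once that count reaches a majority.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of majority commit; not a real Raft implementation."""
    members: list[str]
    leader: str
    down: set[str] = field(default_factory=set)
    committed: dict[str, str] = field(default_factory=dict)

    def quorum(self) -> int:
        # Majority of the configured members, whether reachable or not.
        return len(self.members) // 2 + 1

    def write(self, key: str, value: str) -> bool:
        if self.leader in self.down:
            return False  # no leader, no writes (a real cluster would hold an election)
        # The leader's own log entry counts as one acknowledgement.
        acks = 1 + sum(1 for m in self.members if m != self.leader and m not in self.down)
        if acks >= self.quorum():
            self.committed[key] = value
            return True
        return False


cluster = ToyCluster(members=["a", "b", "c"], leader="a")
print(cluster.write("/registry/pods/default/nginx-pod", "v1"))  # all three members up
cluster.down.add("b")
print(cluster.write("/registry/pods/default/nginx-pod", "v2"))  # 2 of 3 is still a majority
cluster.down.add("c")
print(cluster.write("/registry/pods/default/nginx-pod", "v3"))  # 1 of 3 cannot commit
```

The last write is rejected rather than accepted on a minority, which is the CP trade-off in practice: the cluster prefers refusing writes to risking divergent state.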

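Fun Fact: This is why etcd clusters are usually sized in odd numbers (3, 5, 7). A 3-node cluster can survive 1 failure, and a 5-node cluster can survive 2. Adding a fourth node to a 3-node cluster raises the quorum to 3 but still tolerates only 1 failure, so an even-sized cluster adds cost without adding resilience.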
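
The arithmetic behind these numbers is small enough to check directly. The sketch below (function names are illustrative) computes the quorum size and the number of failures a cluster of a given size can tolerate.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1


def fault_tolerance(n: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return n - quorum(n)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Each even size tolerates exactly as many failures as the odd size below it.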

Interacting with Etcd

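You typically interact with etcd indirectly: kubectl talks to the API server, and the API server reads and writes etcd on your behalf. For debugging or backups, you can also talk to etcd directly with etcdctl. On a TLS-secured cluster, these commands need the same --endpoints, --cacert, --cert, and --key flags used in the snapshot command in the Backup and Restore section.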

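# Get a value (Kubernetes stores most objects as protobuf, so expect partly binary output)
etcdctl get /registry/pods/default/nginx-pod

# List only the keys under a prefix
etcdctl get /registry/pods --prefix --keys-only

# Watch for changes to any key under a prefix
etcdctl watch /registry/events --prefix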
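
To show what a prefix read and a prefix watch mean conceptually, here is a small in-memory Python model. It is not an etcd client: `ToyStore` and its methods are hypothetical names, and the only etcd behavior it borrows is that every write bumps a store-wide revision and that watchers are notified of writes under their prefix.

```python
from typing import Callable

# A watcher receives (key, value, revision) for each matching write.
Watcher = Callable[[str, str, int], None]


class ToyStore:
    """In-memory stand-in for a key-value store with prefix watches."""

    def __init__(self) -> None:
        self.revision = 0
        self.data: dict[str, str] = {}
        self.watchers: list[tuple[str, Watcher]] = []

    def watch_prefix(self, prefix: str, callback: Watcher) -> None:
        self.watchers.append((prefix, callback))

    def put(self, key: str, value: str) -> int:
        self.revision += 1  # one store-wide counter, bumped on every write
        self.data[key] = value
        for prefix, callback in self.watchers:
            if key.startswith(prefix):
                callback(key, value, self.revision)
        return self.revision

    def get_prefix(self, prefix: str) -> dict[str, str]:
        return {k: v for k, v in sorted(self.data.items()) if k.startswith(prefix)}


events: list[tuple[str, str, int]] = []
store = ToyStore()
store.watch_prefix("/registry/events", lambda k, v, rev: events.append((k, v, rev)))
store.put("/registry/pods/default/nginx-pod", "Running")
store.put("/registry/events/default/nginx-pod.1", "Scheduled")
print(events)
```

Only the second write reaches the watcher, because the first key does not start with the watched prefix.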

Backup and Restore

Because etcd holds all your cluster state, backing it up is critical. If you lose etcd, you lose your cluster.

Snapshot Backup

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /tmp/etcd-backup.db
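
For scheduled backups, it helps to give every snapshot a timestamped name so earlier ones are not overwritten. The Python sketch below only assembles the same command as above with such a name; the `snapshot_command` helper and the `/var/backups/etcd` directory are assumptions for illustration, and the certificate paths are the kubeadm-style ones already used in this post.

```python
import shlex
from datetime import datetime, timezone

PKI_DIR = "/etc/kubernetes/pki/etcd"  # same certificate location as the command above


def snapshot_command(backup_dir: str, now: datetime,
                     endpoint: str = "https://127.0.0.1:2379") -> list[str]:
    """Return the etcdctl argv for a snapshot with a timestamped file name."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = f"{backup_dir.rstrip('/')}/etcd-{stamp}.db"
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={PKI_DIR}/ca.crt",
        f"--cert={PKI_DIR}/server.crt",
        f"--key={PKI_DIR}/server.key",
        "snapshot", "save", target,
    ]


# /var/backups/etcd is a hypothetical destination; use whatever your nodes provide.
cmd = snapshot_command("/var/backups/etcd", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(shlex.join(cmd))
# To actually take the snapshot on a control plane node:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the helper easy to test and avoids shell quoting problems.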

Restore

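Restoring is risky: it replaces the entire cluster state with the contents of the snapshot, so anything created after the backup is lost. It usually requires stopping the API server (and etcd itself) first. The restore command writes a fresh data directory, and etcd must then be reconfigured to start from that directory. In a multi-member cluster, every member should be restored from the same snapshot. On etcd 3.5 and later, etcdctl snapshot restore is deprecated in favor of etcdutl snapshot restore.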

ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir /var/lib/etcd-new

High Availability

For production, you should run etcd as a cluster of 3 or 5 nodes.

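Stacked Control Plane: Etcd runs on the same nodes as the API server. This is easier to set up and manage, but losing a node takes out an etcd member and a control plane instance together.
External Etcd: Etcd runs on dedicated nodes. This gives better performance and isolation, at the cost of more machines to operate.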

Performance Tuning

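Etcd is sensitive to disk latency because every write must be flushed to its write-ahead log on disk before it can be committed.

SSD is effectively mandatory: Do not run etcd on spinning disks, and avoid sharing its disk with other I/O-heavy workloads.
Network latency: Keep members close together (same region, low round-trip time) so replication and heartbeats are not delayed.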
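
A rough way to get a feel for disk sync latency is to time small synced writes on the disk that will hold the etcd data directory. The Python sketch below is only an approximation under stated assumptions: it uses `os.fsync` rather than the `fdatasync` call etcd relies on, the helper names are hypothetical, and a dedicated tool such as fio gives more trustworthy numbers. As a reference point, the etcd documentation suggests keeping the 99th percentile of WAL sync duration under about 10 ms.

```python
import os
import tempfile
import time


def fsync_latencies(directory: str, samples: int = 50,
                    payload: bytes = b"x" * 2048) -> list[float]:
    """Write small chunks to a temp file in `directory` and time each fsync (ms)."""
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            f.write(payload)
            f.flush()  # hand Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(f.fileno())  # ask the OS to persist the data to disk
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def percentile(values: list[float], pct: float) -> float:
    """Simple rank-based percentile; good enough for a quick look."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


# The system temp directory is used here only so the sketch runs anywhere.
# Point it at the disk that backs the etcd data directory for a meaningful result.
results = fsync_latencies(tempfile.gettempdir())
print(f"p50={percentile(results, 50):.2f} ms  p99={percentile(results, 99):.2f} ms")
```

If the temp directory is memory-backed, the numbers will look unrealistically good, which is why the target directory matters more than the script itself.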

Summary

Etcd is the unsung hero of Kubernetes. It’s simple in concept but critical in function.

Consistency: Every client sees the same, agreed-upon cluster state.
Resilience: It survives the loss of a minority of nodes via Raft.
Criticality: Back it up regularly!

Next Steps

Now that you understand the brain of the cluster, you have completed the series!

Series Complete!

Congratulations on completing the Kubernetes A-to-Z Series! You now have comprehensive knowledge covering:

Foundation: Kubernetes Architecture, Containers, Pods
Workloads: Deployments, ReplicaSets, Jobs, CronJobs
Networking: Services, Ingress, Network Policies
Storage: Volumes, ConfigMaps, Secrets
Operations: Helm, Operators, Logging, Monitoring, Troubleshooting, Upgrades
Security: RBAC, Authentication, Best Practices
Advanced: Federation, Zero-Downtime Deployments, GitOps, Etcd

Complete Series: Kubernetes A-to-Z Series Overview