Kubernetes Backup and Disaster Recovery: Velero and etcd
If your Kubernetes backup strategy exists only on paper, you do not have a backup strategy. Production readiness means measurable RPO/RTO targets and restore drills that prove they can be met.
1. Define RPO and RTO per Service Tier
- Tier 0 (critical): low RPO, low RTO
- Tier 1 (important): moderate RPO/RTO
- Tier 2 (non-critical): relaxed targets
Without service tiers, all backup discussions become vague and expensive.
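One way to keep the tiers from staying abstract is to write the targets down in a form that scripts can read. The sketch below is only an illustration: the function names `tier_targets` and `meets_target` are hypothetical, and the minute values are placeholders rather than recommendations.

```shell
# Illustrative tier targets in minutes. These numbers are placeholders
# (assumptions), to be replaced with values agreed with service owners.
tier_targets() {
  case "$1" in
    0) echo "rpo_min=15 rto_min=60" ;;
    1) echo "rpo_min=240 rto_min=480" ;;
    2) echo "rpo_min=1440 rto_min=2880" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Succeeds when a measured value (in minutes) is within its target.
meets_target() {
  local measured="$1" target="$2"
  [ "$measured" -le "$target" ]
}
```

For example, `meets_target 30 60` succeeds while `meets_target 90 60` fails, which makes it easy to turn a drill result into a pass or fail.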
2. What Must Be Backed Up
- Cluster state (etcd snapshots)
- Namespaced resources (Deployments, Services, ConfigMaps, Secrets)
- Persistent volume data
- GitOps repositories and CI/CD definitions
- External dependencies configuration (DNS, certificates, IAM)
3. etcd Snapshot Strategy
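For self-managed control planes, etcd snapshots are mandatory. Run the command on a control-plane node; the certificate paths shown are the kubeadm defaults and may differ in your setup.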
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Best practices:
- snapshots frequent enough to meet your tightest RPO, with a defined retention policy
- encrypted storage outside the cluster
- checksum verification after every copy or upload
- periodic restore tests in an isolated environment
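The checksum and retention practices above can be sketched with standard tools. This is a minimal sketch, assuming GNU coreutils (`sha256sum`, `find`) and snapshot files named `etcd-*.db` as in the command above. The function names are hypothetical, and the number of days to keep is up to you.

```shell
# Write a SHA-256 checksum file next to a snapshot.
write_checksum() {
  local snapshot="$1"
  ( cd "$(dirname "$snapshot")" &&
    sha256sum "$(basename "$snapshot")" > "$(basename "$snapshot").sha256" )
}

# Verify a snapshot against its checksum file; non-zero exit on mismatch.
verify_checksum() {
  local snapshot="$1"
  ( cd "$(dirname "$snapshot")" &&
    sha256sum --check --quiet "$(basename "$snapshot").sha256" )
}

# Delete snapshots and checksum files last modified more than N days ago.
prune_snapshots() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f \
    \( -name 'etcd-*.db' -o -name 'etcd-*.db.sha256' \) \
    -mtime +"$days" -delete
}
```

A typical sequence would be `write_checksum` right after `etcdctl snapshot save`, `verify_checksum` after copying the file to its off-cluster destination, and `prune_snapshots /backup 7` on a schedule.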
4. Velero for Cluster Objects and Volumes
Velero covers Kubernetes resource backups and, with plugins, persistent volume backups.
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket your-backup-bucket \
--secret-file ./credentials-velero
Example scheduled backup (runs daily at 02:00):
velero schedule create daily-prod \
--schedule="0 2 * * *" \
--include-namespaces production
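A schedule without an explicit retention relies on Velero's default backup TTL. The sketch below wraps the same command in a small function and adds the `--ttl` flag. The function name `create_daily_schedule` is hypothetical, and the `720h0m0s` value in the usage note is only an example duration.

```shell
# Create a daily 02:00 schedule for one namespace with an explicit TTL.
# Arguments: schedule name, namespace, TTL as a duration (e.g. 720h0m0s).
create_daily_schedule() {
  local name="$1" namespace="$2" ttl="$3"
  velero schedule create "$name" \
    --schedule="0 2 * * *" \
    --include-namespaces "$namespace" \
    --ttl "$ttl"
}
```

Calling `create_daily_schedule daily-prod production 720h0m0s` would create the schedule, and `velero backup get` can then be used to confirm that backups appear and complete.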
5. Restore Playbooks
Document both partial and full-cluster recovery.
Partial restore use case:
- restore a namespace after accidental deletion.
Full restore use case:
- region outage or control plane corruption.
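For the control plane corruption case on a self-managed cluster, the etcd part of a full restore might look like the sketch below. It assumes etcd 3.5 or later, where the restore subcommand is provided by `etcdutl` (older releases used `etcdctl snapshot restore` with the same `--data-dir` flag). The function name and the guard checks are illustrative. Pointing etcd at the restored directory and restarting the control plane is an environment-specific step that is not shown.

```shell
# Restore an etcd snapshot into a new, not yet existing data directory.
# Arguments: snapshot file, target data directory.
restore_etcd_snapshot() {
  local snapshot="$1" data_dir="$2"
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Never restore on top of an existing path; use a fresh directory.
  if [ -e "$data_dir" ]; then
    echo "refusing to restore into existing path: $data_dir" >&2
    return 1
  fi
  etcdutl snapshot restore "$snapshot" --data-dir "$data_dir"
}
```

Restoring into a fresh directory keeps the damaged data available for later analysis and avoids mixing old and restored state.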
Velero namespace restore example:
velero restore create --from-backup daily-prod-2026-03-01 \
--include-namespaces production
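In a runbook it helps to name the restore explicitly, wait for it to finish, and print its result. The following is a minimal sketch of that flow. The function name `restore_namespace` and the `drill-` name prefix are hypothetical; `--wait` and `velero restore describe` are standard Velero CLI features.

```shell
# Restore one namespace from a backup, wait for completion, show the result.
# Arguments: backup name, namespace.
restore_namespace() {
  local backup="$1" namespace="$2"
  local restore_name="drill-${namespace}-$(date +%Y%m%d%H%M%S)"
  velero restore create "$restore_name" \
    --from-backup "$backup" \
    --include-namespaces "$namespace" \
    --wait || return 1
  # Print phase, warnings and errors so they can be attached to the drill record.
  velero restore describe "$restore_name"
}
```

For example, `restore_namespace daily-prod-2026-03-01 production` restores the production namespace from the backup used above.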
6. Run DR Drills Regularly
A recommended cadence:
- monthly: namespace-level restore drill
- quarterly: cluster-level DR simulation
- after major platform changes: targeted restore validation
Capture outcomes:
- actual RTO
- data loss window (actual RPO)
- blocked steps and manual interventions
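The two measured numbers are simple differences between timestamps: actual RPO is the gap between the last good backup and the incident, and actual RTO is the gap between the incident and restored service. The sketch below computes both. It assumes GNU `date` and ISO 8601 UTC timestamps, and the function names are hypothetical.

```shell
# Convert an ISO 8601 UTC timestamp to epoch seconds (GNU date syntax).
to_epoch() {
  date -u -d "$1" +%s
}

# Seconds between two timestamps: elapsed_seconds START END
elapsed_seconds() {
  echo $(( $(to_epoch "$2") - $(to_epoch "$1") ))
}

# Print one evidence line for a drill.
# Arguments: last good backup, incident start, service restored.
drill_report() {
  local backup_at="$1" incident_at="$2" restored_at="$3"
  local rpo rto
  rpo="$(elapsed_seconds "$backup_at" "$incident_at")"
  rto="$(elapsed_seconds "$incident_at" "$restored_at")"
  echo "actual_rpo_s=${rpo} actual_rto_s=${rto}"
}
```

With a backup at 02:00, an incident at 09:30 and recovery at 10:15 on the same day, `drill_report 2026-03-01T02:00:00Z 2026-03-01T09:30:00Z 2026-03-01T10:15:00Z` prints `actual_rpo_s=27000 actual_rto_s=2700`, that is 7.5 hours of potential data loss and a 45-minute recovery.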
7. Common Failure Modes to Prepare For
- deleted namespace or wrong kubectl apply
- corrupted etcd data
- cloud region outage
- expired credentials for backup storage
- restores that fail because resource API versions changed after a Kubernetes upgrade
8. Operational Checklist
- Backups encrypted in transit and at rest.
- Backup jobs monitored with alerts.
- Restore runbooks versioned in Git.
- Ownership clearly assigned.
- Restore drills produce evidence and action items.
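Monitoring backup jobs can start with something as small as a freshness check on the newest snapshot file. This is a sketch under assumptions: it relies on GNU `find`, the `etcd-*.db` naming used earlier, and hypothetical function names. The `ALERT` lines stand in for whatever alerting system you use.

```shell
# Print the age in seconds of the newest etcd-*.db file in a directory.
# Fails when no snapshot is found.
newest_snapshot_age() {
  local dir="$1" newest_mtime now
  newest_mtime="$(find "$dir" -maxdepth 1 -type f -name 'etcd-*.db' \
    -printf '%T@\n' | sort -n | tail -n 1)"
  [ -n "$newest_mtime" ] || return 1
  now="$(date +%s)"
  # %T@ includes a fractional part; strip it before doing integer math.
  echo $(( now - ${newest_mtime%.*} ))
}

# Exit 0 when fresh, 1 when the newest snapshot is too old, 2 when none exist.
check_snapshot_freshness() {
  local dir="$1" max_age="$2" age
  if ! age="$(newest_snapshot_age "$dir")"; then
    echo "ALERT: no snapshots found in $dir" >&2
    return 2
  fi
  if [ "$age" -gt "$max_age" ]; then
    echo "ALERT: newest snapshot is ${age}s old (limit ${max_age}s)" >&2
    return 1
  fi
  echo "OK: newest snapshot is ${age}s old"
}
```

Running `check_snapshot_freshness /backup 7200` from cron or a monitoring agent would also catch silent failures such as expired storage credentials, because the symptom in both cases is that no new snapshot arrives.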
Minimum Viable DR Strategy
- etcd snapshots + offsite retention
- Velero scheduled backups for production namespaces
- Monthly restore drills with measured RTO/RPO
- Written and tested runbooks for top failure scenarios
The difference between “we have backups” and “we can recover” is testing.