Kubernetes Backup and Disaster Recovery: Velero and etcd

If your Kubernetes backup strategy exists only on paper, you do not have a backup strategy. Production readiness means measurable RPO/RTO targets and restore drills that prove they can be met.

1. Define RPO and RTO per Service Tier

Tier 0 (critical): low RPO, low RTO
Tier 1 (important): moderate RPO/RTO
Tier 2 (non-critical): relaxed targets

Without service tiers, every workload ends up with the same targets: either the strictest and most expensive ones, or none that anyone can state.

2. What Must Be Backed Up

Cluster state (etcd snapshots)
Namespaced resources (Deployments, Services, ConfigMaps, Secrets)
Persistent volume data
GitOps repositories and CI/CD definitions
External dependency configuration (DNS, certificates, IAM)

3. etcd Snapshot Strategy

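For self-managed control planes, etcd snapshots are mandatory. Managed services such as EKS, GKE, and AKS do not expose etcd, so on those platforms resource-level backups (section 4) take their place.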

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Best practices:

frequent snapshots with retention policy
encrypted off-cluster storage
checksum verification
periodic restore tests in an isolated environment
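
The checksum and retention practices above can be sketched as small shell helpers. This is a minimal sketch rather than a hardened script: the function names are hypothetical, the etcd-*.db pattern is taken from the snapshot command above, and it assumes sha256sum and find are available on the backup host.

```shell
# Hypothetical helpers for etcd snapshot files; names are illustrative.

# Record a SHA-256 checksum next to the snapshot (writes <file>.sha256).
write_checksum() {
  local file="$1"
  # Work inside the snapshot's directory so the checksum file stores a
  # relative name and stays valid after both files are copied elsewhere.
  ( cd "$(dirname "$file")" && \
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# Succeed only if the snapshot still matches its recorded checksum.
verify_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" && \
    sha256sum -c "$(basename "$file").sha256" > /dev/null 2>&1 )
}

# Delete snapshots and checksum files in <dir> older than <days> days
# (find rounds file age down to whole days).
prune_old_snapshots() {
  local dir="$1"
  local days="$2"
  find "$dir" -maxdepth 1 -type f -name 'etcd-*.db*' -mtime "+$days" -delete
}
```

One way to use these: call write_checksum right after each snapshot, then verify_checksum after copying the pair off-cluster and again before any restore. A matching checksum only shows the file has not changed since it was written, so the restore test is still needed.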

4. Velero for Cluster Objects and Volumes

Velero backs up Kubernetes API resources to object storage and, through provider plugins or its file system backup, persistent volume data.

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket your-backup-bucket \
  --secret-file ./credentials-velero

Example scheduled backup:

velero schedule create daily-prod \
  --schedule="0 2 * * *" \
  --include-namespaces production
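
To connect the service tiers from section 1 to concrete schedules, a sketch like the following could generate one Velero schedule per tier. The tier names, cron expressions, TTL values, and namespace names are illustrative assumptions, not recommendations. The function only prints commands so they can be reviewed before anyone runs them.

```shell
# Print a `velero schedule create` command for one service tier.
# Tier names, cron expressions, and TTLs are hypothetical examples.
schedule_cmd_for_tier() {
  local tier="$1"
  local namespaces="$2"   # comma-separated namespace list
  local cron
  local ttl
  case "$tier" in
    tier0) cron="0 * * * *";   ttl="72h"  ;;  # hourly, kept 3 days
    tier1) cron="0 */6 * * *"; ttl="168h" ;;  # every 6 hours, kept 7 days
    tier2) cron="0 2 * * *";   ttl="720h" ;;  # daily at 02:00, kept 30 days
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  printf 'velero schedule create %s-backup --schedule="%s" --include-namespaces %s --ttl %s\n' \
    "$tier" "$cron" "$namespaces" "$ttl"
}

# Example: print the commands for three hypothetical namespace groups.
schedule_cmd_for_tier tier0 payments
schedule_cmd_for_tier tier1 production
schedule_cmd_for_tier tier2 internal-tools
```

The backup interval bounds the achievable RPO for resources covered by the schedule: an hourly schedule cannot deliver an RPO below roughly one hour on its own.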

5. Restore Playbooks

Document both partial and full-cluster recovery.

Partial restore use case:

restore a namespace after accidental deletion.

Full restore use case:

region outage or control plane corruption.
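
For the control plane corruption case on a self-managed cluster, the etcd side of a full restore might begin with a pre-flight check like this sketch. The function name and paths are hypothetical. The restore command itself appears only as a comment, because the surrounding procedure (stopping the API server, pointing etcd at the new data directory) depends on the etcd version and on how the cluster was installed.

```shell
# Pre-flight check before restoring an etcd snapshot into a new data directory.
# Succeeds only if the snapshot exists and is non-empty, and the target
# directory is absent or empty, so a live data directory is never reused.
preflight_etcd_restore() {
  local snapshot="$1"
  local data_dir="$2"
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  if [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir")" ]; then
    echo "target data dir is not empty: $data_dir" >&2
    return 1
  fi
  return 0
}

# Example use (hypothetical paths; etcd 3.5+ ships etcdutl for offline restores):
#   preflight_etcd_restore /backup/etcd-20260301-0200.db /var/lib/etcd-restore \
#     && etcdutl snapshot restore /backup/etcd-20260301-0200.db \
#          --data-dir /var/lib/etcd-restore
```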

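Velero namespace restore example (backups created by a schedule are named <schedule>-<timestamp>; list the available ones with velero backup get):

velero restore create --from-backup daily-prod-20260301020000 \
  --include-namespaces production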
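
For a drill rather than a real incident, it can help to pick the newest backup of a schedule and restore it into a scratch namespace instead of over the live one. The sketch below assumes backup names end in a timestamp that sorts lexicographically; the function names and the -drill suffix are hypothetical. As with any generated command, it prints the restore command instead of running it.

```shell
# Pick the newest backup of a schedule from a list of names on stdin
# (one name per line). Relies on the timestamp suffix sorting as text.
latest_backup() {
  local schedule="$1"
  grep "^${schedule}-" | sort | tail -n 1
}

# Print a restore command that maps the source namespace to <ns>-drill,
# leaving the original namespace untouched.
drill_restore_cmd() {
  local backup="$1"
  local ns="$2"
  printf 'velero restore create --from-backup %s --include-namespaces %s --namespace-mappings %s:%s-drill --wait\n' \
    "$backup" "$ns" "$ns" "$ns"
}
```

Backup names could be fed in from, for example, kubectl get backups.velero.io -n velero -o custom-columns=NAME:.metadata.name --no-headers, assuming Velero runs in the velero namespace.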

6. Run DR Drills Regularly

A recommended cadence:

monthly: namespace-level restore drill
quarterly: cluster-level DR simulation
after major platform changes: targeted restore validation

Capture outcomes:

actual RTO
data loss window (actual RPO)
blocked steps and manual interventions
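
The first two outcomes are simple arithmetic once timestamps are recorded. A minimal sketch, with hypothetical function names and all times in Unix epoch seconds:

```shell
# Actual RPO: time between the last good backup and the incident.
actual_rpo_seconds() {
  local last_backup_epoch="$1"
  local incident_epoch="$2"
  echo $(( incident_epoch - last_backup_epoch ))
}

# Run a command, print how many seconds it took, and pass on its exit status.
# Wrapping the restore step gives the measured RTO for that step.
measure_rto_seconds() {
  local start
  local end
  local status=0
  start=$(date +%s)
  "$@" > /dev/null 2>&1 || status=$?
  end=$(date +%s)
  echo $(( end - start ))
  return "$status"
}

# Succeed if a measured value is within its target (both in seconds).
within_target() {
  [ "$1" -le "$2" ]
}
```

In a drill this might look like measure_rto_seconds velero restore create --from-backup <backup> --wait, with the result recorded next to the tier's target. A completed restore is not the same as a healthy service, so the clock should arguably keep running until the application passes its own checks.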

7. Common Failure Modes to Prepare For

accidentally deleted namespace, or a kubectl apply run against the wrong cluster or with the wrong manifest
corrupted etcd data
cloud region outage
expired credentials for backup storage
restore that fails because resource API versions changed between the backed-up and the target Kubernetes version

8. Operational Checklist

Backups encrypted in transit and at rest.
Backup jobs monitored with alerts.
Restore runbooks versioned in Git.
Ownership clearly assigned.
Restore drills produce evidence and action items.
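
For the monitoring item, one simple signal is the age of the last successful backup, which also catches silent failures such as expired storage credentials. The following is a sketch under stated assumptions: the function names and exit code are hypothetical, and all times are Unix epoch seconds.

```shell
# Succeed if the last successful backup is older than the allowed age.
# Arguments: last completion time, current time, maximum age (seconds).
backup_is_stale() {
  local last_epoch="$1"
  local now_epoch="$2"
  local max_age="$3"
  [ $(( now_epoch - last_epoch )) -gt "$max_age" ]
}

# Print a one-line status for a cron job or monitoring probe.
# Returns 2 when the backup is stale, 0 otherwise.
check_backup_age() {
  if backup_is_stale "$1" "$2" "$3"; then
    echo "CRITICAL: last backup is $(( $2 - $1 ))s old (limit ${3}s)"
    return 2
  fi
  echo "OK: last backup is $(( $2 - $1 ))s old"
}
```

The completion time could come from the Velero Backup resource (status.completionTimestamp) or from the modification time of the newest etcd snapshot file, converted to epoch seconds before the check.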

Minimum Viable DR Strategy

etcd snapshots + offsite retention
Velero scheduled backups for production namespaces
Monthly restore drills with measured RTO/RPO
Written and tested runbooks for top failure scenarios

The difference between “we have backups” and “we can recover” is testing.