Kubernetes Backup and Disaster Recovery: Velero and etcd


If your Kubernetes backup strategy exists only on paper, you do not have a backup strategy. Production readiness means measurable recovery point objective (RPO) and recovery time objective (RTO) targets, plus restore drills that prove those targets can be met.

1. Define RPO and RTO per Service Tier

  • Tier 0 (critical): low RPO, low RTO
  • Tier 1 (important): moderate RPO/RTO
  • Tier 2 (non-critical): relaxed targets

Without service tiers, every workload defaults to the strictest target, which makes backup discussions vague and the resulting design expensive.
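
To make the tiers concrete, each one can be mapped to a numeric target that monitoring can check. The sketch below is a minimal shell illustration. The function names and the minute values (15, 240, 1440) are assumptions chosen for the example, not recommendations.

```shell
# Illustrative RPO targets in minutes per tier.
# The numbers are assumptions for this example only.
rpo_minutes_for_tier() {
  case "$1" in
    0) echo 15 ;;
    1) echo 240 ;;
    2) echo 1440 ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the newest backup is young enough for the tier.
# Usage: rpo_met <tier> <backup_age_minutes>
# Returns 2 if the tier is unknown.
rpo_met() {
  local target
  target=$(rpo_minutes_for_tier "$1") || return 2
  [ "$2" -le "$target" ]
}
```

With these example values, rpo_met 0 10 succeeds and rpo_met 0 20 fails, so the function can gate an alert or a dashboard status.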

2. What Must Be Backed Up

  • Cluster state (etcd snapshots)
  • Namespaced resources (Deployments, Services, ConfigMaps, Secrets)
  • Persistent volume data
  • GitOps repositories and CI/CD definitions
  • External dependencies configuration (DNS, certificates, IAM)

3. etcd Snapshot Strategy

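For self-managed control planes, etcd snapshots are mandatory. On managed services (EKS, GKE, AKS) the provider operates etcd and you do not have direct access to it.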

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Best practices:

  • frequent snapshots with retention policy
  • encrypted off-cluster storage
  • checksum verification
  • periodic restore test in isolated environment
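
The checksum and retention practices above can be sketched with a few shell helpers. This is a minimal illustration, assuming GNU coreutils (sha256sum) and the etcd-YYYYMMDD-HHMM.db file naming used in the snapshot command. The function names are hypothetical. A checksum only detects corruption after the snapshot was written; it does not replace a restore test.

```shell
# Write "<file>.sha256" next to a snapshot so later copies can be verified.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeeds only if the snapshot still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" > /dev/null )
}

# Delete all but the newest <keep> snapshots in <dir>.
# Relies on the etcd-YYYYMMDD-HHMM.db naming, which sorts chronologically.
# Usage: prune_snapshots <dir> <keep>
prune_snapshots() {
  ls -1 "$1"/etcd-*.db 2>/dev/null | sort -r | tail -n +"$(( $2 + 1 ))" |
    while IFS= read -r old; do
      rm -f -- "$old" "$old.sha256"
    done
}
```

A typical sequence would be to call write_checksum right after the snapshot is saved, verify_checksum after the file is copied off-cluster, and prune_snapshots /backup 14 at the end of the job (14 is an arbitrary example).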

4. Velero for Cluster Objects and Volumes

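Velero backs up Kubernetes resources and, through provider plugins (volume snapshots) or file-system backup, persistent volume data. Choose a plugin version that is compatible with your Velero release.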

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket your-backup-bucket \
  --backup-location-config region=your-region \
  --snapshot-location-config region=your-region \
  --secret-file ./credentials-velero

Example scheduled backup:

velero schedule create daily-prod \
  --schedule="0 2 * * *" \
  --include-namespaces production

5. Restore Playbooks

Document both partial and full-cluster recovery.

Partial restore use case:

  • restore a namespace after accidental deletion.

Full restore use case:

  • region outage or control plane corruption.

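Velero namespace restore example (backups created by a schedule are named <schedule>-<timestamp>; list them with velero backup get):

velero restore create --from-backup daily-prod-20260301020000 \
  --include-namespaces production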
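
For a playbook, the restore command can be wrapped so that a drill restores into a scratch namespace instead of overwriting the live one. The helper below is a sketch, not part of Velero: restore_namespace and the VELERO variable are hypothetical names, while --include-namespaces, --namespace-mappings, and --wait are existing velero restore create flags.

```shell
# VELERO can be overridden, for example VELERO=echo to print the command
# instead of running it.
VELERO="${VELERO:-velero}"

# Restore one namespace from a backup, optionally into a different namespace.
# Usage: restore_namespace <backup> <namespace> [target-namespace]
restore_namespace() {
  local backup="$1" ns="$2" target="${3:-}"
  # Build the argument list; the restore name gets a timestamp suffix.
  set -- restore create "${ns}-restore-$(date +%Y%m%d%H%M%S)" \
    --from-backup "$backup" --include-namespaces "$ns" --wait
  if [ -n "$target" ]; then
    set -- "$@" --namespace-mappings "${ns}:${target}"
  fi
  "$VELERO" "$@"
}
```

For example, restore_namespace daily-prod-20260301020000 production production-drill restores the production namespace into production-drill, which keeps a monthly drill away from live workloads.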

6. Run DR Drills Regularly

A recommended cadence:

  • monthly: namespace-level restore drill
  • quarterly: cluster-level DR simulation
  • after major platform changes: targeted restore validation

Capture outcomes:

  • actual RTO
  • data loss window (actual RPO)
  • blocked steps and manual interventions
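
Drill outcomes are easier to compare over time when they are recorded by a script rather than from memory. The following sketch times each step of a drill and derives the actual RPO from two timestamps. The function names, the DRILL_LOG variable, and the CSV layout are assumptions for illustration.

```shell
# Run a command and append "<label>,<seconds>,<exit status>" to the drill log.
# DRILL_LOG is a hypothetical variable; it defaults to drill-results.csv.
# Usage: timed_step <label> <command> [args...]
timed_step() {
  local label="$1" start end status=0
  shift
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  printf '%s,%s,%s\n' "$label" "$(( end - start ))" "$status" \
    >> "${DRILL_LOG:-drill-results.csv}"
  return "$status"
}

# Actual RPO in seconds: incident time minus time of the last usable backup.
# Both arguments are Unix timestamps.
# Usage: actual_rpo_seconds <last_backup_epoch> <incident_epoch>
actual_rpo_seconds() {
  echo "$(( $2 - $1 ))"
}
```

Wrapping each runbook step in timed_step gives per-step durations whose sum is the actual RTO, and steps with a non-zero status point at the blocked steps to follow up on.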

7. Common Failure Modes to Prepare For

  • deleted namespace or wrong kubectl apply
  • corrupted etcd data
  • cloud region outage
  • expired credentials for backup storage
  • restores that fail because resource API versions changed between Kubernetes releases

8. Operational Checklist

  • Backups encrypted in transit and at rest.
  • Backup jobs monitored with alerts.
  • Restore runbooks versioned in Git.
  • Ownership clearly assigned.
  • Restore drills produce evidence and action items.
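
Monitoring backup jobs should include a freshness check, because a job can fail silently (for example after credentials expire) while older backups still exist. A minimal sketch, assuming snapshots are files named etcd-*.db in a local or mounted directory and that find supports -mmin (GNU and BSD find do); has_fresh_backup is a hypothetical name.

```shell
# Succeeds if at least one snapshot in <dir> was modified within the
# last <max_age_minutes>; fails if none is fresh or the directory is missing.
# Usage: has_fresh_backup <dir> <max_age_minutes>
has_fresh_backup() {
  [ -n "$(find "$1" -name 'etcd-*.db' -mmin "-$2" -print 2>/dev/null | head -n 1)" ]
}
```

A cron job or monitoring probe could run has_fresh_backup /backup 60 and raise an alert on a non-zero exit status; the path and the 60 minutes are example values.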

Minimum Viable DR Strategy

  1. etcd snapshots + offsite retention
  2. Velero scheduled backups for production namespaces
  3. Monthly restore drills with measured RTO/RPO
  4. Written and tested runbooks for top failure scenarios

The difference between “we have backups” and “we can recover” is testing.