Stateful Workloads on Kubernetes: PostgreSQL and Kafka Operators


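Stateful workloads on Kubernetes are viable in production when you design around failure domains, storage guarantees, and operational automation. Operators make this manageable by encoding operational knowledge, such as failover, backup, and upgrade procedures, into controllers that act on declarative custom resources.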

1. When Kubernetes is a Good Fit for Stateful Systems

Use Kubernetes when you need:

  • consistent platform operations across stateless and stateful services
  • declarative lifecycle management
  • automated failover and routine maintenance via operators

Avoid it while your team still lacks storage and SRE experience.

2. Storage Fundamentals You Cannot Skip

For PostgreSQL and Kafka:

  • use fast, durable storage classes
  • ensure zone-aware scheduling, so pods are placed in the same zone as their volumes and replicas are spread across zones
  • avoid oversubscribed IOPS for write-heavy workloads
  • validate backup and restore performance, not only backup completion
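The IOPS point above can be checked with simple arithmetic. The following is a minimal sketch in Python, assuming you know the IOPS provisioned for each volume on a node and the IOPS ceiling of that node; the function name and all figures are hypothetical.

```python
from typing import Sequence


def oversubscription_ratio(volume_iops: Sequence[int], node_iops_limit: int) -> float:
    """Return provisioned volume IOPS divided by what the node can deliver.

    A value above 1.0 means the volumes can collectively demand more IOPS
    than the node is able to serve.
    """
    if node_iops_limit <= 0:
        raise ValueError("node_iops_limit must be positive")
    return sum(volume_iops) / node_iops_limit


# Hypothetical figures: three volumes sharing a node capped at 10,000 IOPS.
ratio = oversubscription_ratio([3000, 3000, 6000], node_iops_limit=10_000)
print(f"oversubscription ratio: {ratio:.2f}")
```

A ratio above 1.0 does not guarantee trouble, but for write-heavy PostgreSQL or Kafka volumes it is a signal to spread the workloads across nodes or to raise the limit.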

3. PostgreSQL with Operators

A PostgreSQL operator can automate:

  • primary/replica management
  • failover
  • backups and point-in-time recovery
  • version upgrades

Key operational checks:

  • replication lag measured against a defined SLO
  • backup success and restore test evidence
  • connection pool saturation
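Two of these checks reduce to simple ratios. The sketch below assumes you already collect replication lag samples in seconds and connection counts from your monitoring system; the function names, the threshold, and the sample values are hypothetical.

```python
from typing import Sequence


def lag_slo_compliance(lag_samples_s: Sequence[float], threshold_s: float) -> float:
    """Return the fraction of lag samples at or below the threshold."""
    if not lag_samples_s:
        raise ValueError("at least one lag sample is required")
    within = sum(1 for lag in lag_samples_s if lag <= threshold_s)
    return within / len(lag_samples_s)


def pool_saturation(active_connections: int, max_connections: int) -> float:
    """Return the fraction of the connection pool currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active_connections / max_connections


# Hypothetical samples: four of five are within a 1-second lag threshold.
compliance = lag_slo_compliance([0.2, 0.5, 1.0, 4.0, 0.3], threshold_s=1.0)
saturation = pool_saturation(active_connections=90, max_connections=100)
print(f"lag SLO compliance: {compliance:.0%}, pool saturation: {saturation:.0%}")
```

Comparing the compliance figure with the SLO target (for example, 99% of samples within the threshold) turns the check into an alertable condition.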

4. Kafka with Operators

Kafka operators help with:

  • broker lifecycle
  • topic and user management
  • rolling upgrades
  • certificate and listener configuration

Design considerations:

  • partition count strategy aligned with throughput and consumer scaling
  • replication factor based on failure tolerance
  • rack/zone awareness to reduce correlated failures
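The first two considerations can be estimated before a cluster exists. This sketch uses a common sizing rule of thumb (enough partitions to satisfy both the producer and the consumer side of the target throughput) and the relationship between replication factor and Kafka's min.insync.replicas setting for producers using acks=all. The per-partition throughput figures are assumptions that you would need to measure on your own hardware.

```python
import math


def min_partitions(target_mb_s: float,
                   producer_mb_s_per_partition: float,
                   consumer_mb_s_per_partition: float) -> int:
    """Smallest partition count that meets the target on both sides."""
    if producer_mb_s_per_partition <= 0 or consumer_mb_s_per_partition <= 0:
        raise ValueError("per-partition throughput must be positive")
    needed_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    needed_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(needed_for_producers, needed_for_consumers)


def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can be lost while acks=all producers can still write."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min_insync_replicas <= replication_factor")
    return replication_factor - min_insync_replicas


# Hypothetical measurements: 10 MB/s per partition producing, 5 MB/s consuming.
print(min_partitions(100, 10, 5))       # prints 20
print(tolerated_broker_failures(3, 2))  # prints 1
```

The consumer side often dominates, because the partition count also caps how many consumers in a group can work in parallel.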

5. Failure Scenarios to Plan

PostgreSQL:

  • primary node loss
  • storage latency spikes
  • WAL archive failures

Kafka:

  • broker loss under rebalance pressure
  • under-replicated partitions
  • ISR shrink during network instability

Each scenario needs a tested runbook.
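A runbook step for under-replicated partitions usually starts by listing them. The sketch below assumes partition metadata has already been fetched (for example, through an admin client or the operator's status output) and flattened into a plain dictionary; the data shape, topic names, and broker IDs are hypothetical.

```python
from typing import Dict, List, Tuple

# (topic, partition) -> (assigned replica broker IDs, in-sync replica broker IDs)
PartitionState = Dict[Tuple[str, int], Tuple[List[int], List[int]]]


def under_replicated(partitions: PartitionState) -> List[Tuple[str, int]]:
    """Return partitions whose in-sync replica set is smaller than the replica set."""
    return sorted(
        topic_partition
        for topic_partition, (replicas, isr) in partitions.items()
        if len(isr) < len(replicas)
    )


# Hypothetical snapshot: broker 3 has dropped out of the ISR for orders-1.
snapshot: PartitionState = {
    ("orders", 0): ([1, 2, 3], [1, 2, 3]),
    ("orders", 1): ([2, 3, 1], [2, 1]),
    ("payments", 0): ([3, 1, 2], [3, 1, 2]),
}
print(under_replicated(snapshot))  # prints [('orders', 1)]
```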

6. Backup and Recovery Strategy

  • PostgreSQL: base backups + WAL archival with restore drills
  • Kafka: in-cluster topic replication for broker failures + cross-cluster replication for critical streams
  • define clear RPO/RTO by data domain
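Per-domain targets are only useful if restore drills are compared against them. This is a minimal sketch, assuming each drill records how much data was lost and how long the restore took; the class names, domains, and figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class RecoveryTarget:
    domain: str
    rpo_minutes: float  # maximum acceptable data loss
    rto_minutes: float  # maximum acceptable time to restore


@dataclass(frozen=True)
class DrillResult:
    domain: str
    data_loss_minutes: float
    restore_minutes: float


def violations(targets: Iterable[RecoveryTarget],
               drills: Iterable[DrillResult]) -> List[Tuple[str, str]]:
    """Return (domain, reason) pairs for every drill that missed its target."""
    by_domain = {target.domain: target for target in targets}
    problems: List[Tuple[str, str]] = []
    for drill in drills:
        target = by_domain.get(drill.domain)
        if target is None:
            problems.append((drill.domain, "no target defined"))
            continue
        if drill.data_loss_minutes > target.rpo_minutes:
            problems.append((drill.domain, "RPO"))
        if drill.restore_minutes > target.rto_minutes:
            problems.append((drill.domain, "RTO"))
    return problems


# Hypothetical targets and drill results.
targets = [RecoveryTarget("payments", 5, 30), RecoveryTarget("analytics", 60, 240)]
drills = [DrillResult("payments", 2, 45), DrillResult("analytics", 30, 120)]
print(violations(targets, drills))  # prints [('payments', 'RTO')]
```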

7. Performance and Cost Balance

  • right-size CPU/memory by workload profile
  • isolate noisy neighbors with dedicated node pools if required
  • tune retention policies to control storage growth
  • track cost per TB and throughput unit
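The effect of retention on storage, and from there on cost per TB, can be estimated up front. The sketch below assumes a steady ingest rate and ignores compression, index overhead, and free-space headroom, so treat the result as a rough lower bound; the function names and all figures are hypothetical.

```python
def retained_storage_tb(ingest_mb_s: float,
                        retention_hours: float,
                        replication_factor: int) -> float:
    """Estimate the storage held by a retention window, in decimal terabytes."""
    bytes_per_second = ingest_mb_s * 1_000_000
    retention_seconds = retention_hours * 3600
    total_bytes = bytes_per_second * retention_seconds * replication_factor
    return total_bytes / 1_000_000_000_000


def cost_per_tb(monthly_storage_cost: float, stored_tb: float) -> float:
    """Return monthly storage cost divided by terabytes stored."""
    if stored_tb <= 0:
        raise ValueError("stored_tb must be positive")
    return monthly_storage_cost / stored_tb


# Hypothetical workload: 10 MB/s ingest, 7-day retention, replication factor 3.
stored = retained_storage_tb(ingest_mb_s=10, retention_hours=168, replication_factor=3)
print(f"retained: {stored:.3f} TB")  # prints "retained: 18.144 TB"
print(f"cost per TB: {cost_per_tb(1800.0, stored):.2f}")
```

Halving the retention window halves the estimate, which makes the cost of a retention policy easy to discuss with the teams that own each topic.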

8. Security Baseline

  • encryption in transit and at rest
  • strict RBAC for operator CRDs
  • secret rotation for database and broker credentials
  • network policies around data-plane components

9. Practical Adoption Path

  1. Start with one non-critical stateful workload.
  2. Adopt operator defaults before custom tuning.
  3. Build backup/restore and failover drills.
  4. Expand to critical systems once those drills have built operational confidence.

Kubernetes can run stateful systems reliably, but only if you treat operations as first-class engineering work.