Stateful Workloads on Kubernetes: PostgreSQL and Kafka Operators

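Stateful workloads on Kubernetes are viable in production when you design around failure domains, storage guarantees, and operational automation. Operators make this manageable by encoding operational knowledge, such as failover, backup, and upgrade procedures, in controllers that continuously reconcile the cluster toward a declared state.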

1. When Kubernetes is a Good Fit for Stateful Systems

Use Kubernetes when you need:

consistent platform operations across stateless and stateful services
declarative lifecycle management
automated failover and routine maintenance via operators

Avoid it if your team has not yet built up storage and SRE experience.

2. Storage Fundamentals You Cannot Skip

For PostgreSQL and Kafka:

use fast, durable storage classes
ensure zone-aware scheduling, so that pods land in the same zone as their volumes and replicas are spread across zones
avoid oversubscribed IOPS for write-heavy workloads
validate backup and restore performance, not only backup completion
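
To make the last point concrete, here is a minimal sketch of checking a measured restore rate against a recovery time objective. The function names and the figures in the usage lines are illustrative assumptions, not part of any operator's API; the throughput should come from an actual restore drill, not from the storage class datasheet.

```python
def estimated_restore_seconds(data_gib: float, restore_mib_per_s: float) -> float:
    """Time to stream `data_gib` of backup data at a sustained restore rate.

    Hypothetical helper: ignores WAL replay and index rebuild time,
    so treat the result as a lower bound.
    """
    if restore_mib_per_s <= 0:
        raise ValueError("restore throughput must be positive")
    return data_gib * 1024 / restore_mib_per_s


def meets_rto(data_gib: float, restore_mib_per_s: float, rto_seconds: float) -> bool:
    """True when the estimated restore time fits inside the RTO."""
    return estimated_restore_seconds(data_gib, restore_mib_per_s) <= rto_seconds


# Assumed example: 500 GiB restored at a measured 200 MiB/s.
print(estimated_restore_seconds(500, 200))    # 2560.0 seconds
print(meets_rto(500, 200, rto_seconds=3600))  # True
print(meets_rto(500, 200, rto_seconds=1800))  # False
```

Because the estimate leaves out replay time, a result close to the RTO should be read as a failure until a full drill shows otherwise.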

3. PostgreSQL with Operators

A PostgreSQL operator can automate:

primary/replica management
failover
backups and point-in-time recovery
version upgrades

Key operational checks:

replication lag SLO
backup success and restore test evidence
connection pool saturation
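
As a sketch of how the first and third checks might be evaluated, the snippet below computes replication lag SLO compliance from a window of lag samples and a simple pool saturation ratio. The helper names, the 5-second threshold, and the sample values are assumptions for illustration; in practice the inputs would come from your metrics system.

```python
def lag_slo_compliance(lag_samples_s: list[float], threshold_s: float) -> float:
    """Fraction of lag samples at or below the threshold (0.0 to 1.0)."""
    if not lag_samples_s:
        raise ValueError("at least one lag sample is required")
    within = sum(1 for lag in lag_samples_s if lag <= threshold_s)
    return within / len(lag_samples_s)


def pool_saturation(active_connections: int, pool_size: int) -> float:
    """Share of the connection pool currently in use."""
    if pool_size <= 0:
        raise ValueError("pool size must be positive")
    return active_connections / pool_size


# Assumed window of four lag samples, in seconds.
samples = [0.5, 1.0, 2.0, 12.0]
print(lag_slo_compliance(samples, threshold_s=5.0))  # 0.75
print(pool_saturation(90, 100))                      # 0.9
```

An SLO such as "99% of samples under 5 seconds" then becomes a comparison of the compliance value against 0.99 over the chosen window.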

4. Kafka with Operators

Kafka operators help with:

broker lifecycle
topic and user management
rolling upgrades
certificate and listener configuration

Design considerations:

partition count strategy aligned with throughput and consumer scaling
replication factor (together with min.insync.replicas) based on how many broker failures you must tolerate
rack/zone awareness to reduce correlated failures
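
The first two considerations can be turned into rough arithmetic. The sketch below uses a common rule of thumb for partition count (enough partitions to satisfy both the producer side and the consumer side of the target throughput) and the relationship between replication factor and min.insync.replicas for producers using acks=all. The function names and per-partition throughput figures are assumptions; measure your own.

```python
import math


def min_partitions(target_mb_s: float,
                   producer_mb_s_per_partition: float,
                   consumer_mb_s_per_partition: float) -> int:
    """Smallest partition count that meets the target on both sides."""
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer)


def tolerated_broker_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas of a partition that can be lost while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min_insync_replicas <= replication_factor")
    return replication_factor - min_insync_replicas


# Assumed figures: 100 MB/s target, 10 MB/s per partition when producing,
# 5 MB/s per partition when consuming.
print(min_partitions(100, 10, 5))     # 20
print(tolerated_broker_losses(3, 2))  # 1
```

The partition count also caps consumer parallelism within a group, so leave headroom above the computed minimum if you expect consumers to scale out.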

5. Failure Scenarios to Plan

PostgreSQL:

primary node loss
storage latency spikes
WAL archive failures

Kafka:

broker loss under rebalance pressure
under-replicated partitions
ISR shrink during network instability

Each scenario needs a tested runbook.
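
A runbook usually starts with a detection step. As a minimal sketch for the Kafka scenarios, the snippet below flags under-replicated partitions and partitions whose in-sync replica set has shrunk below min.insync.replicas. The PartitionState type and the sample data are hypothetical; real partition metadata would come from your Kafka admin tooling or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replica assignment."""
    topic: str
    partition: int
    replicas: tuple[int, ...]  # broker ids assigned to the partition
    isr: tuple[int, ...]       # broker ids currently in sync


def under_replicated(partitions: list[PartitionState]) -> list[PartitionState]:
    """Partitions with fewer in-sync replicas than assigned replicas."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def below_min_isr(partitions: list[PartitionState],
                  min_insync_replicas: int) -> list[PartitionState]:
    """Partitions that would reject acks=all writes."""
    return [p for p in partitions if len(p.isr) < min_insync_replicas]


# Assumed sample state for a three-partition topic.
state = [
    PartitionState("orders", 0, (1, 2, 3), (1, 2, 3)),
    PartitionState("orders", 1, (1, 2, 3), (1, 2)),
    PartitionState("orders", 2, (1, 2, 3), (3,)),
]
print([p.partition for p in under_replicated(state)])  # [1, 2]
print([p.partition for p in below_min_isr(state, 2)])  # [2]
```

The two lists call for different urgency: under-replication reduces redundancy, while falling below the minimum ISR blocks producers that require full acknowledgement.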

6. Backup and Recovery Strategy

PostgreSQL: base backups + WAL archival with restore drills
Kafka: topic replication + cross-cluster replication for critical streams
define clear RPO/RTO by data domain
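
For PostgreSQL, an RPO translates into a bound on how stale the most recent archived WAL segment may be. A minimal sketch of that check follows; the function names and the 5-minute RPO are assumptions, and the timestamp of the last successful archive would come from your operator's status or monitoring.

```python
from datetime import datetime, timedelta, timezone


def rpo_exposure(last_archived_at: datetime, now: datetime) -> timedelta:
    """Window of changes that would be lost if the primary failed right now."""
    return now - last_archived_at


def rpo_breached(last_archived_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the unarchived window exceeds the agreed RPO."""
    return rpo_exposure(last_archived_at, now) > rpo


# Assumed example: last WAL archive succeeded 10 minutes ago, RPO is 5 minutes.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_ok = now - timedelta(minutes=10)
print(rpo_breached(last_ok, now, rpo=timedelta(minutes=5)))   # True
print(rpo_breached(last_ok, now, rpo=timedelta(minutes=15)))  # False
```

Running this per data domain, each with its own RPO, keeps the alert thresholds tied to the objectives rather than to a single cluster-wide default.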

7. Performance and Cost Balance

right-size CPU/memory by workload profile
isolate noisy neighbors with dedicated node pools if required
tune retention policies to control storage growth
track cost per TB and throughput unit
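
The effect of retention on storage, and therefore cost, can be estimated with simple arithmetic: ingress rate times retention period times replication factor. The sketch below assumes a steady ingress rate and no compression or compaction; the function names and the price per TB are placeholders, not real pricing.

```python
def retained_bytes(ingress_bytes_per_s: int,
                   retention_seconds: int,
                   replication_factor: int) -> int:
    """Steady-state bytes on disk across all replicas of a topic."""
    return ingress_bytes_per_s * retention_seconds * replication_factor


def monthly_storage_cost(stored_bytes: int, price_per_tb_month: float) -> float:
    """Cost using decimal terabytes (1 TB = 10**12 bytes)."""
    return stored_bytes / 10**12 * price_per_tb_month


# Assumed workload: 10 MB/s ingress, 7 days retention, replication factor 3.
seven_days = 7 * 24 * 3600
stored = retained_bytes(10_000_000, seven_days, 3)
print(stored)  # 18144000000000 bytes, about 18.1 TB

# Halving retention halves the stored volume under the same assumptions.
print(retained_bytes(10_000_000, seven_days // 2, 3))  # 9072000000000
```

Passing the stored volume to monthly_storage_cost with your provider's actual rate gives a per-topic figure that can be tracked alongside throughput.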

8. Security Baseline

encryption in transit and at rest
strict RBAC for operator CRDs
secret rotation for database and broker credentials
network policies around data-plane components

9. Practical Adoption Path

Start with one non-critical stateful workload.
Adopt operator defaults before custom tuning.
Build backup/restore and failover drills.
Expand to critical systems once those drills have built operational confidence.

Kubernetes can run stateful systems reliably, but only if you treat operations as first-class engineering work.