Kubernetes Observability Stack: Prometheus, OpenTelemetry, and Loki
Observability is not about collecting more telemetry; it is about reducing the time to detect and resolve failures. A useful stack must connect metrics, logs, and traces around service-level objectives (SLOs).
1. Architecture Overview
A practical stack:
- Metrics: Prometheus
- Logs: Loki
- Traces: OpenTelemetry Collector + tracing backend
- Visualization: Grafana
- Alerting: Alertmanager integrated with on-call channels
2. Metrics: Start with Golden Signals
For each service, track:
- latency
- traffic
- errors
- saturation
Also include Kubernetes platform metrics:
- pod restart rate
- node pressure signals
- API server latency/error rate
- HPA scaling events
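To make the four golden signals concrete, here is a minimal sketch that computes them from raw request samples for a single window. The `RequestSample` record, the `golden_signals` function, and the use of CPU cores as the saturation measure are all assumptions made for illustration. In a Prometheus setup these values would normally be derived from counters and histograms rather than from raw samples.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status_code: int


def golden_signals(samples, window_seconds, cpu_used_cores, cpu_limit_cores):
    """Summarize latency, traffic, errors, and saturation for one window."""
    count = len(samples)
    # Saturation is approximated here as CPU used divided by CPU limit.
    saturation = cpu_used_cores / cpu_limit_cores
    if count == 0:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_ratio": 0.0, "saturation": saturation}

    durations = sorted(s.duration_ms for s in samples)
    # Nearest-rank p95: the smallest sample with at least 95% of
    # samples at or below it. Integer math avoids float rounding.
    rank = (95 * count + 99) // 100
    p95 = durations[rank - 1]

    # Only 5xx responses count as errors in this sketch.
    server_errors = sum(1 for s in samples if s.status_code >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_seconds,
        "error_ratio": server_errors / count,
        "saturation": saturation,
    }
```

Whether 4xx responses should also count as errors depends on the service; the sketch counts only 5xx to keep the definition simple.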
3. Logging: Structured and Queryable
Guidelines:
- JSON logs with stable fields
- correlation fields (trace_id, span_id, request_id)
- log level discipline
- retention by workload criticality
Avoid storing everything forever. Use retention tiers so that logs from critical workloads are kept longer than logs from low-criticality ones.
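The guidelines on stable JSON fields and correlation fields can be sketched with Python's standard `logging` module. The `JsonFormatter` class and its exact field names are assumptions for illustration, not a prescribed schema; the point is that every log line carries the same keys, with the correlation fields present even when empty.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a stable set of keys."""

    # Correlation fields from the guidelines above; missing ones become null.
    CORRELATION_FIELDS = ("trace_id", "span_id", "request_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CORRELATION_FIELDS:
            # Values passed through `extra=` become attributes on the record.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("order placed", extra={"trace_id": "abc123", "request_id": "r-1"})
```

Emitting the correlation keys on every line, even as null, keeps log queries uniform because a filter on `trace_id` never has to handle a missing field.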
4. Tracing: Instrument Critical Paths
Instrument high-value workflows first:
- login/authentication
- checkout/payment
- async pipeline handoff
The OpenTelemetry Collector should handle:
- sampling strategy
- attribute processing
- export routing
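As one illustration of a sampling strategy, the sketch below keeps every trace that contains an error or is slow, and keeps a fixed fraction of the rest. Because it needs the final outcome of the trace, this is a tail-based decision. The `keep_trace` function, the `slow_ms` and `baseline_ratio` parameters, and their default values are assumptions for illustration; the Collector's own samplers are configured declaratively and may use a different algorithm.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms,
               slow_ms=1000.0, baseline_ratio=0.1):
    """Decide whether to keep a completed trace (illustrative policy)."""
    # Always keep traces that are most useful during incidents.
    if has_error or duration_ms >= slow_ms:
        return True

    # Hash the trace ID so the same trace always gets the same decision,
    # regardless of which process evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)

    # Compare as integers so a ratio of 1.0 keeps everything
    # and a ratio of 0.0 keeps nothing.
    threshold = int(baseline_ratio * 2**64)
    return bucket < threshold
```

Deriving the decision from the trace ID rather than from a random number means every component that applies the same policy agrees on which traces survive, so kept traces are not missing spans.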
5. SLO-Driven Alerting
Alert on user impact, not raw infrastructure noise.
Examples:
- API p95 latency above SLO threshold
- error budget burn rate above the agreed threshold for the alert window
- sustained 5xx increase
Use multi-window, multi-burn-rate alerts to reduce false positives.
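A minimal sketch of the multi-window check follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The function names are hypothetical, and the threshold of 14.4 with a 1-hour long window and a 5-minute short window is a commonly cited example value, not a recommendation for every service.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than 'sustainable' the budget is burning."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold):
    """Page only if both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which suppresses alerts for resolved spikes.
    """
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    return long_burn >= threshold and short_burn >= threshold


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, 2% errors over both 1 hour and 5 minutes.
    print(should_page(0.02, 0.02, slo_target=0.999, threshold=14.4))
```

A multi-burn-rate setup would evaluate `should_page` several times with different window pairs and thresholds, using high thresholds for paging and lower ones for tickets.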
6. Dashboard Design
Each service dashboard should include:
- SLO status + error budget
- traffic, latency, errors
- dependency health
- recent deployments and incidents
Add direct links to runbooks and recent trace samples.
7. Common Anti-Patterns
- too many low-value alerts
- no ownership on dashboards
- metrics/logs/traces disconnected
- missing deploy metadata in telemetry
8. Rollout Plan
Phase 1:
- standardize metrics and logging format
- define SLOs for top 5 critical services
Phase 2:
- trace critical request paths
- tune alerts based on incident feedback
Phase 3:
- set observability coverage targets by team
- optimize the cost of telemetry retention
Production Checklist
- Golden signals and SLOs defined.
- Logs include trace correlation fields.
- Tracing covers critical workflows.
- Alert rules tested against incident scenarios.
- Runbook links embedded in dashboards.
With this baseline, your observability stack becomes an operational decision system, not a data landfill.