Kubernetes Observability Stack: Prometheus, OpenTelemetry, and Loki


Observability is not about collecting more telemetry. It is about reducing the time to detect and resolve failures. A useful stack connects metrics, logs, and traces around service-level objectives (SLOs).

1. Architecture Overview

A practical stack:

  • Metrics: Prometheus
  • Logs: Loki
  • Traces: OpenTelemetry Collector plus a tracing backend (e.g., Tempo or Jaeger)
  • Visualization: Grafana
  • Alerting: Alertmanager integrated with on-call channels

2. Metrics: Start with Golden Signals

For each service, track:

  • latency
  • traffic
  • errors
  • saturation

Also include Kubernetes platform metrics:

  • pod restart rate
  • node pressure signals
  • API server latency/error rate
  • HPA scaling events
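The per-service signals above can be made concrete with a small in-process aggregator. This is a stdlib-only sketch of what a client library tracks for you; in production you would expose these via a Prometheus client library rather than hand-rolling them, and all names here are illustrative:

```python
import statistics
from collections import deque


class GoldenSignals:
    """Illustrative per-service aggregator for the four golden signals.
    In production, export these as Prometheus metrics instead."""

    def __init__(self, window_size=1000):
        self.latencies_ms = deque(maxlen=window_size)  # latency (sliding window)
        self.request_count = 0                         # traffic
        self.error_count = 0                           # errors
        self.in_flight = 0                             # saturation proxy (unused here)

    def record(self, latency_ms, is_error):
        """Record one completed request."""
        self.request_count += 1
        self.latencies_ms.append(latency_ms)
        if is_error:
            self.error_count += 1

    def p95_latency_ms(self):
        """95th-percentile latency over the sliding window."""
        if not self.latencies_ms:
            return 0.0
        # quantiles(n=20) returns 19 cut points; the last is the p95.
        return statistics.quantiles(self.latencies_ms, n=20)[-1]

    def error_rate(self):
        """Fraction of requests that failed."""
        if self.request_count == 0:
            return 0.0
        return self.error_count / self.request_count


# Simulate 100 requests with rising latency and a 4% error rate.
signals = GoldenSignals()
for i in range(100):
    signals.record(latency_ms=10 + i, is_error=(i % 25 == 0))

print(signals.error_rate())  # 0.04
```

Saturation is the odd one out: it is usually a resource-level signal (queue depth, in-flight requests, CPU throttling) rather than a per-request counter, which is why it is only a placeholder field here.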

3. Logging: Structured and Queryable

Guidelines:

  • JSON logs with stable fields
  • correlation fields (trace_id, span_id, request_id)
  • log level discipline
  • retention by workload criticality

Avoid storing everything forever. Use retention tiers.
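A minimal sketch of the first two guidelines, using only the Python standard library: one JSON object per line with stable field names, including the correlation fields. In a real service the trace and span IDs would be read from the active OpenTelemetry trace context rather than passed by hand:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with stable field names."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation fields; a real service pulls these from the
            # active trace context instead of `extra` kwargs.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line with trace_id and request_id populated.
logger.info("payment authorized",
            extra={"trace_id": "abc123", "request_id": "req-42"})
```

Because every line carries `trace_id`, a Loki query can pivot from a log entry straight to the matching trace, which is what makes the three signals navigable as one system.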

4. Tracing: Instrument Critical Paths

Instrument high-value workflows first:

  • login/authentication
  • checkout/payment
  • async pipeline handoff

The OpenTelemetry Collector should handle:

  • sampling strategy
  • attribute processing
  • export routing
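The key property of a good sampling strategy is that the decision is deterministic on the trace ID, so every hop in a request path keeps or drops the same trace. A stdlib-only sketch of that idea (the Collector's probabilistic sampling works on the same principle; the hashing arithmetic here is illustrative, not the Collector's actual algorithm):

```python
import hashlib


def sample_trace(trace_id: str, ratio: float) -> bool:
    """Deterministic head sampling: hash the trace ID into [0, 1)
    and keep the trace if it falls below the configured ratio.
    Every service that sees the same trace_id makes the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # First 8 bytes of the hash as an unsigned integer in [0, 2^64).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# With a 10% ratio, roughly 10% of trace IDs are kept.
kept = sum(sample_trace(f"trace-{i:032x}", 0.10) for i in range(10_000))
print(kept)  # roughly 1000
```

Tail sampling (deciding after the trace completes, e.g. keeping all error traces) is also done in the Collector, but it requires buffering whole traces and routing all spans of a trace to the same Collector instance.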

5. SLO-Driven Alerting

Alert on user impact, not raw infrastructure noise.

Examples:

  • API p95 latency above SLO threshold
  • error budget burn rate too high
  • sustained 5xx increase

Use multi-window, multi-burn-rate alerts to reduce false positives.
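The multi-window, multi-burn-rate idea fits in a few lines: page only when both a long and a short window are burning the error budget fast, so sustained impact is proven and already-recovered spikes are filtered out. The 14.4 threshold below is one of the commonly cited SRE-workbook pairings (paired with 1h/5m windows); the function itself is an illustrative sketch:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed: 1.0 means exactly
    on budget; 14.4 means a 30-day budget burns in about 2 days."""
    budget = 1.0 - slo
    return error_rate / budget


def should_page(long_window_rate: float, short_window_rate: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold:
    the long window proves sustained impact, the short window
    proves the problem is still happening."""
    return (burn_rate(long_window_rate, slo) >= threshold
            and burn_rate(short_window_rate, slo) >= threshold)


# Sustained outage: 2% errors in both the 1h and 5m windows -> page.
print(should_page(0.02, 0.02))    # True
# Recovered spike: 1h window still elevated, 5m window clean -> no page.
print(should_page(0.02, 0.0005))  # False
```

In practice these conditions are written as Prometheus alerting rules over recording-rule ratios per window; the Python here only shows the decision logic.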

6. Dashboard Design

Each service dashboard should include:

  • SLO status + error budget
  • traffic, latency, errors
  • dependency health
  • recent deployments and incidents

Add direct links to runbooks and recent trace samples.

7. Common Anti-Patterns

  • too many low-value alerts
  • no ownership on dashboards
  • metrics/logs/traces disconnected
  • missing deploy metadata in telemetry

8. Rollout Plan

Phase 1:

  • standardize metrics and logging format
  • define SLOs for top 5 critical services

Phase 2:

  • trace critical request paths
  • tune alerts based on incident feedback

Phase 3:

  • observability coverage targets by team
  • cost optimization for telemetry retention

Production Checklist

  1. Golden signals and SLOs defined.
  2. Logs include trace correlation fields.
  3. Tracing covers critical workflows.
  4. Alert rules tested against incident scenarios.
  5. Runbook links embedded in dashboards.

With this baseline, your observability stack becomes an operational decision system, not a data landfill.