[15/24] L is for Logging and Monitoring: Observability in Kubernetes


This is Post #15 in the Kubernetes A-to-Z Series

Reading Order: Previous: YAML | Next: Troubleshooting

Series Progress: 15/24 complete | Difficulty: Intermediate | Time: 30 min | Part 5/6: Operations

Welcome to the fifteenth post in our Kubernetes A-to-Z Series! In this post we explore Logging and Monitoring: the foundation of observability in Kubernetes. Understanding what’s happening inside your cluster is essential for maintaining reliable applications.

The Three Pillars of Observability

┌─────────────────────────────────────────────────┐
│  Observability                                  │
│                                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │  Logs   │  │ Metrics │  │ Traces  │         │
│  │         │  │         │  │         │         │
│  │ What    │  │ How     │  │ Where   │         │
│  │happened │  │ much    │  │ requests│         │
│  │         │  │         │  │ flow    │         │
│  └─────────┘  └─────────┘  └─────────┘         │
│                                                 │
│  Together: Complete system visibility           │
└─────────────────────────────────────────────────┘

Why Observability Matters

  • Troubleshooting: Quickly identify and resolve issues
  • Performance: Detect bottlenecks and optimize
  • Reliability: Proactive alerting before failures
  • Capacity Planning: Data-driven scaling decisions
  • Security: Detect anomalies and threats

Logging in Kubernetes

Logging Architecture

┌─────────────────────────────────────────────────┐
│  Kubernetes Logging                             │
│                                                 │
│  ┌─────────┐                                    │
│  │   Pod   │──► stdout/stderr                   │
│  └─────────┘         │                          │
│                      ▼                          │
│  ┌─────────────────────────────────────┐        │
│  │  Node (kubelet)                     │        │
│  │  /var/log/containers/*.log          │        │
│  └──────────────┬──────────────────────┘        │
│                 │                               │
│                 ▼                               │
│  ┌─────────────────────────────────────┐        │
│  │  Log Collector (Fluentd/Fluent Bit) │        │
│  └──────────────┬──────────────────────┘        │
│                 │                               │
│                 ▼                               │
│  ┌─────────────────────────────────────┐        │
│  │  Storage (Elasticsearch/Loki)       │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘
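
Each entry in those node log files is a single JSON line (on Docker-based nodes; containerd's CRI format differs slightly). A minimal Python sketch of the first thing a log collector does with one line:

```python
import json

# One line from /var/log/containers/*.log as written by the Docker
# json-file log driver (containerd's CRI format differs slightly).
raw = '{"log":"GET /healthz 200\\n","stream":"stdout","time":"2023-05-01T12:00:00.123456789Z"}'

entry = json.loads(raw)
message = entry["log"].rstrip("\n")  # the actual application output
print(entry["stream"], entry["time"], message)
```

The `stream` field preserves whether the application wrote to stdout or stderr, which collectors typically keep as a searchable label.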

Basic Log Commands

# View pod logs
kubectl logs pod-name
kubectl logs pod-name -c container-name  # specific container
kubectl logs pod-name --previous         # previous instance
kubectl logs pod-name -f                 # follow logs
kubectl logs pod-name --tail=100         # last 100 lines
kubectl logs pod-name --since=1h         # last hour

# View logs from multiple pods
kubectl logs -l app=webapp --all-containers
kubectl logs -l app=webapp -f

# View logs from deployment
kubectl logs deployment/webapp

Fluent Bit DaemonSet

# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch.logging.svc
        Port            9200
        Index           kubernetes
        Type            _doc

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

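The `Merge_Log On` setting deserves a closer look: when a container emits structured JSON, the kubernetes filter parses the `log` field and lifts its keys into the top-level record, making them searchable. A rough Python sketch of that behaviour (the function name is ours, not Fluent Bit's):

```python
import json

def merge_log(record):
    """Rough sketch of Fluent Bit's Merge_Log behaviour: if the 'log'
    field is valid JSON, lift its keys into the top-level record."""
    try:
        parsed = json.loads(record.get("log", ""))
    except json.JSONDecodeError:
        return record  # plain-text log line: leave the record untouched
    if isinstance(parsed, dict):
        merged = dict(record)
        merged.update(parsed)
        return merged
    return record

record = {"log": '{"level":"error","msg":"db timeout"}', "stream": "stderr"}
print(merge_log(record))
```

After merging, a query like `level: error` matches directly instead of requiring a substring search over the raw line.
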
Loki for Log Aggregation

# loki-stack.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    schema_config:
      configs:
        - from: 2020-01-01
          store: boltdb
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
    storage_config:
      boltdb:
        directory: /loki/index
      filesystem:
        directory: /loki/chunks
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:2.8.0
        ports:
        - containerPort: 3100
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /loki
      volumes:
      - name: config
        configMap:
          name: loki-config
      - name: storage
        persistentVolumeClaim:
          claimName: loki-pvc

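Collectors such as Promtail or Fluent Bit ship logs to Loki through its push API (`POST /loki/api/v1/push`), where each stream pairs a label set with nanosecond-timestamped lines. A small Python sketch of building that payload (the helper name is illustrative):

```python
import json
import time

def loki_payload(labels, lines):
    """Build a request body for Loki's push API (POST /loki/api/v1/push).
    Timestamps are nanosecond strings; the label set becomes the
    stream selector you later query with LogQL."""
    now_ns = str(time.time_ns())
    return {
        "streams": [
            {"stream": labels, "values": [[now_ns, line] for line in lines]}
        ]
    }

body = loki_payload({"app": "webapp", "namespace": "production"},
                    ["GET /healthz 200", "GET /orders 500"])
print(json.dumps(body)[:80])
```

Keeping the label set small and low-cardinality matters here: every distinct label combination becomes a separate stream in Loki's index.
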
Metrics with Prometheus

Prometheus Architecture

┌─────────────────────────────────────────────────┐
│  Prometheus Stack                               │
│                                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │ App Pod │  │ App Pod │  │ App Pod │         │
│  │ /metrics│  │ /metrics│  │ /metrics│         │
│  └────┬────┘  └────┬────┘  └────┬────┘         │
│       │            │            │               │
│       └────────────┼────────────┘               │
│                    │                            │
│                    ▼                            │
│  ┌─────────────────────────────────────┐        │
│  │  Prometheus Server                  │        │
│  │  - Scrapes metrics                  │        │
│  │  - Stores time-series data          │        │
│  │  - Evaluates alert rules            │        │
│  └──────────────┬──────────────────────┘        │
│                 │                               │
│       ┌─────────┴─────────┐                     │
│       ▼                   ▼                     │
│  ┌─────────┐        ┌─────────┐                 │
│  │ Grafana │        │Alertmgr │                 │
│  └─────────┘        └─────────┘                 │
└─────────────────────────────────────────────────┘

Installing Prometheus Stack

# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123

ServiceMonitor for Application Metrics

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
    - production
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s

Prometheus Rules and Alerts

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: webapp.rules
    rules:
    # Recording rule
    - record: webapp:request_rate:5m
      expr: rate(http_requests_total{app="webapp"}[5m])

    # Alert rules
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{app="webapp",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{app="webapp"}[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.instance }}"
        description: "Error rate is {{ $value | humanizePercentage }}"

    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) by (le)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s"

    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"

    - alert: HighMemoryUsage
      expr: |
        container_memory_usage_bytes{container!=""}
        /
        (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.pod }}"

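The `histogram_quantile` function used in the HighLatency alert estimates a quantile by linear interpolation inside cumulative buckets. A pure-Python sketch of the idea (simplified from PromQL's actual implementation):

```python
def histogram_quantile(q, buckets):
    """Sketch of PromQL's histogram_quantile: buckets is a list of
    (upper_bound, cumulative_count) pairs sorted by bound, ending with
    float('inf'). Interpolates linearly inside the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: return the last
                # finite upper bound, as PromQL does.
                return prev_bound
            width = bound - prev_bound
            return prev_bound + width * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# e.g. http_request_duration_seconds_bucket counts over a 5m window
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ~0.778s
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.
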
Alertmanager Configuration

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'

    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
      - match:
          severity: warning
        receiver: 'slack'

    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        send_resolved: true

    - name: 'slack'
      slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: '<pagerduty-key>'
        send_resolved: true

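The routing tree above is easy to misread, so here is a rough Python model of how the top-level routes pick a receiver (a simplification: real Alertmanager also supports nested routes, `continue`, and regex matchers):

```python
def pick_receiver(labels, routes, default="default"):
    """Sketch of Alertmanager top-level routing: the first child route
    whose 'match' labels are all present on the alert wins; otherwise
    the default receiver handles it."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
]
print(pick_receiver({"alertname": "HighErrorRate", "severity": "critical"}, routes))
```

Route order matters: a critical alert never reaches the warning route because matching stops at the first hit.
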
Application Instrumentation

Go Application

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// instrument records a count and duration for each request the wrapped
// handler serves. The status label is simplified to 200 here; wrap the
// ResponseWriter to capture the real status code.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(http.StatusOK)).Inc()
        httpRequestDuration.WithLabelValues(r.Method, path).Observe(time.Since(start).Seconds())
    }
}

func main() {
    http.HandleFunc("/healthz", instrument("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Node.js Application

const client = require('prom-client');
const express = require('express');

const app = express();

// Create metrics
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, path: req.path }, duration);
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);

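Whichever library you use, the `/metrics` endpoint serves the same plain-text exposition format that Prometheus scrapes. A stdlib-only Python sketch of rendering it (the helper is illustrative, not a real client library):

```python
def render_metrics(name, help_text, mtype, samples):
    """Render samples in the Prometheus text exposition format served
    by /metrics endpoints. samples: list of (label_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

text = render_metrics(
    "http_requests_total", "Total HTTP requests", "counter",
    [({"method": "GET", "path": "/orders", "status": "200"}, 42)],
)
print(text)
```

In practice you should reach for an official client library, which also handles escaping, timestamps, and counter semantics correctly.
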
Distributed Tracing with Jaeger

Jaeger Deployment

# jaeger.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.47
        ports:
        - containerPort: 16686  # UI
        - containerPort: 14268  # HTTP collector
        - containerPort: 14250  # gRPC collector
        - containerPort: 6831   # UDP agent
          protocol: UDP
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: collector-http
    port: 14268
  - name: collector-grpc
    port: 14250
  - name: agent
    port: 6831
    protocol: UDP

OpenTelemetry Instrumentation

# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024

    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]

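Tracing only works if context propagates between services, typically via the W3C `traceparent` header that OpenTelemetry SDKs inject into outgoing requests. A small Python sketch of parsing one (real SDKs do this for you):

```python
def parse_traceparent(header):
    """Parse a W3C traceparent header (version-traceid-spanid-flags),
    the context OTLP-instrumented services propagate between hops."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,  # 16 bytes hex: shared by the whole request
        "span_id": span_id,    # 8 bytes hex: unique per hop
        "sampled": int(flags, 16) & 0x01 == 1,
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["sampled"])
```

The shared `trace_id` is what lets Jaeger stitch spans from different services into one request timeline.
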
Grafana Dashboards

Dashboard ConfigMap

# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "dashboard": {
        "title": "Web Application Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Requests/s"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Error Rate"
              }
            ]
          },
          {
            "title": "P95 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m])) by (le))",
                "legendFormat": "P95"
              }
            ]
          }
        ]
      }
    }

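Since the Grafana sidecar loads whatever JSON the ConfigMap carries, it pays to sanity-check dashboards before applying them. A quick Python sketch (using a trimmed copy of the JSON above):

```python
import json

# Trimmed copy of the dashboard JSON for illustration: validate it
# parses and list each panel's title and PromQL expression.
dashboard_json = '''{"dashboard": {"title": "Web Application Dashboard",
  "panels": [{"title": "Request Rate",
              "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]}]}}'''

dashboard = json.loads(dashboard_json)
for panel in dashboard["dashboard"]["panels"]:
    for target in panel.get("targets", []):
        print(panel["title"], "->", target["expr"])
```

A malformed ConfigMap fails silently in the sidecar, so catching a `json.JSONDecodeError` here is much faster than hunting through Grafana logs.
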
Troubleshooting Commands

# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets

# Check Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093

# View Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Check metrics endpoint
kubectl exec -it webapp-pod -- curl localhost:8080/metrics

# View logs (kube-prometheus-stack uses app.kubernetes.io labels)
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Key Takeaways

  • Three pillars: Logs, Metrics, and Traces provide complete observability
  • Prometheus scrapes and stores time-series metrics
  • Grafana visualizes metrics with dashboards
  • Alertmanager routes alerts to appropriate channels
  • Loki/ELK aggregates and searches logs
  • Jaeger/OpenTelemetry provides distributed tracing
  • Instrument applications with standard metrics libraries

Next Steps

Now that you understand observability, you’re ready to explore Troubleshooting in the next post. We’ll learn systematic approaches to debugging Kubernetes issues.

Series Navigation:

Complete Series: Kubernetes A-to-Z Series Overview