[15/24] L is for Logging and Monitoring: Observability in Kubernetes
This is Post #15 in the Kubernetes A-to-Z Series
Reading Order: Previous: YAML | Next: Troubleshooting
Series Progress: 15/24 complete | Difficulty: Intermediate | Time: 30 min | Part 5/6: Operations
Welcome to the fifteenth post in our Kubernetes A-to-Z Series! Now that you understand YAML, let's explore Logging and Monitoring, the foundation of observability in Kubernetes. Understanding what's happening inside your cluster is essential for maintaining reliable applications.
The Three Pillars of Observability
┌─────────────────────────────────────────────────┐
│ Observability │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │
│ │ │ │ │ │ │ │
│ │ What │ │ How │ │ Where │ │
│ │happened │ │ much │ │ requests│ │
│ │ │ │ │ │ flow │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Together: Complete system visibility │
└─────────────────────────────────────────────────┘
Why Observability Matters
- Troubleshooting: Quickly identify and resolve issues
- Performance: Detect bottlenecks and optimize
- Reliability: Proactive alerting before failures
- Capacity Planning: Data-driven scaling decisions
- Security: Detect anomalies and threats
Logging in Kubernetes
Logging Architecture
┌─────────────────────────────────────────────────┐
│ Kubernetes Logging │
│ │
│ ┌─────────┐ │
│ │ Pod │──► stdout/stderr │
│ └─────────┘ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Node (kubelet) │ │
│ │ /var/log/containers/*.log │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Log Collector (Fluentd/Fluent Bit) │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Storage (Elasticsearch/Loki) │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
Basic Log Commands
# View pod logs
kubectl logs pod-name
kubectl logs pod-name -c container-name # specific container
kubectl logs pod-name --previous # previous instance
kubectl logs pod-name -f # follow logs
kubectl logs pod-name --tail=100 # last 100 lines
kubectl logs pod-name --since=1h # last hour
# View logs from multiple pods
kubectl logs -l app=webapp --all-containers
kubectl logs -l app=webapp -f
# View logs from deployment
kubectl logs deployment/webapp
Fluent Bit DaemonSet
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush           1
        Log_Level       info
        Parsers_File    parsers.conf
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_URL         https://kubernetes.default.svc:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log        On
    [OUTPUT]
        Name   es
        Match  *
        Host   elasticsearch.logging.svc
        Port   9200
        Index  kubernetes
        Type   _doc
  parsers.conf: |
    [PARSER]
        Name         docker
        Format       json
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L
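The docker parser above assumes Docker's JSON log format. On clusters running containerd or CRI-O (the default on most current distributions), container logs use the CRI format instead, so the tail input would use Fluent Bit's built-in CRI multiline parser — a sketch of the adjusted input section:

```
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  cri
    Tag               kube.*
    Refresh_Interval  5
```

With containerd, the `/var/lib/docker/containers` host mount is also unnecessary, since kubelet writes logs under /var/log/pods.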
Loki for Log Aggregation
# loki-stack.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    schema_config:
      configs:
      - from: 2020-01-01
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    storage_config:
      boltdb:
        directory: /loki/index
      filesystem:
        directory: /loki/chunks
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:2.8.0
        args:
        - -config.file=/etc/loki/loki.yaml   # point Loki at the mounted ConfigMap
        ports:
        - containerPort: 3100
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /loki
      volumes:
      - name: config
        configMap:
          name: loki-config
      - name: storage
        persistentVolumeClaim:
          claimName: loki-pvc
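Loki only stores logs; an agent such as Promtail ships them in, and Grafana queries them with LogQL. A few illustrative queries, assuming logs are labeled with `app` as in the examples above:

```
{app="webapp"}                               # all logs for the app
{app="webapp"} |= "error"                    # only lines containing "error"
sum(rate({app="webapp"} |= "error" [5m]))    # error-line rate over 5 minutes
```

The metric-style forms (`rate`, `sum`) are what make Loki useful for alerting on log contents, not just searching them.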
Metrics with Prometheus
Prometheus Architecture
┌─────────────────────────────────────────────────┐
│ Prometheus Stack │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ App Pod │ │ App Pod │ │ App Pod │ │
│ │ /metrics│ │ /metrics│ │ /metrics│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Prometheus Server │ │
│ │ - Scrapes metrics │ │
│ │ - Stores time-series data │ │
│ │ - Evaluates alert rules │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Grafana │ │Alertmgr │ │
│ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────┘
Installing Prometheus Stack
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123
ServiceMonitor for Application Metrics
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
    - production
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s
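A ServiceMonitor selects Services, not Pods, and `port: metrics` refers to a named port on that Service. The webapp Service would therefore need a matching label and port name — a sketch, with names assumed to line up with the ServiceMonitor above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  labels:
    app: webapp          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: webapp
  ports:
  - name: metrics        # must match the ServiceMonitor's `port: metrics`
    port: 8080
    targetPort: 8080
```

If targets don't appear in Prometheus, a mismatch between these names and labels is the most common cause.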
Prometheus Rules and Alerts
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: webapp.rules
    rules:
    # Recording rule
    - record: webapp:request_rate:5m
      expr: rate(http_requests_total{app="webapp"}[5m])
    # Alert rules
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{app="webapp",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{app="webapp"}[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.instance }}"
        description: "Error rate is {{ $value | humanizePercentage }}"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) by (le)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    - alert: HighMemoryUsage
      expr: |
        container_memory_usage_bytes{container!=""}
        /
        container_spec_memory_limit_bytes{container!=""} > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.pod }}"
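The HighLatency alert relies on histogram_quantile, which estimates a quantile by linearly interpolating inside cumulative (`le`) histogram buckets. A rough Python sketch of that estimation — simplified relative to Prometheus (one instant vector, none of the edge-case extrapolation rules):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative Prometheus-style buckets.

    buckets: sorted list of (le, cumulative_count), ending with le = +inf.
    Finds the bucket containing the target rank and interpolates linearly.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile falls in the open-ended bucket
            # linear interpolation within (prev_le, le]
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # about 0.75s
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.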
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
      - match:
          severity: warning
        receiver: 'slack'
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        send_resolved: true
    - name: 'slack'
      slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: '<pagerduty-key>'
        send_resolved: true
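A common addition, not shown above, is an inhibition rule: when a critical alert is firing, Alertmanager suppresses the matching warning-level alert so on-call engineers get one page instead of two. A sketch of the stanza, which would sit alongside route: and receivers: in alertmanager.yaml:

```yaml
inhibit_rules:
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal: ['alertname', 'namespace']   # suppress only the same alert in the same namespace
```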
Application Instrumentation
Go Application
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// instrument records a count and duration for each request to a handler.
// (A production version would wrap http.ResponseWriter to capture the real
// status code; here we assume 200 for brevity.)
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(http.StatusOK)).Inc()
		httpRequestDuration.WithLabelValues(r.Method, path).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/", instrument("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
Node.js Application
const client = require('prom-client');
const express = require('express');
const app = express();

// Create metrics
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration for every request
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, path: req.path }, duration);
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);
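Both examples ultimately serve the same thing: the Prometheus text exposition format, plain lines of `name{labels} value` preceded by HELP and TYPE comments. A minimal Python sketch of what a /metrics response looks like for a counter (format illustration only — real applications should use an official client library, as above):

```python
def render_counter(name, help_text, samples):
    """Render a counter in Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "path": "/", "status": "200"}, 42)],
))
```

Because the format is plain text, `curl localhost:8080/metrics` is always a valid first debugging step.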
Distributed Tracing with Jaeger
Jaeger Deployment
# jaeger.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.47
        ports:
        - containerPort: 16686   # UI
        - containerPort: 14268   # HTTP collector
        - containerPort: 14250   # gRPC collector
        - containerPort: 6831    # UDP agent
          protocol: UDP
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: collector
    port: 14268
  - name: grpc
    port: 14250
  - name: agent
    port: 6831
    protocol: UDP
OpenTelemetry Instrumentation
# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
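Applications instrumented with an OpenTelemetry SDK then only need to be pointed at the collector, which by convention is done through standard environment variables on the app container. A sketch, assuming the collector is exposed as a Service named otel-collector in the monitoring namespace (not defined above):

```yaml
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://otel-collector.monitoring.svc:4317"   # OTLP gRPC receiver
- name: OTEL_SERVICE_NAME
  value: "webapp"                                      # shown in Jaeger's service list
```

Keeping export configuration in environment variables means the same image can report to different collectors in different environments.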
Grafana Dashboards
Dashboard ConfigMap
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "dashboard": {
        "title": "Web Application Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Requests/s"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Error Rate"
              }
            ]
          },
          {
            "title": "P95 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m])) by (le))",
                "legendFormat": "P95"
              }
            ]
          }
        ]
      }
    }
Troubleshooting Commands
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets
# Check Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# View Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Check metrics endpoint
kubectl exec -it webapp-pod -- curl localhost:8080/metrics
# View logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
Key Takeaways
- Three pillars: Logs, Metrics, and Traces provide complete observability
- Prometheus scrapes and stores time-series metrics
- Grafana visualizes metrics with dashboards
- Alertmanager routes alerts to appropriate channels
- Loki/ELK aggregates and searches logs
- Jaeger/OpenTelemetry provides distributed tracing
- Instrument applications with standard metrics libraries
Next Steps
Now that you understand observability, you’re ready to explore Troubleshooting in the next post. We’ll learn systematic approaches to debugging Kubernetes issues.
Resources for Further Learning
Series Navigation:
- Previous: Y is for YAML: Mastering the Language
- Next: T is for Troubleshooting: Common Issues and Solutions
Complete Series: Kubernetes A-to-Z Series Overview