Welcome to the thirteenth post in our Kubernetes A-to-Z Series! Now that you understand Operators, let’s explore Logging and Monitoring, the foundation of observability in Kubernetes. Knowing what’s happening inside your cluster is essential for running reliable applications.
The Three Pillars of Observability
┌─────────────────────────────────────────────────┐
│                  Observability                  │
│                                                 │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│   │  Logs   │   │ Metrics │   │ Traces  │       │
│   │         │   │         │   │         │       │
│   │  What   │   │  How    │   │  Where  │       │
│   │happened │   │  much   │   │ requests│       │
│   │         │   │         │   │  flow   │       │
│   └─────────┘   └─────────┘   └─────────┘       │
│                                                 │
│      Together: Complete system visibility       │
└─────────────────────────────────────────────────┘
Why Observability Matters
- Troubleshooting: Quickly identify and resolve issues
- Performance: Detect bottlenecks and optimize
- Reliability: Proactive alerting before failures
- Capacity Planning: Data-driven scaling decisions
- Security: Detect anomalies and threats
Logging in Kubernetes
Logging Architecture
┌─────────────────────────────────────────────────┐
│               Kubernetes Logging                │
│                                                 │
│   ┌─────────┐                                   │
│   │   Pod   │──► stdout/stderr                  │
│   └─────────┘    │                              │
│                  ▼                              │
│   ┌─────────────────────────────────────┐       │
│   │           Node (kubelet)            │       │
│   │      /var/log/containers/*.log      │       │
│   └──────────────┬──────────────────────┘       │
│                  │                              │
│                  ▼                              │
│   ┌─────────────────────────────────────┐       │
│   │ Log Collector (Fluentd/Fluent Bit)  │       │
│   └──────────────┬──────────────────────┘       │
│                  │                              │
│                  ▼                              │
│   ┌─────────────────────────────────────┐       │
│   │    Storage (Elasticsearch/Loki)     │       │
│   └─────────────────────────────────────┘       │
└─────────────────────────────────────────────────┘
Basic Log Commands
# View pod logs
kubectl logs pod-name
kubectl logs pod-name -c container-name   # specific container in the pod
kubectl logs pod-name --previous          # previous container instance (after a restart)
kubectl logs pod-name -f                  # follow logs
kubectl logs pod-name --tail=100          # last 100 lines
kubectl logs pod-name --since=1h          # last hour

# View logs from multiple pods by label
kubectl logs -l app=webapp --all-containers
kubectl logs -l app=webapp -f

# View logs from a deployment (kubectl picks one of its pods)
kubectl logs deployment/webapp
Fluent Bit DaemonSet
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5

    [FILTER]
        Name            kubernetes
        Match           kube.*
        Kube_URL        https://kubernetes.default.svc:443
        Kube_CA_File    /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log       On

    [OUTPUT]
        Name   es
        Match  *
        Host   elasticsearch.logging.svc
        Port   9200
        Index  kubernetes
        Type   _doc
  parsers.conf: |
    [PARSER]
        Name         docker
        Format       json
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L
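The DaemonSet references a fluent-bit ServiceAccount, which the kubernetes filter uses to look up pod metadata from the API server. That manifest isn't shown above; a minimal RBAC sketch (names assumed to match the DaemonSet) looks like this:

# fluent-bit-rbac.yaml (assumed names, matching the DaemonSet above)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]   # metadata the kubernetes filter enriches logs with
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging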
Loki for Log Aggregation
# loki-stack.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
    schema_config:
      configs:
      - from: 2020-01-01
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    storage_config:
      boltdb:
        directory: /loki/index
      filesystem:
        directory: /loki/chunks
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki:2.8.0
        ports:
        - containerPort: 3100
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /loki
      volumes:
      - name: config
        configMap:
          name: loki-config
      - name: storage
        persistentVolumeClaim:
          claimName: loki-pvc
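The Deployment mounts a loki-pvc claim that isn't defined above; a minimal sketch (the size is an assumption, adjust for your retention needs):

# loki-pvc.yaml (10Gi is an assumed size)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-pvc
  namespace: logging
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Loki only stores and indexes logs; you still need an agent such as Promtail (or a Fluent Bit Loki output) shipping logs into it. Once data is flowing, you can query it from Grafana with LogQL, for example {namespace="production"} |= "error".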
Metrics with Prometheus
Prometheus Architecture
┌─────────────────────────────────────────────────┐
│                Prometheus Stack                 │
│                                                 │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│   │ App Pod │   │ App Pod │   │ App Pod │       │
│   │ /metrics│   │ /metrics│   │ /metrics│       │
│   └────┬────┘   └────┬────┘   └────┬────┘       │
│        │             │             │            │
│        └─────────────┼─────────────┘            │
│                      │                          │
│                      ▼                          │
│   ┌─────────────────────────────────────┐       │
│   │         Prometheus Server           │       │
│   │  - Scrapes metrics                  │       │
│   │  - Stores time-series data          │       │
│   │  - Evaluates alert rules            │       │
│   └──────────────┬──────────────────────┘       │
│                  │                              │
│         ┌────────┴────────┐                     │
│         ▼                 ▼                     │
│    ┌─────────┐       ┌─────────┐                │
│    │ Grafana │       │Alertmgr │                │
│    └─────────┘       └─────────┘                │
└─────────────────────────────────────────────────┘
Installing Prometheus Stack
# Using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123
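Once the release is installed, verify that the stack's pods and CRD-based resources came up:

# check the stack is running
kubectl get pods -n monitoring
kubectl get prometheuses,alertmanagers,servicemonitors -n monitoring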
ServiceMonitor for Application Metrics
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
    - production
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scrapeTimeout: 10s
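Note that the selector matches Service labels (not pod labels), and port: metrics refers to a named port on that Service. A sketch of a Service this monitor would pick up (names and port number are assumptions):

# webapp-service.yaml (assumed to pair with the ServiceMonitor above)
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  labels:
    app: webapp        # matched by spec.selector.matchLabels
spec:
  selector:
    app: webapp
  ports:
  - name: metrics      # matched by endpoints[0].port
    port: 8080
    targetPort: 8080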
Prometheus Rules and Alerts
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: webapp.rules
    rules:
    # Recording rule
    - record: webapp:request_rate:5m
      expr: rate(http_requests_total{app="webapp"}[5m])
    # Alert rules
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{app="webapp",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{app="webapp"}[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        # sum() aggregates away all labels, so the annotation can't
        # reference $labels here
        summary: "High error rate for webapp"
        description: "Error rate is {{ $value | humanizePercentage }}"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) by (le)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    - alert: HighMemoryUsage
      # working_set is what the OOM killer acts on; the != 0 guard skips
      # containers that have no memory limit set (limit of 0 would otherwise
      # divide to +Inf and fire falsely)
      expr: |
        container_memory_working_set_bytes{container!=""}
        /
        (container_spec_memory_limit_bytes{container!=""} != 0) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on {{ $labels.pod }}"
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
      - match:
          severity: warning
        receiver: 'slack'
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        send_resolved: true
    - name: 'slack'
      slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: '<pagerduty-key>'
        send_resolved: true
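It's worth validating the embedded configuration before applying the Secret. A small sketch, assuming you have amtool (ships with Alertmanager releases) and yq installed locally:

# extract the embedded config and validate it (yq v4 syntax assumed)
yq '.stringData["alertmanager.yaml"]' alertmanager-config.yaml > /tmp/alertmanager.yaml
amtool check-config /tmp/alertmanager.yaml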
Application Instrumentation
Go Application
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "path", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}

// instrument records request count and duration for a handler.
// The status label is simplified to 200 here; real middleware would
// capture the actual code by wrapping the ResponseWriter.
func instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(http.StatusOK)).Inc()
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.HandleFunc("/", instrument(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	}))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
Node.js Application
const client = require('prom-client');
const express = require('express');
const app = express();

// Create metrics (registered to the default registry automatically)
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration once the response has finished
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, path: req.path }, duration);
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);
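Either app then serves plain-text metrics in the Prometheus exposition format at /metrics; the sample values below are illustrative:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status="200"} 42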
Distributed Tracing with Jaeger
Jaeger Deployment
# jaeger.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.47
        ports:
        - containerPort: 16686   # UI
        - containerPort: 14268   # HTTP collector
        - containerPort: 14250   # gRPC collector (used by the OTel Collector below)
        - containerPort: 6831    # UDP agent
          protocol: UDP
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
  - name: collector
    port: 14268
  - name: grpc
    port: 14250
  - name: agent
    port: 6831
    protocol: UDP
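To open the Jaeger UI locally:

kubectl port-forward -n monitoring svc/jaeger 16686:16686
# then browse http://localhost:16686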
OpenTelemetry Instrumentation
# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
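One caveat: the dedicated jaeger exporter has been removed from recent OpenTelemetry Collector releases. Jaeger accepts OTLP natively, so on newer Collector versions you would export OTLP instead; a sketch, assuming Jaeger's OTLP gRPC port is enabled:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # Jaeger's native OTLP gRPC port
    tls:
      insecure: true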
Grafana Dashboards
Dashboard ConfigMap
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "dashboard": {
        "title": "Web Application Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Requests/s"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m]))",
                "legendFormat": "Error Rate"
              }
            ]
          },
          {
            "title": "P95 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m])) by (le))",
                "legendFormat": "P95"
              }
            ]
          }
        ]
      }
    }
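The grafana_dashboard: "1" label is what makes this work: the Grafana sidecar bundled with kube-prometheus-stack watches for ConfigMaps carrying that label and loads their JSON automatically, so applying the manifest is enough to provision the dashboard:

kubectl apply -f grafana-dashboard.yaml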
Troubleshooting Commands
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets
# Check Alertmanager
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# View Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Check metrics endpoint
kubectl exec -it webapp-pod -- curl localhost:8080/metrics
# View logs
kubectl logs -n monitoring -l app=prometheus
kubectl logs -n monitoring -l app=grafana
Key Takeaways
- Three pillars: Logs, Metrics, and Traces provide complete observability
- Prometheus scrapes and stores time-series metrics
- Grafana visualizes metrics with dashboards
- Alertmanager routes alerts to appropriate channels
- Loki/ELK aggregates and searches logs
- Jaeger/OpenTelemetry provides distributed tracing
- Instrument applications with standard metrics libraries
Next Steps
Now that you understand observability, you’re ready to explore Troubleshooting in the next post. We’ll learn systematic approaches to debugging Kubernetes issues.