Welcome to the fourteenth post in our Kubernetes A-to-Z Series! Now that you understand observability, let’s explore Troubleshooting: systematic approaches to diagnosing and resolving issues in Kubernetes clusters.
Troubleshooting Methodology
┌─────────────────────────────────────────────────┐
│            Systematic Troubleshooting           │
│                                                 │
│  1. Identify ──► What's the symptom?            │
│      │                                          │
│      ▼                                          │
│  2. Gather ──► Collect relevant information     │
│      │                                          │
│      ▼                                          │
│  3. Analyze ──► Find root cause                 │
│      │                                          │
│      ▼                                          │
│  4. Fix ──► Apply solution                      │
│      │                                          │
│      ▼                                          │
│  5. Verify ──► Confirm resolution               │
│      │                                          │
│      ▼                                          │
│  6. Document ──► Prevent recurrence             │
└─────────────────────────────────────────────────┘
Pod Troubleshooting
Pod Status Overview
| Status / Reason | Meaning |
|---|---|
| Pending | Accepted but not yet scheduled (or still pulling images) |
| Running | Bound to a node; at least one container running |
| Succeeded | All containers completed successfully |
| Failed | All containers terminated, at least one in failure |
| Unknown | Pod state cannot be determined (often a node communication issue) |
| CrashLoopBackOff | Container repeatedly crashing; kubelet is backing off restarts |
| ImagePullBackOff | Image pull failed; kubelet is backing off retries |
| ErrImagePull | Error pulling image |
| CreateContainerError | Container creation failed |
| OOMKilled | Container killed for exceeding its memory limit |

Note: the first five are Pod phases; the rest are container waiting or termination reasons that kubectl surfaces in the STATUS column.
Debugging Pending Pods
# Check pod status
kubectl get pods
kubectl describe pod pod-name
# Common causes:
# 1. Insufficient resources
kubectl describe node | grep -A 5 "Allocated resources"
kubectl top nodes
# 2. No matching node (nodeSelector/affinity)
kubectl get pod pod-name -o yaml | grep -A 10 nodeSelector
# 3. PVC not bound
kubectl get pvc
kubectl describe pvc pvc-name
# 4. Taints and tolerations
kubectl describe node node-name | grep Taints
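If a taint is what's blocking scheduling and the pod is meant to run on that node, add a matching toleration to the pod spec. A minimal sketch, assuming a hypothetical dedicated=gpu:NoSchedule taint:

```yaml
# Hypothetical example: tolerate a dedicated=gpu:NoSchedule taint
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```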
Debugging CrashLoopBackOff
# Check pod events
kubectl describe pod pod-name
# Check container logs
kubectl logs pod-name
kubectl logs pod-name --previous # Previous crashed instance
kubectl logs pod-name -c container-name # Specific container
# Common causes:
# 1. Application error - check logs
# 2. Missing config/secrets
kubectl get configmap
kubectl get secrets
# 3. Liveness probe failing
kubectl get pod pod-name -o yaml | grep -A 10 livenessProbe
# 4. Resource limits too low
kubectl describe pod pod-name | grep -A 5 Limits
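When the crash loop is caused by a failing liveness probe on a slow-starting app, giving the probe more headroom often breaks the loop. A sketch with illustrative values (path, port, and timings are assumptions to tune for your app):

```yaml
# Illustrative probe settings; tune to your app's real startup time
livenessProbe:
  httpGet:
    path: /healthz            # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 30     # give the app time to start before probing
  periodSeconds: 10
  failureThreshold: 3         # tolerate transient failures before restarting
```

For containers with genuinely long startup, a dedicated startupProbe is also available so the liveness probe only takes over once the app is up.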
Debugging ImagePullBackOff
# Check pod events for image pull errors
kubectl describe pod pod-name | grep -A 10 Events
# Common causes:
# 1. Wrong image name/tag
kubectl get pod pod-name -o jsonpath='{.spec.containers[*].image}'
# 2. Private registry - missing imagePullSecrets
kubectl get pod pod-name -o yaml | grep imagePullSecrets
# 3. Create registry secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=password
# 4. Test image pull manually
kubectl run test --image=myimage:tag --restart=Never
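Once the registry secret exists, the pod spec must actually reference it, or pulls will still fail. A sketch reusing the regcred secret from above (the image name is a placeholder):

```yaml
spec:
  imagePullSecrets:
  - name: regcred                             # secret created above
  containers:
  - name: app
    image: registry.example.com/myimage:tag   # placeholder image
```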
Debugging OOMKilled
# Check if container was OOMKilled
kubectl describe pod pod-name | grep OOMKilled
kubectl get pod pod-name -o jsonpath='{.status.containerStatuses[*].lastState}'
# Check memory usage
kubectl top pod pod-name
# Check memory limits
kubectl get pod pod-name -o yaml | grep -A 5 resources
# Solution: Increase memory limits or optimize application
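For the limits fix, raise the memory limit above the peak usage you observed with kubectl top pod. Illustrative sizes only:

```yaml
# Illustrative sizes; base them on observed usage, not guesses
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # set above the peak seen before the OOMKill
```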
Service Troubleshooting
Service Not Reachable
# 1. Check service exists
kubectl get svc service-name
kubectl describe svc service-name
# 2. Check endpoints
kubectl get endpoints service-name
# Empty endpoints = no Ready pods match the selector
# 3. Verify selector matches pod labels
kubectl get svc service-name -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels | grep app=webapp
# 4. Check pod readiness
kubectl get pods -l app=webapp
kubectl describe pod pod-name | grep -A 5 Conditions
# 5. Test service DNS
kubectl run test --rm -it --image=busybox -- nslookup service-name
kubectl run test --rm -it --image=busybox -- wget -qO- http://service-name
# 6. Check network policies
kubectl get networkpolicies -A
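The selector/label pairing in steps 2-3 is the most common culprit, so it helps to see both sides together. A minimal sketch (names, image, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp            # hypothetical name
spec:
  selector:
    app: webapp           # must match the pod label below exactly
  ports:
  - port: 80
    targetPort: 8080      # must match the container's listening port
---
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
  labels:
    app: webapp           # matched by the Service selector above
spec:
  containers:
  - name: webapp
    image: myapp:1.0      # placeholder image listening on 8080
    ports:
    - containerPort: 8080
```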
Port Connectivity Issues
# Check service ports
kubectl get svc service-name -o yaml | grep -A 10 ports
# Test from within cluster
kubectl run test --rm -it --image=nicolaka/netshoot -- bash
# Inside pod:
curl service-name:port
nc -zv service-name port
# Port forward for local testing
kubectl port-forward svc/service-name 8080:80
curl localhost:8080
Networking Troubleshooting
DNS Issues
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default
kubectl run dns-test --rm -it --image=busybox -- nslookup service-name.namespace.svc.cluster.local
# Check DNS config in pod
kubectl exec pod-name -- cat /etc/resolv.conf
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
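For comparison, a healthy pod's resolv.conf usually looks like this; the nameserver is the cluster DNS Service IP (10.96.0.10 is a common kubeadm default, yours may differ) and the first search domain reflects the pod's namespace:

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```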
Network Policy Issues
# List network policies
kubectl get networkpolicies -A
kubectl describe networkpolicy policy-name
# Test connectivity between pods
kubectl exec pod-a -- wget -qO- --timeout=2 http://pod-b-service
# Temporarily remove a network policy for testing (avoid in production;
# keep the manifest so you can re-apply it)
kubectl delete networkpolicy policy-name
# Test again, then restore the policy with kubectl apply
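A safer alternative to deleting the policy is applying a temporary allow-all policy alongside it; NetworkPolicies are additive, so the allow-all lifts the restrictions without losing the original object. A sketch for a hypothetical namespace:

```yaml
# Temporary allow-all policy for debugging; delete when finished
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: my-namespace   # hypothetical namespace
spec:
  podSelector: {}           # all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}                      # allow all ingress
  egress:
  - {}                      # allow all egress
```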
Pod-to-Pod Communication
# Get pod IPs
kubectl get pods -o wide
# Test direct pod connectivity (some CNIs or policies drop ICMP, so a failed
# ping alone isn't conclusive; prefer curl against a known port)
kubectl exec pod-a -- ping pod-b-ip
kubectl exec pod-a -- curl pod-b-ip:port
# Check CNI plugin
kubectl get pods -n kube-system | grep -E "calico|flannel|weave|cilium"
kubectl logs -n kube-system -l k8s-app=calico-node
Storage Troubleshooting
PVC Pending
# Check PVC status
kubectl get pvc
kubectl describe pvc pvc-name
# Common causes:
# 1. No matching PV
kubectl get pv
# 2. StorageClass doesn't exist or no provisioner
kubectl get storageclass
kubectl describe storageclass sc-name
# 3. Insufficient storage
kubectl describe pv | grep Capacity
# 4. Access mode mismatch
kubectl get pvc pvc-name -o yaml | grep accessModes
kubectl get pv pv-name -o yaml | grep accessModes
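When chasing a class or access-mode mismatch, it helps to see every field that has to line up in one place. A minimal PVC sketch (name, class, and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc              # hypothetical name
spec:
  storageClassName: standard  # must exist: kubectl get storageclass
  accessModes:
  - ReadWriteOnce             # must be supported by the PV/provisioner
  resources:
    requests:
      storage: 10Gi           # must fit within the PV's capacity
```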
Volume Mount Issues
# Check mount errors
kubectl describe pod pod-name | grep -A 10 Events
# Check if volume is mounted
kubectl exec pod-name -- df -h
kubectl exec pod-name -- ls -la /mount/path
# Check write permissions
kubectl exec pod-name -- id
kubectl exec pod-name -- touch /mount/path/test
# For read-only filesystem errors
kubectl get pod pod-name -o yaml | grep -A 5 volumeMounts
Node Troubleshooting
Node Not Ready
# Check node status
kubectl get nodes
kubectl describe node node-name
# Check node conditions
kubectl get node node-name -o jsonpath='{.status.conditions}' | jq
# Common issues (run systemctl, journalctl, df, and free on the node
# itself, e.g. over SSH):
# 1. Kubelet not running
systemctl status kubelet
journalctl -u kubelet -f
# 2. Disk pressure
df -h
kubectl describe node node-name | grep -A 5 Conditions
# 3. Memory pressure
free -m
kubectl top node node-name
# 4. Network issues
kubectl get node node-name -o jsonpath='{.status.addresses}'
Node Resource Exhaustion
# Check node resources
kubectl top nodes
kubectl describe node node-name | grep -A 10 "Allocated resources"
# Identify resource-heavy pods
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
# Check for pods without limits (one line per pod)
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.name'
# Evict pods if needed (drain node)
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
Deployment Troubleshooting
Deployment Not Progressing
# Check deployment status
kubectl get deployment deployment-name
kubectl describe deployment deployment-name
# Check rollout status
kubectl rollout status deployment/deployment-name
# Check ReplicaSet
kubectl get rs -l app=webapp
kubectl describe rs rs-name
# Check events
kubectl get events --sort-by='.lastTimestamp' | grep deployment-name
# Common issues:
# 1. Image pull failures
# 2. Resource quota exceeded
kubectl describe quota -n namespace
# 3. Pod disruption budget blocking
kubectl get pdb
Failed Rolling Update
# Check rollout history
kubectl rollout history deployment/deployment-name
# Check current vs desired replicas
kubectl get deployment deployment-name -o jsonpath='{.status}'
# Rollback to previous version
kubectl rollout undo deployment/deployment-name
# Rollback to specific revision
kubectl rollout undo deployment/deployment-name --to-revision=2
# Pause/resume rollout
kubectl rollout pause deployment/deployment-name
kubectl rollout resume deployment/deployment-name
Cluster-Level Troubleshooting
Control Plane Issues
# Check control plane components
kubectl get pods -n kube-system
# Check API server
kubectl cluster-info
curl -k https://<api-server-host>:6443/healthz  # substitute your API server address
# Check etcd (pod name is etcd-<node-name>; cert paths assume a kubeadm cluster)
kubectl get pods -n kube-system -l component=etcd
kubectl exec -it -n kube-system etcd-master -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Check scheduler
kubectl get pods -n kube-system -l component=kube-scheduler
kubectl logs -n kube-system -l component=kube-scheduler
# Check controller-manager
kubectl get pods -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-controller-manager
API Server Connectivity
# Test API server
kubectl cluster-info
kubectl get --raw /healthz
kubectl get --raw /readyz
# Check API server logs (pod name is kube-apiserver-<node-name>)
kubectl logs -n kube-system kube-apiserver-master
# Check certificates
kubectl get csr
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
Essential Debugging Commands
# Pod debugging
kubectl get pods -A -o wide
kubectl describe pod pod-name
kubectl logs pod-name [-c container] [--previous]
kubectl exec -it pod-name -- /bin/sh
kubectl debug pod-name -it --image=nicolaka/netshoot
# Service debugging
kubectl get svc,ep
kubectl describe svc service-name
kubectl port-forward svc/service-name local:remote
# Node debugging
kubectl get nodes -o wide
kubectl describe node node-name
kubectl top nodes
# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
# Resource usage
kubectl top pods -A
kubectl top nodes
# YAML output for inspection
kubectl get pod pod-name -o yaml
kubectl get deployment deployment-name -o yaml
Debugging Toolkit Pod
# debug-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_TIME"]
# Deploy and use
kubectl apply -f debug-pod.yaml
kubectl exec -it debug-pod -- bash
# Inside debug pod:
curl service-name
nslookup service-name
ping pod-ip
tcpdump -i any port 80
netstat -tulpn
Common Quick Fixes
# Force delete stuck pod (use with care: the container may keep running on the node)
kubectl delete pod pod-name --force --grace-period=0
# Restart deployment
kubectl rollout restart deployment/deployment-name
# Scale to zero and back
kubectl scale deployment deployment-name --replicas=0
kubectl scale deployment deployment-name --replicas=3
# Recreate pods
kubectl delete pods -l app=webapp
# Delete jobs with no successful completions (note: also matches still-running jobs)
kubectl delete jobs --field-selector status.successful=0
Key Takeaways
- Systematic approach: Identify, Gather, Analyze, Fix, Verify, Document
- kubectl describe is your best friend for debugging
- Events often reveal the root cause
- Logs provide application-level insights
- Network troubleshooting requires testing at multiple layers
- Debug pods with networking tools help diagnose connectivity
- Always check resources: CPU, memory, storage, quotas
Next Steps
Now that you understand troubleshooting, you’re ready to explore Quality Assurance in the next post. We’ll learn testing strategies and chaos engineering for Kubernetes applications.