[16/24] T is for Troubleshooting: Common Issues and Solutions


This is Post #16 in the Kubernetes A-to-Z Series

Reading Order: Previous: Logging and Monitoring | Next: Quality Assurance

Series Progress: 16/24 complete | Difficulty: Intermediate | Time: 45 min | Part 5/6: Operations

Welcome to the sixteenth post in our Kubernetes A-to-Z Series! Now that you understand observability, let’s explore Troubleshooting: systematic approaches to diagnosing and resolving issues in Kubernetes clusters.

Troubleshooting Methodology

┌─────────────────────────────────────────────────┐
│  Systematic Troubleshooting                     │
│                                                 │
│  1. Identify ──► What's the symptom?            │
│       │                                         │
│       ▼                                         │
│  2. Gather  ──► Collect relevant information    │
│       │                                         │
│       ▼                                         │
│  3. Analyze ──► Find root cause                 │
│       │                                         │
│       ▼                                         │
│  4. Fix     ──► Apply solution                  │
│       │                                         │
│       ▼                                         │
│  5. Verify  ──► Confirm resolution              │
│       │                                         │
│       ▼                                         │
│  6. Document ─► Prevent recurrence              │
└─────────────────────────────────────────────────┘
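The loop above lends itself to light automation: given the symptom from step 1 (Identify), you can map it to the first data-gathering command of step 2 (Gather). The `suggest_next_step` helper below is a hypothetical sketch, not a real tool:

```shell
#!/usr/bin/env bash
# Hypothetical triage helper (illustrative only): maps a pod status
# from step 1 (Identify) to a sensible first command for step 2 (Gather).
suggest_next_step() {
  case "$1" in
    Pending)
      echo 'kubectl describe pod $POD        # look for scheduling events' ;;
    CrashLoopBackOff)
      echo 'kubectl logs $POD --previous     # logs from the crashed run' ;;
    ImagePullBackOff|ErrImagePull)
      echo 'kubectl describe pod $POD | grep -A 10 Events' ;;
    OOMKilled)
      echo 'kubectl top pod $POD             # compare usage to limits' ;;
    *)
      echo "kubectl get events --sort-by='.lastTimestamp'" ;;
  esac
}

suggest_next_step CrashLoopBackOff
# → kubectl logs $POD --previous     # logs from the crashed run
```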

Pod Troubleshooting

Pod Status Overview

Status                 Meaning
------                 -------
Pending                Waiting to be scheduled
Running                At least one container running
Succeeded              All containers completed successfully
Failed                 All containers terminated, at least one failed
Unknown                Pod state cannot be determined
CrashLoopBackOff       Container repeatedly crashing and restarting
ImagePullBackOff       Cannot pull container image; backing off between retries
ErrImagePull           Error pulling image
CreateContainerError   Container creation failed
OOMKilled              Container killed for exceeding its memory limit

Note: the first five are true pod phases; the rest are container-level reasons that kubectl surfaces in the STATUS column.
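When kubectl reports a status like CrashLoopBackOff, the underlying reason lives under `status.containerStatuses`. As a sketch, the jq queries below run against a minimal stand-in for `kubectl get pod -o json` output (the pod name and JSON contents are made up for illustration):

```shell
# Stand-in for: kubectl get pod my-pod -o json
cat <<'EOF' > /tmp/pod.json
{
  "metadata": {"name": "my-pod"},
  "status": {
    "containerStatuses": [
      {"name": "app",
       "state": {"waiting": {"reason": "CrashLoopBackOff"}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]
  }
}
EOF

# Waiting reason per container (why it is not running right now)
jq -r '.status.containerStatuses[]
       | "\(.name): \(.state.waiting.reason // "running")"' /tmp/pod.json
# → app: CrashLoopBackOff

# Last termination reason (why the previous run died)
jq -r '.status.containerStatuses[].lastState.terminated.reason // empty' /tmp/pod.json
# → OOMKilled
```

With a real cluster you would pipe `kubectl get pod my-pod -o json` straight into jq instead of the heredoc.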

Debugging Pending Pods

# Check pod status
kubectl get pods
kubectl describe pod pod-name

# Common causes:
# 1. Insufficient resources
kubectl describe node | grep -A 5 "Allocated resources"
kubectl top nodes

# 2. No matching node (nodeSelector/affinity)
kubectl get pod pod-name -o yaml | grep -A 10 nodeSelector

# 3. PVC not bound
kubectl get pvc
kubectl describe pvc pvc-name

# 4. Taints and tolerations
kubectl describe node node-name | grep Taints

Debugging CrashLoopBackOff

# Check pod events
kubectl describe pod pod-name

# Check container logs
kubectl logs pod-name
kubectl logs pod-name --previous  # Previous crashed instance
kubectl logs pod-name -c container-name  # Specific container

# Common causes:
# 1. Application error - check logs
# 2. Missing config/secrets
kubectl get configmap
kubectl get secrets

# 3. Liveness probe failing
kubectl get pod pod-name -o yaml | grep -A 10 livenessProbe

# 4. Resource limits too low
kubectl describe pod pod-name | grep -A 5 Limits

Debugging ImagePullBackOff

# Check pod events for image pull errors
kubectl describe pod pod-name | grep -A 10 Events

# Common causes:
# 1. Wrong image name/tag
kubectl get pod pod-name -o jsonpath='{.spec.containers[*].image}'

# 2. Private registry - missing imagePullSecrets
kubectl get pod pod-name -o yaml | grep imagePullSecrets

# 3. Create registry secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=password

# 4. Test image pull manually
kubectl run test --image=myimage:tag --restart=Never

Debugging OOMKilled

# Check if container was OOMKilled
kubectl describe pod pod-name | grep OOMKilled
kubectl get pod pod-name -o jsonpath='{.status.containerStatuses[*].lastState}'

# Check memory usage
kubectl top pod pod-name

# Check memory limits
kubectl get pod pod-name -o yaml | grep -A 5 resources

# Solution: Increase memory limits or optimize application

Service Troubleshooting

Service Not Reachable

# 1. Check service exists
kubectl get svc service-name
kubectl describe svc service-name

# 2. Check endpoints
kubectl get endpoints service-name
# Empty endpoints = no matching pods

# 3. Verify selector matches pod labels
kubectl get svc service-name -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels | grep app=webapp

# 4. Check pod readiness
kubectl get pods -l app=webapp
kubectl describe pod pod-name | grep -A 5 Conditions

# 5. Test service DNS
kubectl run test --rm -it --image=busybox -- nslookup service-name
kubectl run test --rm -it --image=busybox -- wget -qO- http://service-name

# 6. Check network policies
kubectl get networkpolicies -A
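Step 3 (selector vs. pod labels) is one of the most common causes and can be checked mechanically: a pod backs a service only if its labels contain every key/value pair in the service's selector. A hedged sketch using jq, run here against minimal stand-ins for the `-o json` output of a hypothetical `webapp` service and its pods:

```shell
# Stand-ins for: kubectl get svc webapp -o json / kubectl get pods -o json
cat <<'EOF' > /tmp/svc.json
{"spec": {"selector": {"app": "webapp", "tier": "frontend"}}}
EOF
cat <<'EOF' > /tmp/pods.json
{"items": [
  {"metadata": {"name": "webapp-1", "labels": {"app": "webapp", "tier": "frontend"}}},
  {"metadata": {"name": "webapp-2", "labels": {"app": "webapp", "tier": "backend"}}}
]}
EOF

# Print pods whose labels satisfy every key/value in the service selector
jq -r --slurpfile svc /tmp/svc.json '
  $svc[0].spec.selector as $sel
  | .items[]
  | select(.metadata.labels as $l | all($sel | to_entries[]; $l[.key] == .value))
  | .metadata.name' /tmp/pods.json
# → webapp-1   (webapp-2 fails the tier=frontend match, so it gets no traffic)
```

If this prints nothing, the service's endpoints will be empty, which matches the symptom in step 2.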

Port Connectivity Issues

# Check service ports
kubectl get svc service-name -o yaml | grep -A 10 ports

# Test from within cluster
kubectl run test --rm -it --image=nicolaka/netshoot -- bash
# Inside pod:
curl service-name:port
nc -zv service-name port

# Port forward for local testing
kubectl port-forward svc/service-name 8080:80
curl localhost:8080

Networking Troubleshooting

DNS Issues

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default
kubectl run dns-test --rm -it --image=busybox -- nslookup service-name.namespace.svc.cluster.local

# Check DNS config in pod
kubectl exec pod-name -- cat /etc/resolv.conf

# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml

Network Policy Issues

# List network policies
kubectl get networkpolicies -A
kubectl describe networkpolicy policy-name

# Test connectivity between pods
kubectl exec pod-a -- wget -qO- --timeout=2 http://pod-b-service

# Temporarily remove a network policy for testing -- save it first!
kubectl get networkpolicy policy-name -o yaml > policy.yaml
kubectl delete networkpolicy policy-name
# Test again, then restore: kubectl apply -f policy.yaml

Pod-to-Pod Communication

# Get pod IPs
kubectl get pods -o wide

# Test direct pod connectivity
kubectl exec pod-a -- ping pod-b-ip
kubectl exec pod-a -- curl pod-b-ip:port

# Check CNI plugin
kubectl get pods -n kube-system | grep -E "calico|flannel|weave|cilium"
kubectl logs -n kube-system -l k8s-app=calico-node

Storage Troubleshooting

PVC Pending

# Check PVC status
kubectl get pvc
kubectl describe pvc pvc-name

# Common causes:
# 1. No matching PV
kubectl get pv

# 2. StorageClass doesn't exist or no provisioner
kubectl get storageclass
kubectl describe storageclass sc-name

# 3. Insufficient storage
kubectl describe pv | grep Capacity

# 4. Access mode mismatch
kubectl get pvc pvc-name -o yaml | grep accessModes
kubectl get pv pv-name -o yaml | grep accessModes

Volume Mount Issues

# Check mount errors
kubectl describe pod pod-name | grep -A 10 Events

# Check if volume is mounted
kubectl exec pod-name -- df -h
kubectl exec pod-name -- ls -la /mount/path

# Check permissions
kubectl exec pod-name -- ls -la /mount/path
kubectl exec pod-name -- touch /mount/path/test

# For read-only filesystem errors
kubectl get pod pod-name -o yaml | grep -A 5 volumeMounts

Node Troubleshooting

Node Not Ready

# Check node status
kubectl get nodes
kubectl describe node node-name

# Check node conditions
kubectl get node node-name -o jsonpath='{.status.conditions}' | jq

# Common issues:
# 1. Kubelet not running
systemctl status kubelet
journalctl -u kubelet -f

# 2. Disk pressure
df -h
kubectl describe node node-name | grep -A 5 Conditions

# 3. Memory pressure
free -m
kubectl top node node-name

# 4. Network issues
kubectl get node node-name -o jsonpath='{.status.addresses}'

Node Resource Exhaustion

# Check node resources
kubectl top nodes
kubectl describe node node-name | grep -A 10 "Allocated resources"

# Identify resource-heavy pods
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Check for pods without limits
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.namespace + "/" + .metadata.name'

# Evict pods if needed (drain node)
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

Deployment Troubleshooting

Deployment Not Progressing

# Check deployment status
kubectl get deployment deployment-name
kubectl describe deployment deployment-name

# Check rollout status
kubectl rollout status deployment/deployment-name

# Check ReplicaSet
kubectl get rs -l app=webapp
kubectl describe rs rs-name

# Check events
kubectl get events --sort-by='.lastTimestamp' | grep deployment-name

# Common issues:
# 1. Image pull failures
# 2. Resource quota exceeded
kubectl describe quota -n namespace

# 3. Pod disruption budget blocking
kubectl get pdb

Failed Rolling Update

# Check rollout history
kubectl rollout history deployment/deployment-name

# Check current vs desired replicas
kubectl get deployment deployment-name -o jsonpath='{.status}'

# Rollback to previous version
kubectl rollout undo deployment/deployment-name

# Rollback to specific revision
kubectl rollout undo deployment/deployment-name --to-revision=2

# Pause/resume rollout
kubectl rollout pause deployment/deployment-name
kubectl rollout resume deployment/deployment-name

Cluster-Level Troubleshooting

Control Plane Issues

# Check control plane components
kubectl get pods -n kube-system

# Check API server
kubectl cluster-info
curl -k https://kubernetes-api:6443/healthz

# Check etcd (on kubeadm clusters, etcdctl typically needs cert flags)
kubectl get pods -n kube-system -l component=etcd
kubectl exec -n kube-system etcd-master -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check scheduler
kubectl get pods -n kube-system -l component=kube-scheduler
kubectl logs -n kube-system -l component=kube-scheduler

# Check controller-manager
kubectl get pods -n kube-system -l component=kube-controller-manager
kubectl logs -n kube-system -l component=kube-controller-manager

API Server Connectivity

# Test API server
kubectl cluster-info
kubectl get --raw /healthz
kubectl get --raw /readyz

# Check API server logs
kubectl logs -n kube-system kube-apiserver-master

# Check certificates
kubectl get csr
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout

Essential Debugging Commands

# Pod debugging
kubectl get pods -A -o wide
kubectl describe pod pod-name
kubectl logs pod-name [-c container] [--previous]
kubectl exec -it pod-name -- /bin/sh
kubectl debug pod-name -it --image=nicolaka/netshoot

# Service debugging
kubectl get svc,ep
kubectl describe svc service-name
kubectl port-forward svc/service-name local:remote

# Node debugging
kubectl get nodes -o wide
kubectl describe node node-name
kubectl top nodes

# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
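On a busy cluster the raw event stream is noisy; grouping warnings by reason often surfaces the dominant failure quickly. A sketch over a made-up sample of `kubectl get events -o json` output:

```shell
# Stand-in for: kubectl get events -o json
cat <<'EOF' > /tmp/events.json
{"items": [
  {"type": "Warning", "reason": "FailedScheduling", "message": "0/3 nodes available"},
  {"type": "Warning", "reason": "BackOff", "message": "restarting failed container"},
  {"type": "Warning", "reason": "BackOff", "message": "restarting failed container"},
  {"type": "Normal",  "reason": "Pulled",  "message": "image already present"}
]}
EOF

# Count Warning events per reason, most frequent first
jq -r '[.items[] | select(.type == "Warning") | .reason]
       | group_by(.) | map({reason: .[0], count: length})
       | sort_by(-.count)[]
       | "\(.count)\t\(.reason)"' /tmp/events.json
# → 2	BackOff
#   1	FailedScheduling
```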

# Resource usage
kubectl top pods -A
kubectl top nodes

# YAML output for inspection
kubectl get pod pod-name -o yaml
kubectl get deployment deployment-name -o yaml

Debugging Toolkit Pod

# debug-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-pod
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_TIME"]
# Deploy and use
kubectl apply -f debug-pod.yaml
kubectl exec -it debug-pod -- bash

# Inside debug pod:
curl service-name
nslookup service-name
ping pod-ip
tcpdump -i any port 80
netstat -tulpn

Common Quick Fixes

# Force delete stuck pod
kubectl delete pod pod-name --force --grace-period=0

# Restart deployment
kubectl rollout restart deployment/deployment-name

# Scale to zero and back
kubectl scale deployment deployment-name --replicas=0
kubectl scale deployment deployment-name --replicas=3

# Recreate pods
kubectl delete pods -l app=webapp

# Clear failed jobs
kubectl delete jobs --field-selector status.successful=0

Key Takeaways

  • Systematic approach: Identify, Gather, Analyze, Fix, Verify, Document
  • kubectl describe is your best friend for debugging
  • Events often reveal the root cause
  • Logs provide application-level insights
  • Network troubleshooting requires testing at multiple layers
  • Debug pods with networking tools help diagnose connectivity
  • Always check resources: CPU, memory, storage, quotas

Next Steps

Now that you understand troubleshooting, you’re ready to explore Quality Assurance in the next post. We’ll learn testing strategies and chaos engineering for Kubernetes applications.

Series Navigation:

Complete Series: Kubernetes A-to-Z Series Overview