[17/24] Q is for Quality Assurance: Testing in Kubernetes
This is Post #17 in the Kubernetes A-to-Z Series
Reading Order: Previous: Troubleshooting | Next: Upgrades
Series Progress: 17/24 complete | Difficulty: Advanced | Time: 30-35 min | Part 5/6: Operations
Welcome to the seventeenth post in our Kubernetes A-to-Z Series! Now that you understand troubleshooting, let’s explore Quality Assurance - testing strategies and chaos engineering practices that ensure your Kubernetes applications are reliable and resilient.
Testing Pyramid for Kubernetes
              /\
             /  \    E2E Tests
            /────\   (Few, Slow, Expensive)
           /      \
          /        \   Integration Tests
         /──────────\  (Medium)
        /            \
       /              \   Unit Tests
      /────────────────\  (Many, Fast, Cheap)
Unit Testing Kubernetes Manifests
Validating YAML with kubeconform
# Install kubeconform
brew install kubeconform
# Validate manifests
kubeconform -summary deployment.yaml
kubeconform -strict -kubernetes-version 1.28.0 ./manifests/
# Validate Helm templates
helm template mychart | kubeconform -summary
Policy Testing with Conftest
# Install conftest
brew install conftest
# Create policy
mkdir policy
# policy/deployment.rego
package main

deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.spec.securityContext.runAsNonRoot
  msg = "Containers must run as non-root"
}

deny[msg] {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  not container.resources.limits
  msg = sprintf("Container %s must have resource limits", [container.name])
}

deny[msg] {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  container.securityContext.privileged
  msg = sprintf("Container %s must not be privileged", [container.name])
}
# Run policy tests
conftest test deployment.yaml
conftest test --policy policy/ ./manifests/
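For reference, here is a minimal Deployment that passes all three rules above (the image name and labels are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      securityContext:
        runAsNonRoot: true      # satisfies the non-root rule
      containers:
        - name: webapp
          image: myapp:v1.0.0   # placeholder image
          securityContext:
            privileged: false   # satisfies the no-privileged rule
          resources:
            limits:             # satisfies the resource-limits rule
              cpu: "500m"
              memory: "256Mi"
```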
Helm Chart Testing
# charts/webapp/tests/deployment_test.yaml
suite: deployment tests
templates:
  - deployment.yaml
tests:
  - it: should set correct replicas
    set:
      replicaCount: 5
    asserts:
      - equal:
          path: spec.replicas
          value: 5
  - it: should set resource limits
    asserts:
      - isNotNull:
          path: spec.template.spec.containers[0].resources.limits
  - it: should use correct image
    set:
      image.repository: myapp
      image.tag: v1.0.0
    asserts:
      - equal:
          path: spec.template.spec.containers[0].image
          value: myapp:v1.0.0
# Install helm-unittest plugin
helm plugin install https://github.com/helm-unittest/helm-unittest
# Run tests
helm unittest ./charts/webapp
Integration Testing
Testing with Kind
# Create test cluster
kind create cluster --name test-cluster
# Deploy application
kubectl apply -f manifests/
# Run integration tests
go test ./integration/... -v
# Cleanup
kind delete cluster --name test-cluster
Integration Test Example
// integration/deployment_test.go
package integration

import (
	"context"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func TestDeploymentCreation(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Wait for the deployment to report all replicas ready
	for {
		deployment, err := clientset.AppsV1().Deployments("default").Get(ctx, "webapp", metav1.GetOptions{})
		if err != nil {
			t.Fatal(err)
		}
		if deployment.Status.ReadyReplicas == *deployment.Spec.Replicas {
			break
		}
		select {
		case <-ctx.Done():
			t.Fatal("Timeout waiting for deployment")
		case <-time.After(5 * time.Second):
		}
	}
}
End-to-End Testing
E2E with Cypress/Playwright
# e2e-test-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: e2e-tests
spec:
  template:
    spec:
      containers:
        - name: e2e
          image: cypress/included:12.0.0
          env:
            - name: CYPRESS_BASE_URL
              value: "http://webapp-service"
          command: ["cypress", "run"]
          volumeMounts:
            - name: tests
              mountPath: /e2e
      volumes:
        - name: tests
          configMap:
            name: e2e-tests
      restartPolicy: Never
  backoffLimit: 2
Smoke Tests
# smoke-test.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-test
spec:
  template:
    spec:
      containers:
        - name: smoke-test
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              set -e
              echo "Testing health endpoint..."
              curl -f http://webapp-service/health
              echo "Testing API endpoint..."
              curl -f http://webapp-service/api/status
              echo "All smoke tests passed!"
      restartPolicy: Never
  backoffLimit: 3
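To run this ad hoc (outside CI), apply the Job, wait for completion, and read its logs; the commands below assume the manifest is saved as smoke-test.yaml:

```shell
kubectl apply -f smoke-test.yaml
kubectl wait --for=condition=complete job/smoke-test --timeout=120s
kubectl logs job/smoke-test
kubectl delete job smoke-test   # Job specs are immutable, so delete before re-running
```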
Chaos Engineering
Chaos Mesh
# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-testing \
--create-namespace
Pod Chaos Experiments
# pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: webapp
  scheduler:            # Chaos Mesh 1.x only; 2.x replaces this with the Schedule CRD
    cron: "@every 2h"
---
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  duration: "30s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: webapp
Network Chaos
# network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: webapp
  delay:
    latency: "200ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
---
# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-test
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: frontend
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: backend
  duration: "2m"
Stress Testing
# stress-cpu.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: webapp
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "5m"
---
# stress-memory.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-test
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: webapp
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "5m"
LitmusChaos
# litmus-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: webapp-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=webapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
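Note that this ChaosEngine assumes Litmus is installed and that the litmus-admin ServiceAccount exists with permission to delete pods. One way to install Litmus is via its Helm chart (repo URL and chart name per the Litmus docs; verify against the version you target):

```shell
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace
```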
CI/CD Integration
GitHub Actions Pipeline
# .github/workflows/k8s-test.yaml
name: Kubernetes Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate manifests
        uses: instrumenta/kubeval-action@master
        with:
          files: ./manifests
      - name: Policy tests
        run: |
          conftest test ./manifests --policy ./policy

  integration:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - name: Create Kind cluster
        uses: helm/kind-action@v1
      - name: Deploy application
        run: |
          kubectl apply -f ./manifests
          kubectl wait --for=condition=ready pod -l app=webapp --timeout=120s
      - name: Run integration tests
        run: |
          kubectl apply -f ./tests/smoke-test.yaml
          kubectl wait --for=condition=complete job/smoke-test --timeout=60s
      - name: Run E2E tests
        run: |
          npm run test:e2e

  chaos:
    runs-on: ubuntu-latest
    needs: integration
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Create Kind cluster
        uses: helm/kind-action@v1
      - name: Install Chaos Mesh
        run: |
          helm repo add chaos-mesh https://charts.chaos-mesh.org
          helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing --create-namespace
      - name: Deploy application
        run: kubectl apply -f ./manifests
      - name: Run chaos experiments
        run: |
          kubectl apply -f ./chaos/pod-failure.yaml
          sleep 120
          kubectl get pods -l app=webapp
GitLab CI Pipeline
# .gitlab-ci.yml
stages:
  - validate
  - test
  - chaos

validate-manifests:
  stage: validate
  image: instrumenta/kubeval
  script:
    - kubeval --strict ./manifests/*.yaml

policy-check:
  stage: validate
  image: openpolicyagent/conftest
  script:
    - conftest test ./manifests --policy ./policy

integration-tests:
  stage: test
  image: bitnami/kubectl
  services:
    - name: docker:dind
  before_script:
    - kind create cluster
  script:
    - kubectl apply -f ./manifests
    - kubectl wait --for=condition=ready pod -l app=webapp --timeout=120s
    - kubectl apply -f ./tests/smoke-test.yaml
    - kubectl wait --for=condition=complete job/smoke-test --timeout=60s

chaos-tests:
  stage: chaos
  only:
    - main
  script:
    - kubectl apply -f ./chaos/experiments/
    - sleep 300
    - ./scripts/verify-recovery.sh
Load Testing
k6 Load Tests
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 200 },
    { duration: '5m', target: 200 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('http://webapp-service/api/data');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
# k6-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: load-test
spec:
  template:
    spec:
      containers:
        - name: k6
          image: grafana/k6:latest
          command: ["k6", "run", "/scripts/load-test.js"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: k6-scripts
      restartPolicy: Never
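The Job mounts the script from a ConfigMap named k6-scripts; assuming the script is saved locally as load-test.js, it can be created with:

```shell
kubectl create configmap k6-scripts --from-file=load-test.js
```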
Security Testing
Trivy Container Scanning
# trivy-scan-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: trivy-scan
spec:
  template:
    spec:
      containers:
        - name: trivy
          image: aquasec/trivy:latest
          command:
            - trivy
            - image
            - --severity
            - HIGH,CRITICAL
            - --exit-code
            - "1"
            - myapp:latest
      restartPolicy: Never
Kube-bench Security Audit
# Run kube-bench
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# Check results
kubectl logs -l app=kube-bench
Key Takeaways
- Test pyramid: Unit tests at base, E2E at top
- Validate manifests with kubeconform and conftest
- Integration tests use Kind for ephemeral clusters
- Chaos engineering validates resilience with Chaos Mesh or Litmus
- CI/CD integration automates testing on every change
- Load testing with k6 ensures performance under stress
- Security scanning catches vulnerabilities early
Next Steps
Now that you understand QA practices, you’re ready to explore cluster upgrades and lifecycle management in the next post.
Resources for Further Learning
Series Navigation:
- Previous: T is for Troubleshooting
- Next: U is for Upgrades: Cluster Lifecycle
Complete Series: Kubernetes A-to-Z Series Overview