Troubleshooting
Rapid overview
- Kubernetes Troubleshooting
- Pod Debugging Workflow
- 1. Check Pod Status
- 2. Check Logs
- 3. Execute Commands in Container
- 4. Debug with Ephemeral Containers
- Common Issues and Solutions
- ImagePullBackOff / ErrImagePull
- CrashLoopBackOff
- Pending Pods
- Service Not Accessible
- DNS Issues
- Node Troubleshooting
- Node Not Ready
- Resource Pressure
- Network Troubleshooting
- Connectivity Testing
- Check NetworkPolicies
- Ingress Issues
- Storage Troubleshooting
- PVC Stuck in Pending
- Volume Mount Failures
- Debugging Tools
- kubectl debug
- Useful One-Liners
- Resource Inspection
- Monitoring Queries
- Prometheus Queries for Troubleshooting
- Interview Scenarios
- Scenario 1: Pod keeps crashing
- Scenario 2: Service returns 503
- Scenario 3: Deployment rollout stuck
Kubernetes Troubleshooting
Pod Debugging Workflow
1. Check Pod Status
kubectl get pods -o wide
kubectl describe pod <pod-name>
Common Pod States:
| State | Cause | Solution |
|---|---|---|
| Pending | No node available, resource constraints | Check node capacity, resource requests |
| ImagePullBackOff | Image doesn't exist, auth issues | Verify image name, check pull secrets |
| CrashLoopBackOff | Container repeatedly failing | Check logs, verify entrypoint |
| OOMKilled | Memory limit exceeded | Increase limits, fix memory leak |
| Evicted | Node under resource pressure | Check node resources, set priorities |
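The state column above can be filtered mechanically. A minimal sketch, assuming the standard five-column `kubectl get pods` output (the sample rows below are made up; against a live cluster, pipe `kubectl get pods` into the same awk filter):

```shell
# Print name and state of every pod that is not Running.
# The here-doc stands in for real `kubectl get pods` output.
not_running=$(awk 'NR > 1 && $3 != "Running" {print $1, $3}' <<'EOF'
NAME    READY   STATUS             RESTARTS   AGE
web-1   1/1     Running            0          2d
job-2   0/1     ImagePullBackOff   0          5m
db-0    0/1     CrashLoopBackOff   12         1h
EOF
)
echo "$not_running"
```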
2. Check Logs
# Current logs
kubectl logs <pod-name> -c <container-name>
# Previous instance (after restart)
kubectl logs <pod-name> --previous
# Follow logs
kubectl logs -f <pod-name>
# Last N lines
kubectl logs --tail=100 <pod-name>
3. Execute Commands in Container
# Interactive shell
kubectl exec -it <pod-name> -- /bin/sh
# Run specific command
kubectl exec <pod-name> -- cat /etc/config/app.yaml
# Multi-container pod
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash
4. Debug with Ephemeral Containers
# Add debug container to running pod
kubectl debug -it <pod-name> --image=busybox --target=<container>
# Create debug copy of pod
kubectl debug <pod-name> -it --copy-to=debug-pod --container=debug
Common Issues and Solutions
ImagePullBackOff / ErrImagePull
Diagnose:
kubectl describe pod <pod-name> | grep -A 5 Events
Causes:
- Image doesn't exist or typo in name
- Private registry without pull secret
- Network issues reaching registry
Solutions:
# Add imagePullSecrets
spec:
  imagePullSecrets:
    - name: my-registry-secret
# Create the secret
kubectl create secret docker-registry my-registry-secret \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=pass
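Instead of adding imagePullSecrets to every pod spec, the secret can be attached to the namespace's default ServiceAccount so all pods inherit it. A sketch; the function name is made up, the patch itself is the standard one:

```shell
# Attach a pull secret to the default ServiceAccount of a namespace,
# so pods there no longer need per-pod imagePullSecrets.
attach_pull_secret() {
  local secret="$1" ns="${2:-default}"
  kubectl patch serviceaccount default -n "$ns" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$secret\"}]}"
}
```

Usage: `attach_pull_secret my-registry-secret staging`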
CrashLoopBackOff
Diagnose:
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
Common Causes:
- Application error at startup
- Missing configuration/secrets
- Liveness probe failing too quickly
- Insufficient resources
Debug Approach:
# Override command to keep container running
spec:
  containers:
    - name: app
      command: ["sleep", "infinity"]
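Rather than editing YAML by hand, the same override can be applied to a live Deployment with a JSON patch. A sketch, assuming the crashing container is the first in the pod spec (the function name is made up):

```shell
# Set the first container's command to `sleep infinity` so the pod
# stays up long enough to exec in and inspect it.
# JSON Patch "add" replaces an existing value or creates a missing one.
freeze_deployment() {
  local deploy="$1"
  kubectl patch deployment "$deploy" --type='json' -p='[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/command",
     "value": ["sleep", "infinity"]}
  ]'
}
```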
Pending Pods
Diagnose:
kubectl describe pod <pod-name> | grep -A 10 Events
kubectl get nodes
kubectl describe node <node-name>
Common Causes:
- Insufficient CPU/memory on nodes
- Node selector/affinity not matched
- Taints without tolerations
- PVC not bound
Check resources:
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
Service Not Accessible
Diagnose:
# Check service endpoints
kubectl get endpoints <service-name>
# Test from within cluster
kubectl run test --rm -it --image=busybox -- wget -qO- http://<service>:<port>
# Check service selector matches pod labels
kubectl get pods --show-labels
kubectl describe service <service-name>
Common Causes:
- Selector doesn't match pod labels
- Pod not in Ready state
- Wrong port configuration
- NetworkPolicy blocking traffic
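The first cause, a selector that doesn't match pod labels, can be checked by printing both sides next to each other. A sketch (the function name is made up, the jsonpath is standard):

```shell
# Print the service's selector next to the labels of the pods it is
# supposed to match, to spot mismatches at a glance.
compare_selector() {
  local svc="$1" ns="${2:-default}"
  echo "service selector:"
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo
  echo "pod labels:"
  kubectl get pods -n "$ns" --show-labels
}
```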
DNS Issues
Test DNS resolution:
kubectl run dnstest --rm -it --image=busybox -- nslookup kubernetes.default
kubectl run dnstest --rm -it --image=busybox -- nslookup <service>.<namespace>
Check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
Node Troubleshooting
Node Not Ready
kubectl describe node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>
Check kubelet:
# On the node
systemctl status kubelet
journalctl -u kubelet -f
Common Causes:
- kubelet not running
- Container runtime issues
- Network plugin problems
- Disk pressure
- Memory pressure
- PID pressure
Resource Pressure
kubectl describe node <node-name> | grep Conditions -A 10
Conditions to watch:
- MemoryPressure
- DiskPressure
- PIDPressure
- NetworkUnavailable
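Those conditions can be pulled for every node at once with custom-columns. A sketch (column names are arbitrary, the jsonpath filters are the standard condition lookups):

```shell
# One row per node: Ready status plus the pressure conditions.
node_conditions() {
  kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,'\
'READY:.status.conditions[?(@.type=="Ready")].status,'\
'MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,'\
'DISK:.status.conditions[?(@.type=="DiskPressure")].status,'\
'PID:.status.conditions[?(@.type=="PIDPressure")].status'
}
```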
Network Troubleshooting
Connectivity Testing
# Create debug pod with network tools
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
# Inside the pod
curl -v http://service-name:port
ping <pod-ip>
traceroute <destination>
nslookup service-name
Check NetworkPolicies
kubectl get networkpolicies -A
kubectl describe networkpolicy <name>
Ingress Issues
# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Verify ingress configuration
kubectl describe ingress <ingress-name>
# Check if backend service has endpoints
kubectl get endpoints <backend-service>
Storage Troubleshooting
PVC Stuck in Pending
kubectl describe pvc <pvc-name>
kubectl get pv
kubectl get storageclass
Common Causes:
- No available PV matching claim
- StorageClass doesn't exist
- Provisioner not working
- Volume binding mode waiting for consumer
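The last cause deserves emphasis: with volumeBindingMode: WaitForFirstConsumer, a PVC stays Pending by design until a pod actually mounts it. A sketch to check the mode behind a claim (the function name is made up):

```shell
# Show the binding mode of the StorageClass behind a PVC.
# WaitForFirstConsumer means Pending is expected until a pod uses the claim.
pvc_binding_mode() {
  local pvc="$1" ns="${2:-default}"
  local sc
  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  kubectl get storageclass "$sc" -o jsonpath='{.volumeBindingMode}'
  echo
}
```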
Volume Mount Failures
kubectl describe pod <pod-name> | grep -A 20 "Volumes:"
kubectl get events --field-selector reason=FailedMount
Debugging Tools
kubectl debug
# Built into kubectl since v1.18, no plugin install needed
# (the older kubectl-debug krew plugin is deprecated in its favor)
# Debug node: runs a privileged pod with the host filesystem at /host
kubectl debug node/<node-name> -it --image=ubuntu
Useful One-Liners
# Find pods with restarts
kubectl get pods --all-namespaces | awk '$5 > 0'
# Find pods not running
kubectl get pods -A --field-selector=status.phase!=Running
# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp' -A
# Watch pod status changes
kubectl get pods -w
# Get resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu
# Find pods on specific node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
# Describe all pods in CrashLoopBackOff
kubectl get pods -A | grep CrashLoop | awk '{print $1, $2}' | xargs -n2 kubectl describe pod -n
# Force delete stuck pod
kubectl delete pod <pod> --grace-period=0 --force
Resource Inspection
# Get YAML with status
kubectl get pod <name> -o yaml
# Get specific fields
kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount
Monitoring Queries
Prometheus Queries for Troubleshooting
# Pod restart rate
rate(kube_pod_container_status_restarts_total[5m]) > 0
# Containers in waiting state
kube_pod_container_status_waiting_reason
# Pods not ready
kube_pod_status_ready{condition="false"}
# Node memory pressure
kube_node_status_condition{condition="MemoryPressure",status="true"}
# PVC not bound
kube_persistentvolumeclaim_status_phase{phase!="Bound"}
Interview Scenarios
Scenario 1: Pod keeps crashing
- Check logs with the --previous flag
- Look for OOMKilled in the describe output
- Verify configuration and secrets exist
- Check liveness probe settings
- Test manually with sleep command override
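The checklist above can be run as one first-pass script. A sketch (the function name is made up; the grep may match nothing, hence the || true):

```shell
# First-pass evidence gathering for a crashing pod.
crashloop_triage() {
  local pod="$1" ns="${2:-default}"
  # 1. Logs from the previous (crashed) instance
  kubectl logs "$pod" -n "$ns" --previous --tail=50
  # 2. OOM, exit code, and probe hints from describe
  kubectl describe pod "$pod" -n "$ns" \
    | grep -E 'OOMKilled|Exit Code|Liveness' || true
  # 3. Recent events for this pod
  kubectl get events -n "$ns" --field-selector involvedObject.name="$pod"
}
```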
Scenario 2: Service returns 503
- Verify endpoints exist: kubectl get endpoints
- Check if pods are Ready
- Test pod directly with port-forward
- Check NetworkPolicies
- Review readiness probe configuration
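The same checklist, sketched as a script (the function name is made up):

```shell
# Walk the 503 checklist: endpoints, pod readiness, network policies.
service_503_triage() {
  local svc="$1" ns="${2:-default}"
  kubectl get endpoints "$svc" -n "$ns"
  kubectl get pods -n "$ns" --show-labels
  kubectl get networkpolicies -n "$ns"
  # If endpoints are empty, bypass the service and test a pod directly:
  # kubectl port-forward pod/<pod> 8080:80 -n "$ns"
}
```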
Scenario 3: Deployment rollout stuck
- Check rollout status: kubectl rollout status
- Look for pending pods
- Check resource constraints
- Review PodDisruptionBudget
- Check for image pull issues
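Sketched as a script (the function name is made up; kubectl rollout undo is the standard rollback once the cause is understood):

```shell
# Inspect a stuck rollout: status, pending pods, disruption budgets.
rollout_triage() {
  local deploy="$1" ns="${2:-default}"
  kubectl rollout status deployment/"$deploy" -n "$ns" --timeout=30s
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  kubectl get pdb -n "$ns"
  # Roll back once the cause is understood:
  # kubectl rollout undo deployment/"$deploy" -n "$ns"
}
```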