
Kubernetes Troubleshooting

Pod Debugging Workflow

1. Check Pod Status

kubectl get pods -o wide
kubectl describe pod <pod-name>

Common Pod States:

| State            | Cause                                   | Solution                               |
|------------------|-----------------------------------------|----------------------------------------|
| Pending          | No node available, resource constraints | Check node capacity, resource requests |
| ImagePullBackOff | Image doesn't exist, auth issues        | Verify image name, check pull secrets  |
| CrashLoopBackOff | Container repeatedly failing            | Check logs, verify entrypoint          |
| OOMKilled        | Memory limit exceeded                   | Increase limits, fix memory leak       |
| Evicted          | Node under resource pressure            | Check node resources, set priorities   |
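Several of these states trace back to resource configuration. A minimal sketch (container name, image, and values are placeholders): explicit requests give the scheduler something to place, and limits bound memory use.

```yaml
# Hypothetical container spec: requests help avoid Pending (the scheduler
# knows what the pod needs), while limits.memory is the threshold that
# triggers OOMKilled when exceeded.
spec:
  containers:
  - name: app                          # placeholder name
    image: registry.example.com/app:1.0
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 256Mi
```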

2. Check Logs

# Current logs
kubectl logs <pod-name> -c <container-name>

# Previous instance (after restart)
kubectl logs <pod-name> --previous

# Follow logs
kubectl logs -f <pod-name>

# Last N lines
kubectl logs --tail=100 <pod-name>

3. Execute Commands in Container

# Interactive shell
kubectl exec -it <pod-name> -- /bin/sh

# Run specific command
kubectl exec <pod-name> -- cat /etc/config/app.yaml

# Multi-container pod
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash

4. Debug with Ephemeral Containers

# Add debug container to running pod
kubectl debug -it <pod-name> --image=busybox --target=<container>

# Create a debug copy of the pod with an added container, sharing the PID namespace
kubectl debug <pod-name> -it --image=busybox --share-processes --copy-to=debug-pod

Common Issues and Solutions

ImagePullBackOff / ErrImagePull

Diagnose:

kubectl describe pod <pod-name> | grep -A 5 Events

Causes:

  1. Image doesn't exist or typo in name
  2. Private registry without pull secret
  3. Network issues reaching registry

Solutions:

# Add imagePullSecrets
spec:
  imagePullSecrets:
  - name: my-registry-secret

# Create the secret
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass

CrashLoopBackOff

Diagnose:

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>

Common Causes:

  1. Application error at startup
  2. Missing configuration/secrets
  3. Liveness probe failing too quickly
  4. Insufficient resources

Debug Approach:

# Override command to keep container running
spec:
  containers:
  - name: app
    command: ["sleep", "infinity"]
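If the crash loop is caused by a liveness probe killing a slow-starting app (cause 3), relaxing the probe often confirms the diagnosis. A sketch with illustrative values (the health endpoint and port are assumptions):

```yaml
# Illustrative probe tuning: give the app time to start before the first
# liveness check, and tolerate a few failures before a restart.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
# On Kubernetes 1.18+, a startupProbe is the cleaner fix for slow starts.
```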

Pending Pods

Diagnose:

kubectl describe pod <pod-name> | grep -A 10 Events
kubectl get nodes
kubectl describe node <node-name>

Common Causes:

  1. Insufficient CPU/memory on nodes
  2. Node selector/affinity not matched
  3. Taints without tolerations
  4. PVC not bound

Check resources:

kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
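Scheduling constraints (causes 2 and 3) can be sketched like this; the label and taint keys below are placeholders, not real cluster values:

```yaml
# Hypothetical pod spec: nodeSelector must match a label that actually
# exists on some node, and tainted nodes need a matching toleration,
# otherwise the pod stays Pending.
spec:
  nodeSelector:
    disktype: ssd                # placeholder node label
  tolerations:
  - key: "dedicated"             # placeholder taint key
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```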

Service Not Accessible

Diagnose:

# Check service endpoints
kubectl get endpoints <service-name>

# Test from within cluster
kubectl run test --rm -it --restart=Never --image=busybox -- wget -qO- http://<service>:<port>

# Check service selector matches pod labels
kubectl get pods --show-labels
kubectl describe service <service-name>

Common Causes:

  1. Selector doesn't match pod labels
  2. Pod not in Ready state
  3. Wrong port configuration
  4. NetworkPolicy blocking traffic
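Cause 1 is the most common: the Service selector must match the pod labels exactly, and targetPort must match the port the container listens on. A minimal sketch (names and ports are placeholders):

```yaml
# The selector below must equal the pod's labels; a mismatch leaves the
# Service with zero endpoints.
apiVersion: v1
kind: Service
metadata:
  name: my-service         # placeholder
spec:
  selector:
    app: my-app            # must match the pod label app: my-app
  ports:
  - port: 80               # port clients connect to
    targetPort: 8080       # port the container actually listens on
```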

DNS Issues

Test DNS resolution:

kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup kubernetes.default
kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup <service>.<namespace>

Check CoreDNS:

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Node Troubleshooting

Node Not Ready

kubectl describe node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>

Check kubelet:

# On the node
systemctl status kubelet
journalctl -u kubelet -f

Common Causes:

  1. kubelet not running
  2. Container runtime issues
  3. Network plugin problems
  4. Disk pressure
  5. Memory pressure
  6. PID pressure

Resource Pressure

kubectl describe node <node-name> | grep Conditions -A 10

Conditions to watch:

  • MemoryPressure
  • DiskPressure
  • PIDPressure
  • NetworkUnavailable

Network Troubleshooting

Connectivity Testing

# Create debug pod with network tools
kubectl run netshoot --rm -it --restart=Never --image=nicolaka/netshoot -- /bin/bash

# Inside the pod
curl -v http://service-name:port
ping <pod-ip>
traceroute <destination>
nslookup service-name

Check NetworkPolicies

kubectl get networkpolicies -A
kubectl describe networkpolicy <name>
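A common surprise is a default-deny policy: a policy like the sketch below selects every pod in its namespace and, having no ingress rules, denies all inbound traffic, so each allowed flow then needs an explicit rule.

```yaml
# Example default-deny ingress policy: the empty podSelector selects all
# pods in the namespace; with no ingress rules listed, nothing gets in.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
```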

Ingress Issues

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify ingress configuration
kubectl describe ingress <ingress-name>

# Check if backend service has endpoints
kubectl get endpoints <backend-service>

Storage Troubleshooting

PVC Stuck in Pending

kubectl describe pvc <pvc-name>
kubectl get pv
kubectl get storageclass

Common Causes:

  1. No available PV matching claim
  2. StorageClass doesn't exist
  3. Provisioner not working
  4. Volume binding mode waiting for consumer
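Cause 4 is expected behavior, not an error: with `volumeBindingMode: WaitForFirstConsumer` the PVC stays Pending until a pod that mounts it is scheduled. A sketch (the name is a placeholder and the provisioner is just one example CSI driver):

```yaml
# With WaitForFirstConsumer, the PV is provisioned only once a consuming
# pod is scheduled, so a Pending PVC before that point is normal.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc          # placeholder name
provisioner: ebs.csi.aws.com   # example CSI provisioner
volumeBindingMode: WaitForFirstConsumer
```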

Volume Mount Failures

kubectl describe pod <pod-name> | grep -A 20 "Volumes:"
kubectl get events --field-selector reason=FailedMount

Debugging Tools

kubectl debug

# Built into kubectl since v1.20; no plugin install needed on current versions

# Debug node (runs a privileged pod on the node; the host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu

Useful One-Liners

# Find pods with restarts
kubectl get pods --all-namespaces | awk '$5 > 0'

# Find pods not running
kubectl get pods -A --field-selector=status.phase!=Running

# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp' -A

# Watch pod status changes
kubectl get pods -w

# Get resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Find pods on specific node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

# Describe all pods in CrashLoopBackOff
kubectl get pods -A | grep CrashLoop | awk '{print $1, $2}' | xargs -n2 kubectl describe pod -n

# Force delete stuck pod
kubectl delete pod <pod> --grace-period=0 --force
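To see what the restart-count filter above actually selects, here is the same awk expression run against made-up `kubectl get pods -A` output (the pod names are invented for illustration):

```shell
# Made-up `kubectl get pods -A` output. With -A, RESTARTS is column $5.
# Note: the header row would also match a bare `$5 > 0` (awk compares the
# non-numeric string "RESTARTS" lexically against "0"), so NR > 1 skips it.
sample='NAMESPACE   NAME    READY   STATUS             RESTARTS   AGE
default     web-1   1/1     Running            0          2d
default     web-2   0/1     CrashLoopBackOff   7          5m'
printf '%s\n' "$sample" | awk 'NR > 1 && $5 > 0'
```

The same pattern works for any column-based filter on kubectl's table output; just count the columns for the flags you pass.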

Resource Inspection

# Get YAML with status
kubectl get pod <name> -o yaml

# Get specific fields
kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount

Monitoring Queries

Prometheus Queries for Troubleshooting

# Pod restart rate
rate(kube_pod_container_status_restarts_total[5m]) > 0

# Containers in waiting state
kube_pod_container_status_waiting_reason

# Pods not ready
kube_pod_status_ready{condition="false"}

# Node memory pressure
kube_node_status_condition{condition="MemoryPressure",status="true"}

# PVC not bound
kube_persistentvolumeclaim_status_phase{phase!="Bound"}
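These queries can also drive alerting. A sketch of a Prometheus alerting rule built on the first query; the group name, threshold, and labels are illustrative, not a recommended baseline:

```yaml
# Illustrative alerting rule: fires after a container has been restarting
# for 10 minutes straight.
groups:
- name: pod-troubleshooting            # placeholder group name
  rules:
  - alert: PodRestartingFrequently
    expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting"
```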

Interview Scenarios

Scenario 1: Pod keeps crashing

  1. Check logs with --previous flag
  2. Look for OOMKilled in describe output
  3. Verify configuration and secrets exist
  4. Check liveness probe settings
  5. Test manually with sleep command override

Scenario 2: Service returns 503

  1. Verify endpoints exist: kubectl get endpoints
  2. Check if pods are Ready
  3. Test pod directly with port-forward
  4. Check NetworkPolicies
  5. Review readiness probe configuration
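Step 5 matters because a pod whose readiness probe fails is removed from the Service's endpoints; with no ready backends, the proxy or ingress returns 503 even though the pod is running. A sketch (the path and port are assumptions):

```yaml
# A failing readinessProbe keeps the pod out of the Service's endpoint
# list, which is exactly the "endpoints empty, pods Running" 503 pattern.
readinessProbe:
  httpGet:
    path: /ready            # assumed readiness endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```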

Scenario 3: Deployment rollout stuck

  1. Check rollout status: kubectl rollout status
  2. Look for pending pods
  3. Check resource constraints
  4. Review PodDisruptionBudget
  5. Check for image pull issues