
Kubernetes Troubleshooting

Pod Debugging Workflow

1. Check Pod Status

kubectl get pods -o wide
kubectl describe pod <pod-name>

Common Pod States:

| State            | Cause                                   | Solution                               |
|------------------|-----------------------------------------|----------------------------------------|
| Pending          | No node available, resource constraints | Check node capacity, resource requests |
| ImagePullBackOff | Image doesn't exist, auth issues        | Verify image name, check pull secrets  |
| CrashLoopBackOff | Container repeatedly failing            | Check logs, verify entrypoint          |
| OOMKilled        | Memory limit exceeded                   | Increase limits, fix memory leak       |
| Evicted          | Node under resource pressure            | Check node resources, set priorities   |
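Several of these states trace back to resource configuration. A minimal sketch (container name, image, and values are placeholders): explicit requests give the scheduler something to place, and limits bound memory use.

```yaml
# Hypothetical container spec: requests help avoid Pending (the scheduler
# knows what the pod needs), while limits.memory is the threshold that
# triggers OOMKilled when exceeded.
spec:
  containers:
  - name: app                          # placeholder name
    image: registry.example.com/app:1.0
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 256Mi
```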

2. Check Logs

# Current logs
kubectl logs <pod-name> -c <container-name>

# Previous instance (after restart)
kubectl logs <pod-name> --previous

# Follow logs
kubectl logs -f <pod-name>

# Last N lines
kubectl logs --tail=100 <pod-name>

3. Execute Commands in Container

# Interactive shell
kubectl exec -it <pod-name> -- /bin/sh

# Run specific command
kubectl exec <pod-name> -- cat /etc/config/app.yaml

# Multi-container pod
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash

4. Debug with Ephemeral Containers

# Add debug container to running pod
kubectl debug -it <pod-name> --image=busybox --target=<container>

# Create a debug copy of the pod with an added container, sharing the PID namespace
kubectl debug <pod-name> -it --image=busybox --share-processes --copy-to=debug-pod

Common Issues and Solutions

ImagePullBackOff / ErrImagePull

Diagnose:

kubectl describe pod <pod-name> | grep -A 5 Events

Causes:

  1. Image doesn't exist or typo in name
  2. Private registry without pull secret
  3. Network issues reaching registry

Solutions:

# Add imagePullSecrets
spec:
  imagePullSecrets:
  - name: my-registry-secret

# Create the secret
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass

CrashLoopBackOff

Diagnose:

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>

Common Causes:

  1. Application error at startup
  2. Missing configuration/secrets
  3. Liveness probe failing too quickly
  4. Insufficient resources

Debug Approach:

# Override command to keep container running
spec:
  containers:
  - name: app
    command: ["sleep", "infinity"]
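If the crash loop is caused by a liveness probe killing a slow-starting app (cause 3), relaxing the probe often confirms the diagnosis. A sketch with illustrative values (the health endpoint and port are assumptions):

```yaml
# Illustrative probe tuning: give the app time to start before the first
# liveness check, and tolerate a few failures before a restart.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
# On Kubernetes 1.18+, a startupProbe is the cleaner fix for slow starts.
```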

Pending Pods

Diagnose:

kubectl describe pod <pod-name> | grep -A 10 Events
kubectl get nodes
kubectl describe node <node-name>

Common Causes:

  1. Insufficient CPU/memory on nodes
  2. Node selector/affinity not matched
  3. Taints without tolerations
  4. PVC not bound

Check resources:

kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
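Scheduling constraints (causes 2 and 3) can be sketched like this; the label and taint keys below are placeholders, not real cluster values:

```yaml
# Hypothetical pod spec: nodeSelector must match a label that actually
# exists on some node, and tainted nodes need a matching toleration,
# otherwise the pod stays Pending.
spec:
  nodeSelector:
    disktype: ssd                # placeholder node label
  tolerations:
  - key: "dedicated"             # placeholder taint key
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```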

Service Not Accessible

Diagnose:

# Check service endpoints
kubectl get endpoints <service-name>

# Test from within cluster
kubectl run test --rm -it --restart=Never --image=busybox -- wget -qO- http://<service>:<port>

# Check service selector matches pod labels
kubectl get pods --show-labels
kubectl describe service <service-name>

Common Causes:

  1. Selector doesn't match pod labels
  2. Pod not in Ready state
  3. Wrong port configuration
  4. NetworkPolicy blocking traffic
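Cause 1 is the most common: the Service selector must match the pod labels exactly, and targetPort must match the port the container listens on. A minimal sketch (names and ports are placeholders):

```yaml
# The selector below must equal the pod's labels; a mismatch leaves the
# Service with zero endpoints.
apiVersion: v1
kind: Service
metadata:
  name: my-service         # placeholder
spec:
  selector:
    app: my-app            # must match the pod label app: my-app
  ports:
  - port: 80               # port clients connect to
    targetPort: 8080       # port the container actually listens on
```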

DNS Issues

Test DNS resolution:

kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup kubernetes.default
kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup <service>.<namespace>

Check CoreDNS:

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Node Troubleshooting

Node Not Ready

kubectl describe node <node-name>
kubectl get events --field-selector involvedObject.name=<node-name>

Check kubelet:

# On the node
systemctl status kubelet
journalctl -u kubelet -f

Common Causes:

  1. kubelet not running
  2. Container runtime issues
  3. Network plugin problems
  4. Disk pressure
  5. Memory pressure
  6. PID pressure

Resource Pressure

kubectl describe node <node-name> | grep Conditions -A 10

Conditions to watch:

  • MemoryPressure
  • DiskPressure
  • PIDPressure
  • NetworkUnavailable

Network Troubleshooting

Connectivity Testing

# Create debug pod with network tools
kubectl run netshoot --rm -it --restart=Never --image=nicolaka/netshoot -- /bin/bash

# Inside the pod
curl -v http://service-name:port
ping <pod-ip>
traceroute <destination>
nslookup service-name

Check NetworkPolicies

kubectl get networkpolicies -A
kubectl describe networkpolicy <name>
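A common surprise is a default-deny policy: a policy like the sketch below selects every pod in its namespace and, having no ingress rules, denies all inbound traffic, so each allowed flow then needs an explicit rule.

```yaml
# Example default-deny ingress policy: the empty podSelector selects all
# pods in the namespace; with no ingress rules listed, nothing gets in.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
```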

Ingress Issues

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify ingress configuration
kubectl describe ingress <ingress-name>

# Check if backend service has endpoints
kubectl get endpoints <backend-service>

Storage Troubleshooting

PVC Stuck in Pending

kubectl describe pvc <pvc-name>
kubectl get pv
kubectl get storageclass

Common Causes:

  1. No available PV matching claim
  2. StorageClass doesn't exist
  3. Provisioner not working
  4. Volume binding mode waiting for consumer
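Cause 4 is expected behavior, not an error: with `volumeBindingMode: WaitForFirstConsumer` the PVC stays Pending until a pod that mounts it is scheduled. A sketch (the name is a placeholder and the provisioner is just one example CSI driver):

```yaml
# With WaitForFirstConsumer, the PV is provisioned only once a consuming
# pod is scheduled, so a Pending PVC before that point is normal.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc          # placeholder name
provisioner: ebs.csi.aws.com   # example CSI provisioner
volumeBindingMode: WaitForFirstConsumer
```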

Volume Mount Failures

kubectl describe pod <pod-name> | grep -A 20 "Volumes:"
kubectl get events --field-selector reason=FailedMount

Debugging Tools

kubectl debug

# Built into kubectl since v1.20; no plugin install needed on current versions

# Debug node (runs a privileged pod on the node; the host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu

Useful One-Liners

# Find pods with restarts
kubectl get pods --all-namespaces | awk '$5 > 0'

# Find pods not running
kubectl get pods -A --field-selector=status.phase!=Running

# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp' -A

# Watch pod status changes
kubectl get pods -w

# Get resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Find pods on specific node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

# Describe all pods in CrashLoopBackOff
kubectl get pods -A | grep CrashLoop | awk '{print $1, $2}' | xargs -n2 kubectl describe pod -n

# Force delete stuck pod
kubectl delete pod <pod> --grace-period=0 --force
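To see what the restart-count filter above actually selects, here is the same awk expression run against made-up `kubectl get pods -A` output (the pod names are invented for illustration):

```shell
# Made-up `kubectl get pods -A` output. With -A, RESTARTS is column $5.
# Note: the header row would also match a bare `$5 > 0` (awk compares the
# non-numeric string "RESTARTS" lexically against "0"), so NR > 1 skips it.
sample='NAMESPACE   NAME    READY   STATUS             RESTARTS   AGE
default     web-1   1/1     Running            0          2d
default     web-2   0/1     CrashLoopBackOff   7          5m'
printf '%s\n' "$sample" | awk 'NR > 1 && $5 > 0'
```

The same pattern works for any column-based filter on kubectl's table output; just count the columns for the flags you pass.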

Resource Inspection

# Get YAML with status
kubectl get pod <name> -o yaml

# Get specific fields
kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount

Monitoring Queries

Prometheus Queries for Troubleshooting

# Pod restart rate
rate(kube_pod_container_status_restarts_total[5m]) > 0

# Containers in waiting state
kube_pod_container_status_waiting_reason

# Pods not ready
kube_pod_status_ready{condition="false"}

# Node memory pressure
kube_node_status_condition{condition="MemoryPressure",status="true"}

# PVC not bound
kube_persistentvolumeclaim_status_phase{phase!="Bound"}
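These queries can also drive alerting. A sketch of a Prometheus alerting rule built on the first query; the group name, threshold, and labels are illustrative, not a recommended baseline:

```yaml
# Illustrative alerting rule: fires after a container has been restarting
# for 10 minutes straight.
groups:
- name: pod-troubleshooting            # placeholder group name
  rules:
  - alert: PodRestartingFrequently
    expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting"
```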

Interview Scenarios

Scenario 1: Pod keeps crashing

  1. Check logs with --previous flag
  2. Look for OOMKilled in describe output
  3. Verify configuration and secrets exist
  4. Check liveness probe settings
  5. Test manually with sleep command override

Scenario 2: Service returns 503

  1. Verify endpoints exist: kubectl get endpoints
  2. Check if pods are Ready
  3. Test pod directly with port-forward
  4. Check NetworkPolicies
  5. Review readiness probe configuration
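Step 5 matters because a pod whose readiness probe fails is removed from the Service's endpoints; with no ready backends, the proxy or ingress returns 503 even though the pod is running. A sketch (the path and port are assumptions):

```yaml
# A failing readinessProbe keeps the pod out of the Service's endpoint
# list, which is exactly the "endpoints empty, pods Running" 503 pattern.
readinessProbe:
  httpGet:
    path: /ready            # assumed readiness endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```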

Scenario 3: Deployment rollout stuck

  1. Check rollout status: kubectl rollout status
  2. Look for pending pods
  3. Check resource constraints
  4. Review PodDisruptionBudget
  5. Check for image pull issues