Pods are the foundational building blocks of Kubernetes applications, but they can fail for numerous reasons. This guide will help you systematically diagnose and resolve common pod failures.
Understanding Pod Lifecycle and Common Failure States
When troubleshooting pod issues, it’s important to understand the various states a pod can be in:
- Pending: The pod has been accepted but containers aren’t running yet
- Running: The pod is bound to a node and all containers are running
- Succeeded: All containers have terminated successfully
- Failed: All containers have terminated, and at least one terminated with failure
- Unknown: The state of the pod can’t be determined
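A pod's phase can also be read directly, which is handy when scripting checks; a minimal sketch using kubectl's standard jsonpath output:
# Print only the pod's phase (Pending, Running, Succeeded, Failed, Unknown)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.phase}'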
Step-by-Step Troubleshooting Process
1. Identify the Problem Pod and Its Status
# List all pods in the namespace
kubectl get pods -n <namespace>
# Get more details about the specific pod
kubectl describe pod <pod-name> -n <namespace>
Look for:
- The pod’s status (Pending, CrashLoopBackOff, ImagePullBackOff, Error, etc.)
- Events section for error messages
- Container statuses
- Restart counts
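When a namespace contains many pods, a field selector can narrow the list to pods that are not currently Running (note that this also lists Succeeded pods, such as completed jobs); a small sketch:
# List only pods whose phase is not Running
kubectl get pods -n <namespace> --field-selector=status.phase!=Running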
2. Check Pod Logs
# View logs of a specific container
kubectl logs <pod-name> -c <container-name> -n <namespace>
# View logs for a pod with a single container
kubectl logs <pod-name> -n <namespace>
# View previous container logs if it has restarted
kubectl logs <pod-name> -n <namespace> --previous
3. Resolving Common Pod Issues
Issue: ImagePullBackOff or ErrImagePull
This indicates Kubernetes can’t pull the container image.
Solutions:
- Verify the image name and tag are correct
- Check if the image exists in the specified registry
- Ensure proper image pull secrets are configured:
# Check if image pull secrets are configured
kubectl get pod <pod-name> -n <namespace> -o yaml | grep imagePullSecrets -A 5
# Create a new image pull secret
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry-server> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email> \
  -n <namespace>
# Patch the service account to use the secret
kubectl patch serviceaccount <service-account-name> \
  -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}' \
  -n <namespace>
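If patching the service account isn't desirable, the secret can also be referenced directly in the pod spec. A minimal sketch; the pod name, container name, and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  imagePullSecrets:
    - name: <secret-name>
  containers:
    - name: app
      image: <registry-server>/<image>:<tag>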
Issue: CrashLoopBackOff
The container is starting, crashing, and restarting repeatedly.
Solutions:
- Check container logs for application errors
- Verify the container’s health checks are properly configured
- Check if the app can run with the allocated resources
- Debug the application inside the container:
# Attach an ephemeral debug container to the running pod
kubectl debug <pod-name> -it --image=busybox -n <namespace> -- sh
# Or execute into the running container if possible
kubectl exec -it <pod-name> -c <container-name> -n <namespace> -- sh
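The last termination state recorded on the pod often explains why the previous run crashed; a quick check using jsonpath (the containerStatuses index here assumes a single-container pod):
# Show the exit code and reason of the container's previous termination
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'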
Issue: Pending State
Pod remains in Pending state and doesn’t get scheduled.
Solutions:
- Check if the cluster has enough resources:
# Check node capacity and allocatable resources
kubectl describe nodes
# Check if the pod is requesting resources that exceed node capacity
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 Requests
- Verify if pod has node affinity/taints that prevent scheduling:
# Look for node affinity rules
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 affinity
# Check for node taints
kubectl describe nodes | grep Taints
- Check for PersistentVolumeClaim issues:
# List PVCs and their status
kubectl get pvc -n <namespace>
# Check details of a specific PVC
kubectl describe pvc <pvc-name> -n <namespace>
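Scheduling failures are also recorded as events, and the FailedScheduling reason usually states exactly what the scheduler is missing (insufficient CPU or memory, unbound PVCs, untolerated taints, and so on); for example:
# Show scheduling failure events in the namespace
kubectl get events -n <namespace> --field-selector reason=FailedScheduling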
Issue: Pod in ContainerCreating State
Pod is stuck in ContainerCreating state.
Solutions:
- Check for volume mount issues:
# Look for mount errors in describe output
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
- Check if the kubelet can pull images:
# Check kubelet logs on the node
sudo journalctl -u kubelet
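Before logging on to the node, it is often quicker to pull the events for just this pod, since mount and image errors show up there; a small sketch:
# Show events for a single pod, oldest first
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by='.lastTimestamp'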
4. Debugging Networking Issues
If pods can’t communicate with each other or external services:
# Deploy a network debugging pod
kubectl run network-debug --rm -it --image=nicolaka/netshoot -n <namespace> -- sh
# From inside the pod, test connectivity
ping <service-name>
curl <service-name>:<port>
nslookup <service-name>
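If lookups by short name fail, check that the service actually has backing endpoints and try its fully qualified name (this assumes the default cluster.local cluster domain):
# Verify the service has backing endpoints
kubectl get endpoints <service-name> -n <namespace>
# From inside the debug pod, resolve the fully qualified service name
nslookup <service-name>.<namespace>.svc.cluster.local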
5. Checking Resource Constraints
# Check if the pod is being throttled due to resource limits
kubectl top pod <pod-name> -n <namespace>
# Check resource quotas in the namespace
kubectl describe quota -n <namespace>
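Node-level pressure and namespace defaults can also affect pods; these checks give a quick picture (kubectl top requires the metrics-server to be installed in the cluster):
# Check overall node utilization
kubectl top nodes
# Check whether a LimitRange applies default requests/limits in the namespace
kubectl describe limitrange -n <namespace>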
Practical Example: Resolving a CrashLoopBackOff Issue
Let’s say we have a pod named web-app that’s in CrashLoopBackOff state:
# Check the pod status
kubectl get pod web-app -n production
# NAME      READY   STATUS             RESTARTS   AGE
# web-app   0/1     CrashLoopBackOff   5          10m
# Check the pod details
kubectl describe pod web-app -n production
# [Events section shows container exiting with code 1]
# Check the logs
kubectl logs web-app -n production
# Error: could not connect to database at db-service:5432
# Verify if the database service exists and is running
kubectl get svc db-service -n production
# No resources found
# The issue is that the database service doesn't exist
# Create the missing service and deployment
kubectl apply -f database-deployment.yaml -n production
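Once the missing dependency exists, the pod should leave CrashLoopBackOff on its own when the current backoff interval expires; you can watch it recover, and, assuming web-app were managed by a Deployment, force a quicker restart:
# Watch the pod status until it returns to Running
kubectl get pod web-app -n production --watch
# If the pod is owned by a Deployment, a rollout restart recreates it immediately
# kubectl rollout restart deployment/web-app -n production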
Preventive Measures
- Use Liveness and Readiness Probes: Implement appropriate probes to detect and recover from application failures:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
- Set Resource Requests and Limits: Ensure your pods have appropriate resource requests and limits:
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"
- Implement Pod Disruption Budgets: Protect your applications during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
By following this systematic approach to troubleshooting pod failures, you can quickly identify and resolve issues in your Kubernetes environment, minimizing downtime and maintaining application reliability.