Modern engineering teams rely on Kubernetes to deploy and scale applications efficiently. But when deployments fail, debugging can quickly consume valuable engineering time. Pods remain stuck, containers crash, services don’t route traffic, or applications fail health checks.
This guide expands on a practical Deployment Failure Diagnostic Flow and turns it into a detailed troubleshooting reference for engineers working in development, staging, or production Kubernetes environments.
Pro Tip: If your application worked in staging but failed in production, start by checking Secrets, ConfigMaps, and environment-specific configurations. Misconfiguration is one of the most common causes of production deployment failures.
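One way to act on this tip is to compare which keys a ConfigMap defines in each environment. The sketch below assumes both clusters are reachable as kubeconfig contexts; the function names, and the context and ConfigMap names in the usage line, are hypothetical placeholders rather than anything from this guide.

```shell
# List the keys of a ConfigMap in a given kubeconfig context, one per line.
configmap_keys() {
  # $1 = context name, $2 = ConfigMap name
  kubectl --context "$1" get configmap "$2" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | sort
}

# Print keys that exist in only one of the two environments.
# "<" marks keys found only in the first context, ">" only in the second.
configmap_key_drift() {
  # $1 = ConfigMap name, $2 = first context, $3 = second context
  a=$(mktemp) && b=$(mktemp) || return 1
  configmap_keys "$2" "$1" > "$a"
  configmap_keys "$3" "$1" > "$b"
  # grep exits non-zero when nothing differs; that is not an error here.
  diff "$a" "$b" | grep '^[<>]' || true
  rm -f "$a" "$b"
}
```

For example, `configmap_key_drift app-config staging production` would list the keys that one environment has and the other lacks. It compares key names only, so a key that exists in both places with different values will not show up.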
Why Kubernetes Deployments Fail
A Kubernetes deployment can fail due to issues in one or more layers:
- Infrastructure Layer – insufficient nodes, CPU, memory, scheduling constraints
- Container Layer – image pull errors, registry authentication failures
- Application Layer – startup crashes, bad configs, runtime exceptions
- Health Check Layer – readiness/liveness probe failures
- Networking Layer – services, ingress, DNS, SSL issues
The fastest way to troubleshoot is to move through these layers systematically.
Step 1: The Quick Look
The first command every engineer should run:
kubectl get pods
Check the STATUS column. It often tells you immediately where the issue is.
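In a busy namespace it can help to filter that list down to the pods that need attention. The helper below is a minimal sketch, assuming the default column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE); the function name is made up for illustration.

```shell
# Print NAME and STATUS for every pod in the current namespace whose
# STATUS is neither Running nor Completed.
unhealthy_pods() {
  kubectl get pods --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Calling `unhealthy_pods` prints one line per suspicious pod. Note that a pod can be Running and still not Ready, so an empty result does not rule out probe problems.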
Common Pod Status Errors
1. Pending
If the pod stays in Pending, Kubernetes cannot schedule it.
Typical Causes:
- No available nodes
- CPU or memory requests too high
- Node selector mismatch
- Node taints without matching tolerations
- Affinity / anti-affinity restrictions
Diagnose:
kubectl describe pod <pod-name>
Look for scheduler messages such as:
0/5 nodes are available: 5 Insufficient memory.
Fixes
- Reduce resource requests
- Add nodes / scale cluster
- Correct node selectors
- Update tolerations
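To narrow a Pending pod down further, the scheduler's events and the pod's own resource requests can be pulled out directly. This is a sketch under the assumption that the pod lives in the current namespace; both function names are hypothetical.

```shell
# Show only the scheduler's FailedScheduling events for one pod.
scheduling_events() {
  # $1 = pod name
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}

# Print the resource requests of each container so they can be compared
# with what the nodes can actually offer.
pod_requests() {
  # $1 = pod name
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
}
```

Comparing the output of `pod_requests <pod-name>` with the "Allocated resources" section of `kubectl describe nodes` usually shows whether the requests are simply larger than any node can satisfy.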
2. CrashLoopBackOff
The pod starts, crashes, and Kubernetes keeps restarting it.
Typical Causes:
- App startup failure
- Missing environment variables
- Database connection failure
- Wrong command or entrypoint
- Dependency service unavailable
Diagnose:
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
Fixes
- Correct startup command
- Validate configs and secrets
- Check external dependencies
- Patch runtime exceptions
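When the logs are empty or unhelpful, the exit code of the crashed container is often a useful second clue. The sketch below reads it from the pod status and maps a few conventional codes to a first guess; the function names are hypothetical, the mapping is a rule of thumb rather than a diagnosis, and it only looks at the first container in the pod.

```shell
# Print the exit code of the previous (crashed) run of the pod's first container.
last_exit_code() {
  # $1 = pod name
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Map an exit code to a first guess at what went wrong.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the process may not be a long-running service" ;;
    126) echo "command found but not executable; check permissions and entrypoint" ;;
    127) echo "command not found; check command, args, and image contents" ;;
    137) echo "killed by SIGKILL; often the OOM killer or a forced kill after the grace period" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "application error; read the previous container's logs" ;;
  esac
}
```

A typical use would be `explain_exit_code "$(last_exit_code <pod-name>)"`, followed by `kubectl logs <pod-name> --previous` to confirm the guess.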
3. ImagePullBackOff
Kubernetes cannot pull the container image.
Typical Causes
- Wrong image tag
- Private registry authentication issue
- Image not pushed
- Network restrictions
Diagnose:
kubectl describe pod <pod-name>
Look in Events for errors like:
Failed to pull image
403 Forbidden
Image not found
Fixes:
- Correct image tag
- Verify registry credentials
- Add imagePullSecrets
- Confirm image exists
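A quick way to rule out a wrong tag is to print exactly which image references the pod is asking for. The helper below is a sketch with hypothetical function names; it relies on the fact that a reference without an explicit tag resolves to `latest`, and it treats anything containing `@` as pinned by digest.

```shell
# Print the tag of an image reference, or "latest" when no tag is given.
image_tag() {
  ref=$1
  case "$ref" in
    *@*) echo "(pinned by digest)"; return 0 ;;
  esac
  last=${ref##*/}   # last path component, so a registry port is not mistaken for a tag
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# List every container image in a pod together with its effective tag.
pod_image_tags() {
  # $1 = pod name
  for image in $(kubectl get pod "$1" -o jsonpath='{.spec.containers[*].image}'); do
    echo "$image -> $(image_tag "$image")"
  done
}
```

If `pod_image_tags <pod-name>` shows a tag that was never pushed, or an unexpected `latest`, the fix is usually in the Deployment manifest rather than in the registry.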
Step 2: Deep Diagnostic
If pod status alone doesn’t reveal enough, move deeper.
A. Is It the Image?
Run:
kubectl describe pod <pod-name>
Review the Events section carefully.
If you see: 403 Forbidden
This usually means the node or workload identity lacks permission to pull from the registry (ECR, GCR, ACR, Docker registry, etc.).
Fixes by Cloud Provider
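- AWS: Check that the IAM role attached to the node group can pull from ECR (image pulls normally use the node's credentials, not the pod's IRSA role)
- Azure: Validate that the cluster's managed identity has pull (AcrPull) permission on the ACR registry
- GCP: Check that the node service account can read from the registry (image pulls normally use the node's identity, not the pod's Workload Identity)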
B. Is It Resource Exhaustion?
Pods may start but terminate unexpectedly due to memory pressure.
Check:
kubectl describe pod <pod-name>
If you see: OOMKilled
That means the container exceeded its memory limit and was killed.
Fixes
Increase limits:
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
Or investigate:
- Memory leaks
- Large caches
- High concurrency load
Tip: Set realistic requests so the scheduler places pods correctly.
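In a pod with several containers it is not always obvious which one was killed. The following sketch reads the last termination reason of every container from the pod status; the function name is hypothetical and the output wording is our own.

```shell
# Report containers in a pod whose previous run ended with reason OOMKilled.
oom_killed_containers() {
  # $1 = pod name
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.restartCount}{"\n"}{end}' |
    awk -F '\t' '$2 == "OOMKilled" { print $1 " was OOMKilled (restarts: " $3 ")" }'
}
```

If the cluster runs metrics-server, `kubectl top pod <pod-name> --containers` can then show how close current usage is to the limit before you decide to raise it.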
C. Is It the Readiness Probe?
Sometimes the pod is Running, but traffic never reaches it.
That usually means readiness checks are failing.
Example:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
If /health returns 404 or 500, Kubernetes marks the pod as Not Ready.
Diagnose
kubectl describe pod <pod-name>
Look for:
Readiness probe failed
Fixes
- Correct endpoint path
- Increase the startup delay (initialDelaySeconds)
- Increase the probe timeout (timeoutSeconds)
- Ensure dependencies initialize before probe starts
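It can be useful to reproduce the probe by hand. An HTTP probe counts as successful when the response status is at least 200 and below 400, which the sketch below encodes; the function names are hypothetical, and the URL in the usage note assumes a port-forward to the example port 8080.

```shell
# Classify an HTTP status code the way an HTTP probe would:
# 200-399 passes, anything else fails.
probe_result() {
  # $1 = HTTP status code returned by the probe endpoint
  if [ "$1" -ge 200 ] && [ "$1" -lt 400 ]; then
    echo "success"
  else
    echo "failure"
  fi
}

# Fetch only the status code of an endpoint.
http_status() {
  # $1 = URL
  curl -s -o /dev/null -w '%{http_code}' "$1"
}
```

With `kubectl port-forward pod/<pod-name> 8080:8080` running in another terminal, `probe_result "$(http_status http://localhost:8080/health)"` shows whether the endpoint would pass, independent of probe timing settings.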
Step 3: The Network Wall
If pods are healthy but users still can’t access the app, check networking.
A. Service Check
Run:
kubectl get svc
Then verify selectors:
kubectl describe svc <service-name>
Does the service selector match pod labels?
Example mismatch:
selector:
  app: frontend
But pod label:
labels:
  app: web
No endpoints will be created.
Fix
Align labels and selectors.
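A selector mismatch shows up as a Service with no endpoints, which can be checked directly. This is a minimal sketch with a hypothetical function name; it assumes the Service is in the current namespace.

```shell
# Report whether a Service currently has any ready endpoints.
service_has_endpoints() {
  # $1 = service name
  ips=$(kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints: compare the selector with 'kubectl get pods --show-labels'"
    return 1
  fi
}
```

An empty result can also mean the pods match but are failing readiness, so it is worth checking both the labels and the probe status before changing the selector.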
B. Ingress Check
If the service works internally but not externally:
kubectl get ingress
kubectl describe ingress <ingress-name>
Check for:
- Invalid TLS certificate
- Wrong backend service
- Host mismatch
- Ingress controller errors
- 502 / 503 upstream failures
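For the 502 / 503 case, the ingress controller's access log usually shows which requests are failing. The filter below is a sketch that assumes an NGINX-style access log where the status code is the first field after the quoted request line; the function name is hypothetical, and the namespace and deployment name in the usage comment are assumptions that depend on how the controller was installed.

```shell
# Count 502 and 503 responses in access-log lines read from stdin.
# Assumes the status code is the first field after the quoted request line.
count_upstream_errors() {
  awk -F '"' '{ split($3, f, " "); if (f[1] == 502 || f[1] == 503) n[f[1]]++ }
              END { for (s in n) print s, n[s] }' | sort
}

# Usage (adjust namespace and deployment name to your installation):
#   kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=500 | count_upstream_errors
```

A burst of 503s often points at a backend Service with no ready endpoints, while 502s tend to mean the backend accepted the connection and then failed, so the counts hint at which check to repeat.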
Recommended Troubleshooting Workflow
Use this sequence every time:
1. kubectl get pods
2. kubectl describe pod
3. kubectl logs
4. Check resources
5. Check probes
6. Check service selectors
7. Check ingress/controller logs
8. Compare prod vs staging configs
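The first three steps of this sequence are mechanical enough to wrap in one helper. The function below is only a sketch with a hypothetical name; it assumes the pod is in the current namespace and limits log output to the last 50 lines.

```shell
# Run the first three workflow steps for a single pod.
triage_pod() {
  # $1 = pod name
  pod=$1
  echo "== status =="
  kubectl get pod "$pod" -o wide
  echo "== describe (check Events at the bottom) =="
  kubectl describe pod "$pod"
  echo "== logs =="
  # Prefer the crashed container's logs; fall back to the current ones.
  kubectl logs "$pod" --previous --tail=50 2>/dev/null ||
    kubectl logs "$pod" --tail=50
}
```

Running `triage_pod <pod-name>` gives a single scrollable report to attach to an incident ticket before moving on to the resource, probe, and networking checks.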
Production Best Practices to Prevent Deployment Failures
1. Use GitOps
Track every config change in version control.
2. Standardize Health Checks
Use common readiness/liveness patterns across services.
3. Validate Resources
Use requests/limits baselines for each service type.
4. Use Secrets Management
Avoid manual secret injection drift between environments.
5. Add Alerting
Monitor:
- CrashLoopBackOff
- Pending pods
- OOMKilled containers
- 5xx ingress spikes
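Proper alerting belongs in a monitoring stack, but an ad hoc check from the command line can stand in while that is being set up. The sketch below lists pods across all namespaces whose restart count has reached a threshold; the function name and the default threshold of 5 are arbitrary choices for illustration.

```shell
# Print namespace/name and restart count for pods at or above a threshold.
pods_restarting() {
  # $1 = minimum restart count (defaults to 5)
  kubectl get pods --all-namespaces --no-headers |
    awk -v min="${1:-5}" '$5 + 0 >= min + 0 { print $1 "/" $2, $5 }'
}
```

For example, `pods_restarting 10` surfaces pods that are likely in or near CrashLoopBackOff, which is a reasonable starting list for deciding which alerts to define first.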
Example Real-World Scenario
Problem
Deployment successful, but website down.
Root Cause
Pods were healthy, but the Service selector did not match the pod labels.
Resolution
Updated service selector:
selector:
  app: web
Traffic restored instantly.
Final Thoughts
Most Kubernetes deployment failures are not random; they follow predictable patterns. A structured troubleshooting flow helps engineers reduce downtime, avoid guesswork, and restore services faster.
Instead of manually chasing symptoms, inspect:
- Pod state
- Events
- Logs
- Resources
- Probes
- Networking
When teams use a repeatable diagnostic process, Kubernetes becomes easier to operate at scale.