Introduction
Kubernetes deployments fail for ordinary reasons, not mysterious ones. A manifest can validate cleanly and still produce broken rollouts because the image tag is wrong, a probe is too aggressive, a Secret is missing, or a pod cannot be scheduled onto any node. That is why troubleshooting in Kubernetes is a process, not a guess.
This guide focuses on practical kubectl commands you can use under pressure. The goal is simple: determine whether the failure sits in scheduling, image pulling, application startup, networking, configuration, or resource limits. Once you know the layer, the fix becomes much faster.
You do not need to inspect every object in the cluster. You need a repeatable path that starts with deployment errors, moves through pod events and logs, and ends with the exact control point causing the outage. That is the difference between flailing and real cluster diagnostics.
For teams running production workloads, that discipline matters. CNCF adoption data and vendor documentation show how much operational complexity sits inside Kubernetes-based systems, which is why a fast diagnostic method saves time every week. Vision Training Systems teaches this same workflow mindset: identify the layer, confirm the symptom, then isolate the root cause.
Understanding The Deployment Problem Space
A Deployment manages rollout strategy and desired state. A ReplicaSet ensures the requested number of pod replicas exist. A Pod is the smallest deployable unit. A container is the runtime process inside the pod. Each layer can fail differently, and Kubernetes reports those failures at different points in the chain.
That matters because a healthy Deployment does not always mean healthy pods. A ReplicaSet can exist while pods stay pending. A pod can run while the container crashes. A container can start while the service remains unreachable. The status line you see is often a clue, not the full answer.
Common failure patterns include Pending, CrashLoopBackOff, ImagePullBackOff, and running pods that still cannot serve traffic. Pending usually points to scheduling or resource pressure. Image pull failures usually point to image names, registry access, or network access. Crash loops usually point to app startup, configuration, or probe problems.
Always check the namespace first. Many “missing resource” issues are really scope issues. If you query the wrong namespace, the deployment may look absent even though it is present and failing elsewhere. That simple mistake wastes time during a production incident.
Note
Kubernetes status values are symptoms, not final diagnoses. Treat them as the start of cluster diagnostics, not the end.
For a useful mental model, read each layer as a dependency chain. If the Deployment is fine but the ReplicaSet is stale, the rollout may be blocked. If the ReplicaSet is fine but the Pod is pending, the scheduler is likely the bottleneck. If the Pod is running but the container is crash looping, the application or its runtime settings are the issue.
Starting With High-Level Status Checks for Kubernetes Troubleshooting
Start broad. Use kubectl get deployments, kubectl get pods, and kubectl get rs to see what is unhealthy before you dive into details. These commands give you a quick comparison of desired state versus actual state. If the numbers do not match, you already know the rollout is incomplete.
The READY column is especially useful. A deployment showing 3/3 ready is healthy in a way that 0/3 or 2/3 is not. The AGE column also helps. A workload created two minutes ago that is still not ready deserves immediate attention. A workload that has been stable for days and suddenly degraded likely changed recently or hit a new dependency failure.
kubectl describe deployment gives you events, rollout conditions, and replica information in one place. It often reveals whether the rollout is stalled, whether old ReplicaSets still exist, and whether Kubernetes is waiting for pods to become available. For quick confirmation, kubectl rollout status tells you whether the Deployment is progressing, stuck, or timed out.
A practical check is to compare desired replicas and available replicas. If desired is 6 and available is 0, you have a total failure. If desired is 6 and available is 5, the issue may be partial, localized, or intermittent. That distinction matters because the fix may be different.
- Use kubectl get deployments -n <namespace> to check desired versus available replicas.
- Use kubectl get pods -n <namespace> to see pod phase and restart counts.
- Use kubectl get rs -n <namespace> to identify whether a new ReplicaSet was created during rollout.
- Use kubectl rollout status deployment/<name> -n <namespace> to confirm progress or timeout.
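The desired-versus-available comparison above can be scripted once you have pulled the two counts. A minimal sketch, assuming you capture them with kubectl get deployment -o jsonpath as shown in the comments; the classify_rollout helper is hypothetical, not a kubectl feature:

```shell
# Pull the counts from the live object (requires cluster access):
#   desired=$(kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.replicas}')
#   available=$(kubectl get deployment <name> -n <namespace> -o jsonpath='{.status.availableReplicas}')
# Note: availableReplicas is omitted from status when it is zero, so default it to 0.

# classify_rollout is a hypothetical helper that labels the gap between
# desired and available replicas.
classify_rollout() {
  local desired="$1" available="${2:-0}"
  if [ "$desired" -gt 0 ] && [ "$available" -eq 0 ]; then
    echo "total failure"
  elif [ "$available" -lt "$desired" ]; then
    echo "partial failure"
  else
    echo "healthy"
  fi
}

classify_rollout 6 0   # total failure
classify_rollout 6 5   # partial failure
classify_rollout 3 3   # healthy
```

The labels map directly to the triage distinction above: a total failure points at something systemic (image, config, scheduling), while a partial failure suggests a localized or intermittent problem.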
Kubernetes rarely “breaks” in one place. The fastest fix usually comes from tracing the first unhealthy object in the chain.
Inspecting Pod Events And Conditions
kubectl describe pod is one of the most useful troubleshooting commands in Kubernetes. It shows pod metadata, container states, conditions, volumes, environment references, and the event stream. That event stream often contains the exact reason the pod is not behaving as expected.
Look for event messages about image pull failures, scheduling constraints, probe failures, or volume mount issues. For example, “Failed to pull image” points you to registry access or a bad image reference. “0/3 nodes are available” points to cluster capacity, taints, or affinity rules. “Readiness probe failed” often means the app is live but not ready to serve traffic yet.
Pod conditions matter too. Initialized means init containers completed. PodScheduled means the scheduler assigned a node. ContainersReady means the containers passed readiness checks. Ready means the pod can receive traffic according to Kubernetes. A pod can be running and still not be Ready.
Timestamps and repeated events help you determine whether the problem is persistent or intermittent. If the same warning repeats every few seconds, the failure is probably continuous. If events stop after a single occurrence, you may be seeing a transient issue. Always check the most recent events first, especially when a pod has many warnings or restart cycles.
Pro Tip
When pod events are noisy, focus on the newest failure messages and the first event after scheduling. That sequence usually points to the root cause faster than reading from the top.
Useful commands include:
- kubectl describe pod <pod-name> -n <namespace>
- kubectl get pod <pod-name> -n <namespace> -o wide
- kubectl get events -n <namespace> --sort-by=.lastTimestamp
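When the event stream is noisy, a small filter helps surface only the newest warnings. A sketch, assuming the default five-column tabular output of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); the newest_warnings helper is hypothetical:

```shell
# Live usage (requires cluster access); --field-selector and --sort-by are
# standard kubectl flags:
#   kubectl get events -n <namespace> \
#     --field-selector type=Warning --sort-by=.lastTimestamp

# newest_warnings is a hypothetical helper: given event output on stdin,
# it keeps only Warning rows (column 2 in the default layout) and prints
# the last N, which are the newest if the input is sorted by .lastTimestamp.
newest_warnings() {
  local n="${1:-3}"
  awk '$2 == "Warning"' | tail -n "$n"
}
```

Piping the sorted event list through a filter like this applies the Pro Tip directly: you read the newest failure messages first instead of scrolling from the top.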
Diagnosing Image And Container Startup Issues
ImagePullBackOff and ErrImagePull usually mean the container image could not be retrieved. Common causes include a wrong image name, a bad tag, missing private registry credentials, or network problems reaching the registry. If the image is spelled correctly but the tag does not exist, Kubernetes will keep retrying and backing off.
Verify the image reference in the Deployment manifest and compare it with the running pod spec. A deployment might point to app:v2 while the pod spec is still using app:v1 because the rollout has not completed. That difference matters when you are chasing a behavior change after a release.
CrashLoopBackOff means the container starts, exits, and then gets restarted repeatedly. This is not an image problem. It is usually an application startup problem, a missing environment variable, a bad config file, or a command/entrypoint issue. Use kubectl logs to inspect output, and use kubectl logs --previous when the container has already restarted.
Also check the container command, args, entrypoint behavior, and environment variables. A good image can fail immediately if the command expects a configuration file that is not mounted, or if an environment variable is empty. In real incidents, this is one of the most common sources of deployment errors.
According to the Kubernetes documentation, container status and event messages are the primary indicators for image and runtime failures. That official guidance matches what operators see in production: the container tells you what failed if you read the logs carefully.
- Check kubectl describe pod for pull and startup events.
- Check kubectl logs <pod> for current container output.
- Check kubectl logs <pod> --previous for the last crashed instance.
- Compare the deployment image tag with the image actually running in the pod.
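The last check, comparing the template image with what the pod actually runs, can be done with two jsonpath queries. A sketch, assuming a single-container pod; the image_drift helper is hypothetical:

```shell
# Live usage (requires cluster access):
#   want=$(kubectl get deployment <name> -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   have=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.spec.containers[0].image}')

# image_drift is a hypothetical helper that reports whether the pod is
# still running an older image than the deployment template asks for.
image_drift() {
  local want="$1" have="$2"
  if [ "$want" = "$have" ]; then
    echo "in sync: $have"
  else
    echo "drift: template wants $want, pod runs $have"
  fi
}

image_drift app:v2 app:v1   # drift: template wants app:v2, pod runs app:v1
```

Drift here usually means the rollout has not completed, which sends you back to kubectl rollout status rather than to the application logs.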
Warning
Do not assume a CrashLoopBackOff is caused by Kubernetes itself. In many cases, the app is failing before it ever becomes ready enough for traffic.
Checking Resource Limits, Probes, And Scheduling Constraints
Insufficient CPU or memory requests can delay scheduling or trigger node pressure issues. If a pod requests more resources than any node can provide, it will sit Pending. If limits are too tight, the effect depends on the resource: exceeding a memory limit gets the container killed by the kernel's OOM killer, while hitting a CPU limit causes throttling rather than termination. Both lead to instability and restarts.
kubectl describe pod is the best way to see failed scheduling reasons such as taints, node selectors, affinity rules, or insufficient resources. The scheduler often explains exactly why it rejected a node. If every node fails the same predicate, the issue is usually in the pod specification, not the cluster.
Read liveness, readiness, and startup probes separately. Liveness probes restart a container that appears unhealthy. Readiness probes keep traffic away until the app is ready. Startup probes give slow-starting applications time to initialize. If a liveness probe is too aggressive, it can kill a container that would have become healthy a few seconds later.
Probe failures show up in events. Correlate those timestamps with the application startup time. If your app needs 90 seconds to warm a cache and your startup probe allows only 30 seconds, repeated restarts are expected. For a slower Java or .NET service, that mismatch is a common root cause.
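That mismatch is simple arithmetic: a startup probe gives the container roughly initialDelaySeconds plus failureThreshold times periodSeconds before the kubelet gives up. A sketch using a hypothetical probe_budget helper to compare that window with the application's warmup time:

```shell
# The probe time budget is approximately:
#   initialDelaySeconds + failureThreshold * periodSeconds
# probe_budget is a hypothetical helper that compares it to the warmup time.
probe_budget() {
  local initial_delay="$1" failure_threshold="$2" period="$3" warmup="$4"
  local budget=$(( initial_delay + failure_threshold * period ))
  if [ "$budget" -lt "$warmup" ]; then
    echo "restart loop likely: budget ${budget}s < warmup ${warmup}s"
  else
    echo "ok: budget ${budget}s covers warmup ${warmup}s"
  fi
}

# A service that needs 90s to warm a cache, with a 30s probe window:
probe_budget 0 3 10 90   # restart loop likely: budget 30s < warmup 90s
```

Running the numbers before an incident is cheaper than discovering the mismatch through repeated CrashLoopBackOff events.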
Resource limits can also produce OOMKilled states. That means the process exceeded its memory limit and was terminated. CPU throttling is less visible but can still cause timeouts, failed readiness checks, and poor responsiveness under load.
The Kubernetes resource management documentation explains how requests and limits influence scheduling and runtime behavior. For operators, the practical lesson is simple: make resource settings reflect reality, not hope.
| Problem Pattern | Likely Cause |
|---|---|
| Pending pod with no node assigned | Insufficient resources, taints, affinity mismatch |
| Running pod that keeps restarting | OOMKilled, bad command, failing probe, app crash |
| Pod marked unready but not restarting | Readiness probe failure or dependency outage |
Use these checks to separate cluster-level scheduling problems from application-level startup issues. That distinction shortens troubleshooting immediately.
Investigating Configuration, Secrets, And Environment Problems
Incorrect ConfigMaps, Secrets, or environment variable values can break a container that otherwise appears healthy. The image is fine. The runtime is fine. The application simply cannot find the values it needs to start correctly.
Use kubectl describe pod to inspect mounted volumes and environment references. Use kubectl get deployment -o yaml when you need the full template, because the deployment often shows the exact config keys, secret names, mount paths, and variable references. That is where many deployment errors hide.
Missing keys, wrong file paths, or incorrectly encoded Secret values often appear as application startup errors. For example, a pod may look healthy from Kubernetes’ point of view while the application logs report “config file not found” or “invalid token format.” Base64 encoding mistakes are especially common when Secrets are created manually.
Always verify namespace consistency. A ConfigMap in one namespace does not satisfy a pod in another. A Secret that exists in dev will not magically work in prod. This sounds obvious, but namespace drift causes a surprising number of support tickets.
When updating configuration, do it safely. Confirm whether the application reloads config dynamically or needs a restart to pick up changes. If it requires a restart, recycle the pods after the update so you are testing the new data, not the old cached values.
Key Takeaway
When a container fails without a Kubernetes-level error, configuration is a top suspect. Check namespaces, keys, paths, and restart behavior before changing the image.
- Inspect environment values in the pod spec.
- Confirm ConfigMap and Secret names exist in the same namespace.
- Validate mount paths and file permissions.
- Restart pods if the app only reads configuration on startup.
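The base64 mistakes mentioned above usually come from encoding values with echo, which appends a newline that gets encoded into the Secret. The difference is easy to demonstrate locally; the kubectl line in the comments shows where a real value would come from, and .data.token is a hypothetical key name:

```shell
# A Secret value encoded with echo carries a trailing newline; the same
# value encoded with printf does not. The newline is invisible in the
# manifest but corrupts tokens and passwords at runtime.
with_echo=$(echo 'supersecret' | base64)
with_printf=$(printf 'supersecret' | base64)

echo "$with_echo"     # c3VwZXJzZWNyZXQK   <- trailing newline was encoded
echo "$with_printf"   # c3VwZXJzZWNyZXQ=   <- clean

# To inspect a live value byte by byte (requires cluster access):
#   kubectl get secret <name> -n <namespace> -o jsonpath='{.data.token}' | base64 -d | od -c
```

If the decoded output from a live Secret ends in `\n` when the application expects a bare token, you have found the startup error without touching the image at all.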
Debugging Networking And Service Reachability
A pod can be running successfully while the service is still unreachable. That is a common source of confusion in Kubernetes. The pod may be healthy, but the Service selector does not match labels, the endpoints are empty, or ingress rules are misconfigured.
Use kubectl get svc, kubectl describe svc, and kubectl get endpoints to check whether the Service is attached to backend pods. If the service has no endpoints, traffic cannot be routed, even if the pods themselves are fine. Label mismatches are one of the most frequent causes.
Testing from inside the cluster is often the fastest way to separate application failure from network failure. A debug pod or ephemeral container can confirm DNS resolution, service resolution, and port connectivity. If the pod can reach the service internally but not externally, the problem may be ingress, load balancer, or firewall related.
Common traffic failures include wrong labels, missing endpoints, port mismatches, and ingress misconfiguration. Another issue to check is NetworkPolicies. If a policy blocks traffic, everything can look healthy at the pod and service layer while connections still fail.
According to Kubernetes service networking documentation, Services route traffic by selector and endpoints, not by wishful thinking. If the selector does not match the pod labels, no backend is attached.
- Confirm Service selectors match pod labels exactly.
- Check whether endpoints exist for the Service.
- Validate targetPort and containerPort alignment.
- Review NetworkPolicies if traffic is blocked only between workloads.
For deeper cluster diagnostics, test DNS with nslookup or dig inside a debug pod, then test the application port with curl or nc. That sequence tells you whether the failure is name resolution, routing, or app responsiveness.
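The selector-to-label comparison can also be checked offline once you have both values from kubectl. A sketch, assuming comma-separated key=value strings like those shown by kubectl get pods --show-labels; selector_matches is a hypothetical helper:

```shell
# Live usage (requires cluster access):
#   kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <namespace> --show-labels
#   kubectl get endpoints <name> -n <namespace>

# selector_matches is a hypothetical helper: it succeeds only if every
# key=value pair in the selector appears among the pod's labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;
      *) unset IFS; return 1 ;;
    esac
  done
  unset IFS
  return 0
}

selector_matches "app=web" "app=web,tier=frontend" && echo "matched"
selector_matches "app=web,env=prod" "app=web,tier=frontend" || echo "no backend attached"
```

A Service whose selector fails this check will show an empty endpoints list, which is exactly the "running pod, unreachable service" symptom described above.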
Using Rollouts And Revision History To Find Regression Sources
kubectl rollout history helps identify which deployment revision introduced a failure. If a service worked yesterday and broke after a release, revision history is one of the first places to look. It tells you whether the change came from the pod template, the image tag, or another manifest update.
Compare the current and previous ReplicaSets to see what changed. A new environment variable, image digest, probe setting, or resource limit can create a failure even when the application code itself is unchanged. Rollout metadata helps separate infrastructure changes from application regressions.
kubectl rollout undo is a practical rollback option when a recent release caused the problem. It is not a substitute for fixing the root cause, but it is often the fastest way to restore service. In an incident, restoring the last known good revision may be more valuable than spending twenty minutes debating the cause while the service stays down.
Always verify whether the issue is caused by application code, image changes, or manifest changes. A new container image with the same tag can behave differently if the build process changed. A manifest update can break readiness even when the code is untouched. Narrow the scope by testing one replica or one environment at a time when possible.
Kubernetes deployment documentation explains how revisions map to rollout state, and that matters in production because it lets you identify the exact change window.
- Use kubectl rollout history deployment/<name> to list revisions.
- Use kubectl rollout history deployment/<name> --revision=<n> to inspect details.
- Use kubectl rollout undo deployment/<name> to revert quickly if needed.
- Validate one environment before promoting the same manifest elsewhere.
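Picking the last known good revision can be scripted from the history output. A sketch, assuming the standard two-column REVISION / CHANGE-CAUSE table that kubectl rollout history prints; previous_revision is a hypothetical helper:

```shell
# Live usage (requires cluster access); --to-revision is a standard flag
# on kubectl rollout undo:
#   kubectl rollout history deployment/<name> -n <namespace>
#   kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<n>

# previous_revision is a hypothetical helper: given rollout history output
# on stdin, it prints the second-to-last revision number, i.e. the last
# known good candidate before the current one.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = cur; cur = $1 } END { print prev }'
}
```

During an incident, feeding that number to kubectl rollout undo --to-revision restores the previous template without anyone reading YAML diffs under pressure.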
Building A Repeatable Troubleshooting Workflow
A repeatable workflow prevents guesswork. Start with status, move to describe output, inspect logs, verify configuration, test networking, and review rollout history. That order works because each step narrows the problem space without jumping ahead.
Use label selectors to target the exact workload and avoid noise from unrelated pods. In a large namespace, unlabeled queries can hide the real issue behind healthy pods from other deployments. Precision matters, especially during incident response.
Record findings from events and logs so recurring patterns become visible. If the same service repeatedly fails on the same readiness probe or the same Secret key, the fix is probably structural, not random. Good notes turn into faster future response. They also help when a second engineer needs to continue the investigation.
Escalate to cluster-level checks when kubectl data points there. If pods are pending because no node can fit them, check node health. If mounts fail, investigate storage. If kubelet messages or node pressure appear, the issue may be outside the application namespace. A disciplined workflow tells you when to stop blaming the app.
The NIST NICE Workforce Framework emphasizes repeatable operational tasks and role-based troubleshooting skills. That same principle applies here: build a personal checklist you can use every time. Vision Training Systems recommends a short, reusable sequence so you can respond the same way under pressure.
- Check deployment, pod, and ReplicaSet status.
- Describe the failing pod and read the newest events first.
- Pull current logs and previous logs if containers restart.
- Verify image, config, Secret, and volume references.
- Test service reachability and network policy behavior.
- Review rollout history and compare revisions.
Pro Tip
Create a one-page troubleshooting checklist for your team. In a real incident, a checklist beats memory every time.
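That checklist can live as a small script so every responder runs the same sequence. A sketch; triage is a hypothetical helper, and the `-l app=$name` selector assumes a labeling convention your workloads may not follow. It only prints the commands rather than executing them, so nothing runs against the cluster without review:

```shell
# triage is a hypothetical helper: given a deployment name and namespace,
# it prints the diagnostic sequence in checklist order. Placeholders like
# <failing-pod> are filled in by the responder.
triage() {
  local name="$1" ns="$2"
  cat <<EOF
kubectl get deployment $name -n $ns
kubectl get pods -n $ns -l app=$name
kubectl get rs -n $ns
kubectl describe pod <failing-pod> -n $ns
kubectl logs <failing-pod> -n $ns --previous
kubectl get svc,endpoints -n $ns
kubectl rollout history deployment/$name -n $ns
EOF
}

triage web prod
```

Printing instead of executing keeps the human in the loop while still guaranteeing the order never changes between incidents or engineers.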
Conclusion
Most Kubernetes deployment failures fall into a small set of categories: scheduling problems, image pull issues, startup crashes, probe failures, configuration mistakes, and networking or service reachability problems. The right kubectl commands make each one easier to isolate. kubectl get shows what is unhealthy, kubectl describe explains why, kubectl logs shows what the application said, and kubectl rollout history shows what changed.
The practical pattern is always the same. Move from status to events to logs to configuration, then test service behavior and review rollout state. That sequence keeps you from guessing and helps you separate application issues from cluster issues. It also works across namespaces, teams, and environments.
Before you assume the code is broken, confirm the namespace, labels, and rollout state. Those three checks eliminate a surprising number of false leads. If the namespace is wrong, the label selector misses, or the rollout has not completed, the problem is often not where you first looked.
For teams that want to build stronger operational habits, Vision Training Systems can help you turn this process into a repeatable skill. If your staff needs a practical Kubernetes troubleshooting workflow, start there. The faster your team can diagnose deployment errors, the faster your platform recovers.