Introduction
Kubernetes deployments fail for ordinary reasons, not mysterious ones. A manifest can validate cleanly and still produce broken rollouts because the image tag is wrong, a probe is too aggressive, a Secret is missing, or a pod cannot be scheduled onto any node. That is why troubleshooting in Kubernetes is a process, not a guess.
This guide focuses on practical kubectl commands you can use under pressure. The goal is simple: determine whether the failure sits in scheduling, image pulling, application startup, networking, configuration, or resource limits. Once you know the layer, the fix becomes much faster.
You do not need to inspect every object in the cluster. You need a repeatable path that starts with deployment errors, moves through pod events and logs, and ends with the exact control point causing the outage. That is the difference between flailing and real cluster diagnostics.
For teams running production workloads, that discipline matters. CNCF adoption data and vendor documentation show how much operational complexity sits inside Kubernetes-based systems, which is why a fast diagnostic method saves time every week. Vision Training Systems teaches this same workflow mindset: identify the layer, confirm the symptom, then isolate the root cause.
Understanding The Deployment Problem Space
A Deployment manages rollout strategy and desired state. A ReplicaSet ensures the requested number of pod replicas exist. A Pod is the smallest deployable unit. A container is the runtime process inside the pod. Each layer can fail differently, and Kubernetes reports those failures at different points in the chain.
That matters because a healthy Deployment does not always mean healthy pods. A ReplicaSet can exist while pods stay pending. A pod can run while the container crashes. A container can start while the service remains unreachable. The status line you see is often a clue, not the full answer.
Common failure patterns include Pending, CrashLoopBackOff, ImagePullBackOff, and running pods that still cannot serve traffic. Pending usually points to scheduling or resource pressure. Image pull failures usually point to image names, registry access, or network access. Crash loops usually point to app startup, configuration, or probe problems.
Always check the namespace first. Many “missing resource” issues are really scope issues. If you query the wrong namespace, the deployment may look absent even though it is present and failing elsewhere. That simple mistake wastes time during a production incident.
Note
Kubernetes status values are symptoms, not final diagnoses. Treat them as the start of cluster diagnostics, not the end.
For a useful mental model, read each layer as a dependency chain. If the Deployment is fine but the ReplicaSet is stale, the rollout may be blocked. If the ReplicaSet is fine but the Pod is pending, the scheduler is likely the bottleneck. If the Pod is running but the container is crash looping, the application or its runtime settings are the issue.
Starting With High-Level Status Checks for Kubernetes Troubleshooting
Start broad. Use kubectl get deployments, kubectl get pods, and kubectl get rs to see what is unhealthy before you dive into details. These commands give you a quick comparison of desired state versus actual state. If the numbers do not match, you already know the rollout is incomplete.
The READY column is especially useful. A deployment showing 3/3 ready is healthy in a way that 0/3 or 2/3 is not. The AGE column also helps. A workload created two minutes ago that is still not ready deserves immediate attention. A workload that has been stable for days and suddenly degraded likely changed recently or hit a new dependency failure.
kubectl describe deployment gives you events, rollout conditions, and replica information in one place. It often reveals whether the rollout is stalled, whether old ReplicaSets still exist, and whether Kubernetes is waiting for pods to become available. For quick confirmation, kubectl rollout status tells you whether the Deployment is progressing, stuck, or timed out.
A practical check is to compare desired replicas and available replicas. If desired is 6 and available is 0, you have a total failure. If desired is 6 and available is 5, the issue may be partial, localized, or intermittent. That distinction matters because the fix may be different.
- Use kubectl get deployments -n <namespace> to check desired versus available replicas.
- Use kubectl get pods -n <namespace> to see pod phase and restart counts.
- Use kubectl get rs -n <namespace> to identify whether a new ReplicaSet was created during rollout.
- Use kubectl rollout status deployment/<name> -n <namespace> to confirm progress or timeout.
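The desired-versus-available comparison above can be scripted once you have pulled the two counts. A minimal sketch, assuming you capture them with kubectl get deployment -o jsonpath as shown in the comments; the classify_rollout helper is hypothetical, not a kubectl feature:

```shell
# Pull the counts from the live object (requires cluster access):
#   desired=$(kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.replicas}')
#   available=$(kubectl get deployment <name> -n <namespace> -o jsonpath='{.status.availableReplicas}')
# Note: availableReplicas is omitted from status when it is zero, so default it to 0.

# classify_rollout is a hypothetical helper that labels the gap between
# desired and available replicas.
classify_rollout() {
  local desired="$1" available="${2:-0}"
  if [ "$desired" -gt 0 ] && [ "$available" -eq 0 ]; then
    echo "total failure"
  elif [ "$available" -lt "$desired" ]; then
    echo "partial failure"
  else
    echo "healthy"
  fi
}

classify_rollout 6 0   # total failure
classify_rollout 6 5   # partial failure
classify_rollout 3 3   # healthy
```

The labels map directly to the triage distinction above: a total failure points at something systemic (image, config, scheduling), while a partial failure suggests a localized or intermittent problem.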
Kubernetes rarely “breaks” in one place. The fastest fix usually comes from tracing the first unhealthy object in the chain.
Inspecting Pod Events And Conditions
kubectl describe pod is one of the most useful troubleshooting commands in Kubernetes. It shows pod metadata, container states, conditions, volumes, environment references, and the event stream. That event stream often contains the exact reason the pod is not behaving as expected.
Look for event messages about image pull failures, scheduling constraints, probe failures, or volume mount issues. For example, “Failed to pull image” points you to registry access or a bad image reference. “0/3 nodes are available” points to cluster capacity, taints, or affinity rules. “Readiness probe failed” often means the app is live but not ready to serve traffic yet.
Pod conditions matter too. Initialized means init containers completed. PodScheduled means the scheduler assigned a node. ContainersReady means the containers passed readiness checks. Ready means the pod can receive traffic according to Kubernetes. A pod can be running and still not be Ready.
Timestamps and repeated events help you determine whether the problem is persistent or intermittent. If the same warning repeats every few seconds, the failure is probably continuous. If events stop after a single occurrence, you may be seeing a transient issue. Always check the most recent events first, especially when a pod has many warnings or restart cycles.
Pro Tip
When pod events are noisy, focus on the newest failure messages and the first event after scheduling. That sequence usually points to the root cause faster than reading from the top.
Useful commands include:
- kubectl describe pod <pod-name> -n <namespace>
- kubectl get pod <pod-name> -n <namespace> -o wide
- kubectl get events -n <namespace> --sort-by=.lastTimestamp
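When the event stream is noisy, a small filter helps surface only the newest warnings. A sketch, assuming the default five-column tabular output of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); the newest_warnings helper is hypothetical:

```shell
# Live usage (requires cluster access); --field-selector and --sort-by are
# standard kubectl flags:
#   kubectl get events -n <namespace> \
#     --field-selector type=Warning --sort-by=.lastTimestamp

# newest_warnings is a hypothetical helper: given event output on stdin,
# it keeps only Warning rows (column 2 in the default layout) and prints
# the last N, which are the newest if the input is sorted by .lastTimestamp.
newest_warnings() {
  local n="${1:-3}"
  awk '$2 == "Warning"' | tail -n "$n"
}
```

Piping the sorted event list through a filter like this applies the Pro Tip directly: you read the newest failure messages first instead of scrolling from the top.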
Diagnosing Image And Container Startup Issues
ImagePullBackOff and ErrImagePull usually mean the container image could not be retrieved. Common causes include a wrong image name, a bad tag, missing private registry credentials, or network problems reaching the registry. If the image is spelled correctly but the tag does not exist, Kubernetes will keep retrying and backing off.
Verify the image reference in the Deployment manifest and compare it with the running pod spec. A deployment might point to app:v2 while the pod spec is still using app:v1 because the rollout has not completed. That difference matters when you are chasing a behavior change after a release.
CrashLoopBackOff means the container starts, exits, and then gets restarted repeatedly. This is not an image problem. It is usually an application startup problem, a missing environment variable, a bad config file, or a command/entrypoint issue. Use kubectl logs to inspect output, and use kubectl logs --previous when the container has already restarted.
Also check the container command, args, entrypoint behavior, and environment variables. A good image can fail immediately if the command expects a configuration file that is not mounted, or if an environment variable is empty. In real incidents, this is one of the most common sources of deployment errors.
According to the Kubernetes documentation, container status and event messages are the primary indicators for image and runtime failures. That official guidance matches what operators see in production: the container tells you what failed if you read the logs carefully.
- Check kubectl describe pod for pull and startup events.
- Check kubectl logs <pod> for current container output.
- Check kubectl logs <pod> --previous for the last crashed instance.
- Compare the deployment image tag with the image actually running in the pod.
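The last check, comparing the template image with what the pod actually runs, can be done with two jsonpath queries. A sketch, assuming a single-container pod; the image_drift helper is hypothetical:

```shell
# Live usage (requires cluster access):
#   want=$(kubectl get deployment <name> -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   have=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.spec.containers[0].image}')

# image_drift is a hypothetical helper that reports whether the pod is
# still running an older image than the deployment template asks for.
image_drift() {
  local want="$1" have="$2"
  if [ "$want" = "$have" ]; then
    echo "in sync: $have"
  else
    echo "drift: template wants $want, pod runs $have"
  fi
}

image_drift app:v2 app:v1   # drift: template wants app:v2, pod runs app:v1
```

Drift here usually means the rollout has not completed, which sends you back to kubectl rollout status rather than to the application logs.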
Warning
Do not assume a CrashLoopBackOff is caused by Kubernetes itself. In many cases, the app is failing before it ever becomes ready enough for traffic.
Checking Resource Limits, Probes, And Scheduling Constraints
Insufficient CPU or memory requests can delay scheduling or trigger node pressure issues. If a pod requests more resources than any node can provide, it will sit Pending. If limits are too tight, the effect depends on the resource: exceeding a memory limit gets the container killed by the kernel's OOM killer, while hitting a CPU limit causes throttling rather than termination. Both lead to instability and restarts.
kubectl describe pod is the best way to see failed scheduling reasons such as taints, node selectors, affinity rules, or insufficient resources. The scheduler often explains exactly why it rejected a node. If every node fails the same predicate, the issue is usually in the pod specification, not the cluster.
Read liveness, readiness, and startup probes separately. Liveness probes restart a container that appears unhealthy. Readiness probes keep traffic away until the app is ready. Startup probes give slow-starting applications time to initialize. If a liveness probe is too aggressive, it can kill a container that would have become healthy a few seconds later.
Probe failures show up in events. Correlate those timestamps with the application startup time. If your app needs 90 seconds to warm a cache and your startup probe allows only 30 seconds, repeated restarts are expected. For a slower Java or .NET service, that mismatch is a common root cause.
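That mismatch is simple arithmetic: a startup probe gives the container roughly initialDelaySeconds plus failureThreshold times periodSeconds before the kubelet gives up. A sketch using a hypothetical probe_budget helper to compare that window with the application's warmup time:

```shell
# The probe time budget is approximately:
#   initialDelaySeconds + failureThreshold * periodSeconds
# probe_budget is a hypothetical helper that compares it to the warmup time.
probe_budget() {
  local initial_delay="$1" failure_threshold="$2" period="$3" warmup="$4"
  local budget=$(( initial_delay + failure_threshold * period ))
  if [ "$budget" -lt "$warmup" ]; then
    echo "restart loop likely: budget ${budget}s < warmup ${warmup}s"
  else
    echo "ok: budget ${budget}s covers warmup ${warmup}s"
  fi
}

# A service that needs 90s to warm a cache, with a 30s probe window:
probe_budget 0 3 10 90   # restart loop likely: budget 30s < warmup 90s
```

Running the numbers before an incident is cheaper than discovering the mismatch through repeated CrashLoopBackOff events.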
Resource limits can also produce OOMKilled states. That means the process exceeded its memory limit and was terminated. CPU throttling is less visible but can still cause timeouts, failed readiness checks, and poor responsiveness under load.
The Kubernetes resource management documentation explains how requests and limits influence scheduling and runtime behavior. For operators, the practical lesson is simple: make resource settings reflect reality, not hope.
| Problem Pattern | Likely Cause |
|---|---|
| Pending pod with no node assigned | Insufficient resources, taints, affinity mismatch |
| Running pod that keeps restarting | OOMKilled, bad command, failing probe, app crash |
| Pod marked unready but not restarting | Readiness probe failure or dependency outage |
Use these checks to separate cluster-level scheduling problems from application-level startup issues. That distinction shortens troubleshooting immediately.
Investigating Configuration, Secrets, And Environment Problems
Incorrect ConfigMaps, Secrets, or environment variable values can break a container that otherwise appears healthy. The image is fine. The runtime is fine. The application simply cannot find the values it needs to start correctly.
Use kubectl describe pod to inspect mounted volumes and environment references. Use kubectl get deployment -o yaml when you need the full template, because the deployment often shows the exact config keys, secret names, mount paths, and variable references. That is where many deployment errors hide.
Missing keys, wrong file paths, or incorrectly encoded Secret values often appear as application startup errors. For example, a pod may look healthy from Kubernetes’ point of view while the application logs report “config file not found” or “invalid token format.” Base64 encoding mistakes are especially common when Secrets are created manually.
Always verify namespace consistency. A ConfigMap in one namespace does not satisfy a pod in another. A Secret that exists in dev will not magically work in prod. This sounds obvious, but namespace drift causes a surprising number of support tickets.
When updating configuration, do it safely. Confirm whether the application reloads config dynamically or needs a restart to pick up changes. If it requires a restart, recycle the pods after the update so you are testing the new data, not the old cached values.
Key Takeaway
When a container fails without a Kubernetes-level error, configuration is a top suspect. Check namespaces, keys, paths, and restart behavior before changing the image.
- Inspect environment values in the pod spec.
- Confirm ConfigMap and Secret names exist in the same namespace.
- Validate mount paths and file permissions.
- Restart pods if the app only reads configuration on startup.
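The base64 mistakes mentioned above usually come from encoding values with echo, which appends a newline that gets encoded into the Secret. The difference is easy to demonstrate locally; the kubectl line in the comments shows where a real value would come from, and .data.token is a hypothetical key name:

```shell
# A Secret value encoded with echo carries a trailing newline; the same
# value encoded with printf does not. The newline is invisible in the
# manifest but corrupts tokens and passwords at runtime.
with_echo=$(echo 'supersecret' | base64)
with_printf=$(printf 'supersecret' | base64)

echo "$with_echo"     # c3VwZXJzZWNyZXQK   <- trailing newline was encoded
echo "$with_printf"   # c3VwZXJzZWNyZXQ=   <- clean

# To inspect a live value byte by byte (requires cluster access):
#   kubectl get secret <name> -n <namespace> -o jsonpath='{.data.token}' | base64 -d | od -c
```

If the decoded output from a live Secret ends in `\n` when the application expects a bare token, you have found the startup error without touching the image at all.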
Debugging Networking And Service Reachability
A pod can be running successfully while the service is still unreachable. That is a common source of confusion in Kubernetes. The pod may be healthy, but the Service selector does not match labels, the endpoints are empty, or ingress rules are misconfigured.
Use kubectl get svc, kubectl describe svc, and kubectl get endpoints to check whether the Service is attached to backend pods. If the service has no endpoints, traffic cannot be routed, even if the pods themselves are fine. Label mismatches are one of the most frequent causes.
Testing from inside the cluster is often the fastest way to separate application failure from network failure. A debug pod or ephemeral container can confirm DNS resolution, service resolution, and port connectivity. If the pod can reach the service internally but not externally, the problem may be ingress, load balancer, or firewall related.
Common traffic failures include wrong labels, missing endpoints, port mismatches, and ingress misconfiguration. Another issue to check is NetworkPolicies. If a policy blocks traffic, everything can look healthy at the pod and service layer while connections still fail.
According to Kubernetes service networking documentation, Services route traffic by selector and endpoints, not by wishful thinking. If the selector does not match the pod labels, no backend is attached.
- Confirm Service selectors match pod labels exactly.
- Check whether endpoints exist for the Service.
- Validate targetPort and containerPort alignment.
- Review NetworkPolicies if traffic is blocked only between workloads.
For deeper cluster diagnostics, test DNS with nslookup or dig inside a debug pod, then test the application port with curl or nc. That sequence tells you whether the failure is name resolution, routing, or app responsiveness.
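The selector-to-label comparison can also be checked offline once you have both values from kubectl. A sketch, assuming comma-separated key=value strings like those shown by kubectl get pods --show-labels; selector_matches is a hypothetical helper:

```shell
# Live usage (requires cluster access):
#   kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <namespace> --show-labels
#   kubectl get endpoints <name> -n <namespace>

# selector_matches is a hypothetical helper: it succeeds only if every
# key=value pair in the selector appears among the pod's labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;
      *) unset IFS; return 1 ;;
    esac
  done
  unset IFS
  return 0
}

selector_matches "app=web" "app=web,tier=frontend" && echo "matched"
selector_matches "app=web,env=prod" "app=web,tier=frontend" || echo "no backend attached"
```

A Service whose selector fails this check will show an empty endpoints list, which is exactly the "running pod, unreachable service" symptom described above.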
Using Rollouts And Revision History To Find Regression Sources
kubectl rollout history helps identify which deployment revision introduced a failure. If a service worked yesterday and broke after a release, revision history is one of the first places to look. It tells you whether the change came from the pod template, the image tag, or another manifest update.
Compare the current and previous ReplicaSets to see what changed. A new environment variable, image digest, probe setting, or resource limit can create a failure even when the application code itself is unchanged. Rollout metadata helps separate infrastructure changes from application regressions.
kubectl rollout undo is a practical rollback option when a recent release caused the problem. It is not a substitute for fixing the root cause, but it is often the fastest way to restore service. In an incident, restoring the last known good revision may be more valuable than spending twenty minutes debating the cause while the service stays down.
Always verify whether the issue is caused by application code, image changes, or manifest changes. A new container image with the same tag can behave differently if the build process changed. A manifest update can break readiness even when the code is untouched. Narrow the scope by testing one replica or one environment at a time when possible.
Kubernetes deployment documentation explains how revisions map to rollout state, and that matters in production because it lets you identify the exact change window.
- Use kubectl rollout history deployment/<name> to list revisions.
- Use kubectl rollout history deployment/<name> --revision=<n> to inspect details.
- Use kubectl rollout undo deployment/<name> to revert quickly if needed.
- Validate one environment before promoting the same manifest elsewhere.
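Picking the last known good revision can be scripted from the history output. A sketch, assuming the standard two-column REVISION / CHANGE-CAUSE table that kubectl rollout history prints; previous_revision is a hypothetical helper:

```shell
# Live usage (requires cluster access); --to-revision is a standard flag
# on kubectl rollout undo:
#   kubectl rollout history deployment/<name> -n <namespace>
#   kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<n>

# previous_revision is a hypothetical helper: given rollout history output
# on stdin, it prints the second-to-last revision number, i.e. the last
# known good candidate before the current one.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = cur; cur = $1 } END { print prev }'
}
```

During an incident, feeding that number to kubectl rollout undo --to-revision restores the previous template without anyone reading YAML diffs under pressure.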
Building A Repeatable Troubleshooting Workflow
A repeatable workflow prevents guesswork. Start with status, move to describe output, inspect logs, verify configuration, test networking, and review rollout history. That order works because each step narrows the problem space without jumping ahead.
Use label selectors to target the exact workload and avoid noise from unrelated pods. In a large namespace, unlabeled queries can hide the real issue behind healthy pods from other deployments. Precision matters, especially during incident response.
Record findings from events and logs so recurring patterns become visible. If the same service repeatedly fails on the same readiness probe or the same Secret key, the fix is probably structural, not random. Good notes turn into faster future response. They also help when a second engineer needs to continue the investigation.
Escalate to cluster-level checks when kubectl data points there. If pods are pending because no node can fit them, check node health. If mounts fail, investigate storage. If kubelet messages or node pressure appear, the issue may be outside the application namespace. A disciplined workflow tells you when to stop blaming the app.
The NIST NICE Workforce Framework emphasizes repeatable operational tasks and role-based troubleshooting skills. That same principle applies here: build a personal checklist you can use every time. Vision Training Systems recommends a short, reusable sequence so you can respond the same way under pressure.
- Check deployment, pod, and ReplicaSet status.
- Describe the failing pod and read the newest events first.
- Pull current logs and previous logs if containers restart.
- Verify image, config, Secret, and volume references.
- Test service reachability and network policy behavior.
- Review rollout history and compare revisions.
Pro Tip
Create a one-page troubleshooting checklist for your team. In a real incident, a checklist beats memory every time.
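That checklist can live as a small script so every responder runs the same sequence. A sketch; triage is a hypothetical helper, and the `-l app=$name` selector assumes a labeling convention your workloads may not follow. It only prints the commands rather than executing them, so nothing runs against the cluster without review:

```shell
# triage is a hypothetical helper: given a deployment name and namespace,
# it prints the diagnostic sequence in checklist order. Placeholders like
# <failing-pod> are filled in by the responder.
triage() {
  local name="$1" ns="$2"
  cat <<EOF
kubectl get deployment $name -n $ns
kubectl get pods -n $ns -l app=$name
kubectl get rs -n $ns
kubectl describe pod <failing-pod> -n $ns
kubectl logs <failing-pod> -n $ns --previous
kubectl get svc,endpoints -n $ns
kubectl rollout history deployment/$name -n $ns
EOF
}

triage web prod
```

Printing instead of executing keeps the human in the loop while still guaranteeing the order never changes between incidents or engineers.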
Conclusion
Most Kubernetes deployment failures fall into a small set of categories: scheduling problems, image pull issues, startup crashes, probe failures, configuration mistakes, and networking or service reachability problems. The right kubectl commands make each one easier to isolate. kubectl get shows what is unhealthy, kubectl describe explains why, kubectl logs shows what the application said, and kubectl rollout history shows what changed.
The practical pattern is always the same. Move from status to events to logs to configuration, then test service behavior and review rollout state. That sequence keeps you from guessing and helps you separate application issues from cluster issues. It also works across namespaces, teams, and environments.
Before you assume the code is broken, confirm the namespace, labels, and rollout state. Those three checks eliminate a surprising number of false leads. If the namespace is wrong, the label selector misses, or the rollout has not completed, the problem is often not where you first looked.
For teams that want to build stronger operational habits, Vision Training Systems can help you turn this process into a repeatable skill. If your staff needs a practical Kubernetes troubleshooting workflow, start there. The faster your team can diagnose deployment errors, the faster your platform recovers.