15:00

Troubleshooting Pod Startup Issues

In a dynamic Kubernetes environment, pods can sometimes fail to start correctly. Understanding how to diagnose these startup issues is a fundamental skill for any Kubernetes practitioner. Common failure states like ImagePullBackOff, CrashLoopBackOff, or a pod being perpetually stuck in Pending can bring applications down. This exercise provides a hands-on simulation of these common problems, guiding you through a systematic troubleshooting process using essential kubectl commands.

Scenario

You are the on-call engineer for a team that has just deployed a new application. Shortly after the deployment, you receive an alert that the application is down. Your mission is to investigate the Kubernetes cluster, identify why the application's pods are not running correctly, and apply the necessary fixes to bring the service back online. You will face three separate challenges, each representing a common real-world problem.

Requirements

This exercise consists of three challenges. You must diagnose and fix the startup issue for each of the three pods created in the Environment Setup section.

Challenge 1: The pod app-challenge-1 is in an ImagePullBackOff state. You must identify the cause and get the pod into a Running state.
Challenge 2: The pod app-challenge-2 is in a CrashLoopBackOff state. You must inspect the logs, identify the configuration error, and get the pod into a Running state.
Challenge 3: The pod app-challenge-3 is in a Pending state. You must identify the resource scheduling conflict and get the pod into a Running state.

Acceptance Criteria

The pod app-challenge-1, initially suffering from an ImagePullBackOff error, is successfully fixed and achieves a Running state.
The pod app-challenge-2, initially stuck in a CrashLoopBackOff loop, is successfully fixed by providing the required configuration and achieves a Running state.
The pod app-challenge-3, initially stuck in a Pending state due to resource constraints, is successfully scheduled and achieves a Running state after its definition is corrected.
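
One way to check these criteria from the command line is sketched below. The check_challenges function is a hypothetical helper, not part of the exercise, and it assumes kubectl is pointed at the cluster and namespace where the pods were created. It tests each pod's Ready condition rather than only the Running phase, because a crash-looping pod can still report a Running phase between restarts.

```shell
# Hypothetical helper: print the Ready condition of each challenge pod and
# return non-zero if any pod is not Ready.
check_challenges() {
  failed=0
  for pod in app-challenge-1 app-challenge-2 app-challenge-3; do
    # Extract the status ("True" or "False") of the pod's Ready condition.
    ready="$(kubectl get pod "$pod" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
    echo "$pod Ready=$ready"
    [ "$ready" = "True" ] || failed=1
  done
  return "$failed"
}
```

Calling check_challenges after applying your fixes prints one line per pod; a zero exit status means all three pods are Ready.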

Environment Setup

To begin the exercise, you must first create the problematic pods in your cluster. This YAML manifest defines three pods, each exhibiting a different common startup failure.

Create a file named broken-pods.yaml with the following content:

# broken-pods.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-challenge-1
spec:
  containers:
  - name: challenge-1-container
    image: nginxx:1.21.0
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: app-challenge-2
spec:
  containers:
  - name: challenge-2-container
    image: busybox:1.35
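    # Double quotes inside the script let the shell expand $MY_CONFIG, so the test fails (and the container exits) when it is unset.
    command: ["/bin/sh", "-c", "echo \"Configuration value is: $MY_CONFIG\" && test -n \"$MY_CONFIG\" && sleep 3600"]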
---
apiVersion: v1
kind: Pod
metadata:
  name: app-challenge-3
spec:
  containers:
  - name: challenge-3-container
    image: nginx:1.21.0
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: "1000"  # 1000 CPU cores - intentionally excessive to trigger scheduling failure

Apply the manifest to your cluster to create the pods:

kubectl apply -f broken-pods.yaml

Confirm that the pods have been created and are in their respective failure states:

kubectl get pods
# Expected output (statuses and restart counts vary with timing; app-challenge-1 may
# briefly show ErrImagePull, and app-challenge-2 may show Error between restarts):
# NAME              READY   STATUS              RESTARTS   AGE
# app-challenge-1   0/1     ImagePullBackOff    0          15s
# app-challenge-2   0/1     CrashLoopBackOff    2          15s
# app-challenge-3   0/1     Pending             0          15s

Verification Note: It may take 30-60 seconds for the pods to reach their expected failure states. Run kubectl get pods -w to watch the status changes in real time, and press Ctrl+C to exit watch mode once all pods show their respective error states.

Your task is to investigate each of the three challenge pods, diagnose the root cause of its failure, and apply a fix to get it into a Running state.
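
While the pods settle, it can help to see the pod phase next to the reason a container is waiting, since the STATUS column merges the two. The following is a minimal sketch; show_wait_reasons is a hypothetical helper name, and it assumes each pod has a single container (true for the three pods in this exercise). A pod that has not been scheduled has no container status yet, so its WAITING column is typically shown as <none>.

```shell
# Hypothetical helper: list each pod's phase next to the reason its first
# container is waiting (for example ImagePullBackOff or CrashLoopBackOff).
show_wait_reasons() {
  kubectl get pods \
    -o 'custom-columns=NAME:.metadata.name,PHASE:.status.phase,WAITING:.status.containerStatuses[0].state.waiting.reason'
}
```

Run show_wait_reasons in the namespace where you applied broken-pods.yaml.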

Resources

Possible Ways to Implement

The kubectl describe pod <pod-name> command is your most powerful tool. Pay close attention to the Events section at the bottom of its output.
For crashing pods, kubectl logs <pod-name> is essential. If the pod is restarting too quickly, use kubectl logs <pod-name> --previous to view logs from the previous (crashed) container run.
When a pod won't schedule, the scheduler's events will tell you why. Look for messages about insufficient resources or other scheduling constraints.
Note: We use specific image versions (nginx:1.21.0, busybox:1.35) for reproducibility. In production environments, always:
- Verify image availability and security status before deployment
- Use images from trusted registries
- Check for the latest stable versions from official sources
- Consider using image vulnerability scanning tools
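
The diagnostic hints above can be combined into a single pass over one pod. This is a sketch under stated assumptions, not a prescribed procedure: triage_pod is a hypothetical helper name, and the script assumes kubectl is configured for the cluster and namespace that hold the challenge pods. It prints the events recorded for the pod name, the current and previous container logs, and each node's allocatable CPU for comparison with the pod's request.

```shell
# Hypothetical helper: gather events, logs, and node CPU capacity for one pod.
# Usage: triage_pod <pod-name>
triage_pod() {
  pod="$1"

  echo "== Events for $pod =="
  # Events recorded for objects with this name, oldest first.
  kubectl get events --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp

  echo "== Logs (current container) =="
  kubectl logs "$pod" || echo "(no logs available)"

  echo "== Logs (previous container run) =="
  kubectl logs "$pod" --previous || echo "(no previous run)"

  echo "== Allocatable CPU per node =="
  # Compare with the pod's CPU request when the scheduler reports insufficient cpu.
  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu
}
```

For example, triage_pod app-challenge-2 shows the crash output from the last container run, while triage_pod app-challenge-3 shows the scheduler's events alongside the CPU each node can actually offer.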

Real-World Significance

Pod startup failures are among the most common issues you will face in a production Kubernetes environment. The three scenarios covered here (ImagePullBackOff, CrashLoopBackOff, and Pending due to resource constraints) account for a large share of those incidents. Mastering kubectl describe and kubectl logs gives you a systematic method for diagnosing them: read the pod's events, read the container's logs, then trace the failure back to its root cause, whether that is a simple typo, a missing configuration value, or an excessive resource request. This skill is fundamental to being a reliable and effective Kubernetes operator.