name: k8s:pods
description: Debug and troubleshoot pod issues including crashes, failures, networking, and resource problems

Troubleshoot Pods Skill

This skill provides systematic approaches to debugging pod issues in the Kagenti platform.

Context-Safe Execution (MANDATORY)

All kubectl/oc commands MUST redirect output to files. Commands below are shown in bare form for readability. When executing, always redirect:

export LOG_DIR=/tmp/kagenti/k8s/${CLUSTER:-local}
mkdir -p "$LOG_DIR"

# Pattern for all kubectl commands:
kubectl <command> > "$LOG_DIR/<descriptive-name>.log" 2>&1 && echo "OK" || echo "FAIL (see $LOG_DIR/<descriptive-name>.log)"

# Analyze in subagent: Task(subagent_type='Explore') to read log files
# Use subagents for BOTH failure analysis AND verifying expected behavior
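
The redirect pattern above can be wrapped in a small helper so that each command only needs a log name. This is a minimal sketch rather than part of the skill: the function name run_logged is hypothetical, and LOG_DIR falls back to the same default path shown above.

```shell
# Hypothetical helper (run_logged is an illustrative name, not part of the skill).
# Runs a command, sends stdout and stderr to $LOG_DIR/<name>.log,
# and prints only a one-line status.
LOG_DIR="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}"
mkdir -p "$LOG_DIR"

run_logged() {
  rl_name="$1"   # descriptive log name
  shift          # remaining arguments are the command to run
  rl_log="$LOG_DIR/$rl_name.log"
  if "$@" > "$rl_log" 2>&1; then
    echo "OK"
  else
    echo "FAIL (see $rl_log)"
  fi
}

# Example (needs cluster access):
#   run_logged pods-all kubectl get pods -A -o wide
```

Only the status line reaches the terminal; the full output stays in the log file for a subagent to read.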

When to Use

Pods are crashlooping or failing
Pods stuck in Pending, ImagePullBackOff, or other error states
User reports application not working
After kagenti:deploy to verify pods are healthy
Investigating resource issues

Quick Pod Status Check

# All pods with status
kubectl get pods -A -o wide

# Only problematic pods
kubectl get pods -A | grep -vE "Running|Completed"

# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Pods with high restart count (with -A, RESTARTS is the fifth column)
kubectl get pods -A --no-headers | awk '$5+0 > 3'

# Recent pod events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
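
The restart-count check can also be packaged as a reusable filter with a configurable threshold. A minimal sketch, assuming the default column layout of kubectl get pods -A (NAMESPACE NAME READY STATUS RESTARTS AGE); the name high_restarts is hypothetical.

```shell
# Hypothetical filter (high_restarts is an illustrative name).
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "namespace/pod restarts" for pods above the threshold (default 3).
# With -A the restart count is field 5; "+ 0" forces a numeric comparison.
high_restarts() {
  awk -v min="${1:-3}" '$5 + 0 > min + 0 { print $1 "/" $2, $5 }'
}

# Example (needs cluster access):
#   kubectl get pods -A --no-headers > "$LOG_DIR/pods.log" 2>&1
#   high_restarts 3 < "$LOG_DIR/pods.log"
```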

Common Pod States

Running - Healthy ✓

The pod is running normally. Still check that the application inside is working correctly.

Pending - Waiting for resources ⏳

# Check why pod is pending
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory on nodes
# - Unbound PersistentVolumeClaim
# - Node selector not matching any nodes
# - Image pull in progress

# Check node resources
kubectl top nodes
kubectl describe nodes
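
For a Pending pod, the scheduler's own explanation is usually in its FailedScheduling events. The sketch below narrows the event list to those entries; the function name scheduling_failures is hypothetical, and it relies on the involvedObject.name and reason field selectors for events.

```shell
# Hypothetical helper (scheduling_failures is an illustrative name).
# Lists only FailedScheduling events for one pod, oldest first.
# $1 = namespace, $2 = pod name
scheduling_failures() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=FailedScheduling" \
    --sort-by='.lastTimestamp'
}

# Example (needs cluster access):
#   scheduling_failures <namespace> <pod-name> > "$LOG_DIR/pending-events.log" 2>&1
```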

CrashLoopBackOff - Application crashing ❌

# Check logs before crash
kubectl logs -n <namespace> <pod-name> --previous

# Check current logs
kubectl logs -n <namespace> <pod-name>

# Check pod events
kubectl describe pod -n <namespace> <pod-name>

# Common causes:
# - Application error on startup
# - Missing configuration/secrets
# - Failed liveness probe
# - Dependency not available

ImagePullBackOff - Cannot pull image ❌

# Check pod events for exact error
kubectl describe pod -n <namespace> <pod-name>

# Check if image exists in registry
docker pull <image-name>

# For Kind cluster, load image manually
kind load docker-image <image-name> --name agent-platform

# Check if image is in Kind cluster
docker exec agent-platform-control-plane crictl images | grep <image-name>

# Common causes:
# - Image doesn't exist
# - Wrong image tag
# - No access to registry (auth)
# - Network issues
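
A wrong or missing tag is easy to overlook in a long image reference. The sketch below extracts the tag a reference resolves to, using the fact that an image without an explicit tag defaults to latest. The function name image_tag is hypothetical, and the parsing is deliberately simple.

```shell
# Hypothetical helper (image_tag is an illustrative name).
# Prints the tag of an image reference, "latest" when no tag is given,
# or "digest:<value>" when the image is pinned by digest.
image_tag() {
  it_ref="$1"
  case "$it_ref" in
    *@*) echo "digest:${it_ref##*@}"; return ;;
  esac
  it_last="${it_ref##*/}"          # drop registry and repository path
  case "$it_last" in
    *:*) echo "${it_last##*:}" ;;  # explicit tag
    *)   echo "latest" ;;          # no tag given
  esac
}

# Example:
#   image_tag localhost:5000/team/app     (registry port is not mistaken for a tag)
```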

Error - Container exited with error ❌

# Check exit code and reason
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "State:"

# Check logs
kubectl logs -n <namespace> <pod-name> --previous

# Common exit codes:
# 0 - Success
# 1 - General error
# 137 - SIGKILL (128 + 9; commonly the OOM killer)
# 143 - SIGTERM (128 + 15; termination requested)
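
Exit codes above 128 conventionally mean the process was killed by a signal numbered (code - 128). A small sketch that decodes the codes listed above; the function name explain_exit_code is hypothetical.

```shell
# Hypothetical helper (explain_exit_code is an illustrative name).
# Maps a container exit code to a short explanation.
explain_exit_code() {
  ec_code="$1"
  if [ "$ec_code" -eq 0 ]; then
    echo "success"
  elif [ "$ec_code" -gt 128 ]; then
    ec_sig=$((ec_code - 128))   # signal number
    case "$ec_sig" in
      9)  echo "killed by signal 9 (SIGKILL, often the OOM killer)" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal $ec_sig" ;;
    esac
  else
    echo "application error (exit code $ec_code)"
  fi
}

# Example:
#   explain_exit_code 137
```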

OOMKilled - Out of memory ❌

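# Check for OOM in events (node-level OOM kills)
kubectl get events -A | grep -i "oom"

# Check the last termination reason of each container (shows OOMKilled)
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'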

# Check pod memory limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"

# Check actual memory usage
kubectl top pod -n <namespace> <pod-name>

# Fix: Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory and resources.requests.memory
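
Before raising a limit, it helps to see how close actual usage is to it. The sketch below compares a usage value from kubectl top with a limit from kubectl describe. The function names to_mib and mem_pct are hypothetical, and only whole-number Mi and Gi quantities are handled.

```shell
# Hypothetical helpers (to_mib and mem_pct are illustrative names).
# Convert a memory quantity to MiB; only integer Mi and Gi values are supported.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Print usage as a whole-number percentage of the limit (rounded down).
# $1 = usage (e.g. 480Mi), $2 = limit (e.g. 512Mi)
mem_pct() {
  mp_used="$(to_mib "$1")" || return 1
  mp_limit="$(to_mib "$2")" || return 1
  echo $(( mp_used * 100 / mp_limit ))
}

# Example:
#   mem_pct 480Mi 512Mi
```

A pod that sits near 100% under normal load is a likely OOM candidate.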

Systematic Troubleshooting

Step 1: Get Pod Details

# Get pod status
kubectl get pod -n <namespace> <pod-name>

# Get full pod description
kubectl describe pod -n <namespace> <pod-name>

# Check pod YAML
kubectl get pod -n <namespace> <pod-name> -o yaml

# Check pod events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Step 2: Check Logs

# Current logs
kubectl logs -n <namespace> <pod-name>

# Previous logs (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# All containers (including sidecars)
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Specific container
kubectl logs -n <namespace> <pod-name> -c <container-name>

# Follow logs
kubectl logs -n <namespace> <pod-name> -f --tail=20
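
Steps 1 and 2 can be captured in one pass so the output lands in files instead of the terminal. This is a sketch under assumptions: the function name collect_pod_diag and the per-pod directory layout are hypothetical, and LOG_DIR defaults to the path used earlier in this skill.

```shell
# Hypothetical helper (collect_pod_diag is an illustrative name).
# Saves describe output, current and previous logs, and events for one pod.
# $1 = namespace, $2 = pod name. Prints the directory holding the files.
collect_pod_diag() {
  cp_ns="$1"
  cp_pod="$2"
  cp_dir="${LOG_DIR:-/tmp/kagenti/k8s/${CLUSTER:-local}}/$cp_ns-$cp_pod"
  mkdir -p "$cp_dir"
  kubectl describe pod -n "$cp_ns" "$cp_pod" > "$cp_dir/describe.log" 2>&1
  kubectl logs -n "$cp_ns" "$cp_pod" --all-containers=true > "$cp_dir/logs.log" 2>&1
  # --previous fails when no container has restarted; the error is kept in the log
  kubectl logs -n "$cp_ns" "$cp_pod" --all-containers=true --previous > "$cp_dir/logs-previous.log" 2>&1
  kubectl get events -n "$cp_ns" --field-selector "involvedObject.name=$cp_pod" > "$cp_dir/events.log" 2>&1
  echo "$cp_dir"
}

# Example (needs cluster access):
#   collect_pod_diag <namespace> <pod-name>
```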

Step 3: Check Resource Constraints

# Check resource usage
kubectl top pod -n <namespace> <pod-name>

# Check resource limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Requests:"

# Check node resources
kubectl top nodes

Step 4: Check Configuration

# Check environment variables
kubectl describe pod -n <namespace> <pod-name> | grep -A20 "Environment:"

# Check mounted secrets
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Mounts:"

# Verify secret exists
kubectl get secret -n <namespace> <secret-name>

# Check configmap
kubectl get configmap -n <namespace> <configmap-name>

Step 5: Check Networking

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Check if pod is in service
kubectl get endpoints -n <namespace> <service-name> -o yaml

# Test connectivity FROM the pod
kubectl exec -n <namespace> <pod-name> -- curl -I http://<service-name>

# Test connectivity TO the pod
kubectl run debug-curl --image=curlimages/curl --rm -it -- \
  curl http://<pod-ip>:<port>

# Check network policies
kubectl get networkpolicy -n <namespace>
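
A service with no ready endpoints will refuse or drop traffic even when its pods are Running. The sketch below turns the endpoints check into a yes/no test by reading the ready addresses from the Endpoints object; the function name service_has_endpoints is hypothetical.

```shell
# Hypothetical helper (service_has_endpoints is an illustrative name).
# Succeeds when the service has at least one ready endpoint address.
# $1 = namespace, $2 = service name
service_has_endpoints() {
  se_ips="$(kubectl get endpoints -n "$1" "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)"
  [ -n "$se_ips" ]
}

# Example (needs cluster access):
#   service_has_endpoints <namespace> <service-name> && echo "OK" || echo "FAIL (no ready endpoints)"
```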

Component-Specific Troubleshooting

Weather Tool / Weather Service

# Check deployment
kubectl get deployment -n team1 weather-tool
kubectl describe deployment -n team1 weather-tool

# Check pods
kubectl get pods -n team1 -l app=weather-tool

# Check service endpoints
kubectl get endpoints -n team1 weather-tool

# Test MCP endpoint (weather-tool)
kubectl exec -n team1 deployment/weather-tool -- \
  curl -sfI http://localhost:8000/health || echo "Health check failed"

# Check for API errors (weather service)
kubectl logs -n team1 deployment/weather-service | grep -iE "api|error|openai"

Keycloak

# Check if Keycloak is deployment or statefulset
kubectl get deployment -n keycloak keycloak 2>/dev/null || kubectl get statefulset -n keycloak keycloak

# Check pod status
kubectl get pods -n keycloak -l app=keycloak

# Check readiness (use statefulset/keycloak if Keycloak runs as a StatefulSet)
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready || echo "Not ready"

# Check PostgreSQL dependency
kubectl get pods -n keycloak -l app=postgresql
kubectl logs -n keycloak deployment/postgresql --tail=50 2>/dev/null

# Common issues:
# - PostgreSQL not ready
# - Database connection failures
# - Memory limits too low (increase to 1Gi)

Platform Operator

# Check operator deployment
kubectl get deployment -n kagenti-system -l control-plane=controller-manager

# Check operator pods
kubectl get pods -n kagenti-system -l control-plane=controller-manager

# Check operator logs for errors (with -l, kubectl logs returns only the last 10 lines unless --tail is set)
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=500 | \
  grep -iE "error|fail"

# Check Component CRD processing
kubectl get components -A
kubectl describe component -n <namespace> <component-name>

# Check if operator is reconciling
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=50 | \
  grep -i "reconcile"

Istio Sidecars

# Check if sidecar is injected
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[*].name}'
# Should show: <app-container> istio-proxy

# Check sidecar status
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[?(@.name=="istio-proxy")].ready}'
# Should show: true

# Check sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy

# Common issues:
# - Sidecar not injected (check namespace label)
# - mTLS errors (check certificates)
# - Connection failures (check virtual services)
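
The injection check can be scripted by testing the container name list for istio-proxy. A minimal sketch, assuming the sidecar appears in .spec.containers as described above; the function name has_sidecar is hypothetical.

```shell
# Hypothetical helper (has_sidecar is an illustrative name).
# Succeeds when "istio-proxy" is one of the space-separated container names.
# $1 = output of the jsonpath query {.spec.containers[*].name}
has_sidecar() {
  case " $1 " in
    *" istio-proxy "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (needs cluster access):
#   names="$(kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[*].name}')"
#   has_sidecar "$names" && echo "OK" || echo "FAIL (no istio-proxy container)"
```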

Interactive Debugging

Execute Commands in Pod

# Get shell access
kubectl exec -n <namespace> <pod-name> -it -- /bin/sh
# or
kubectl exec -n <namespace> <pod-name> -it -- /bin/bash

# Run specific command
kubectl exec -n <namespace> <pod-name> -- ls -la /app
kubectl exec -n <namespace> <pod-name> -- env
kubectl exec -n <namespace> <pod-name> -- cat /etc/resolv.conf

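# Test network connectivity (ping can fail against a healthy ClusterIP service, which usually does not answer ICMP; prefer curl)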
kubectl exec -n <namespace> <pod-name> -- ping <service-name>
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>

Debug with Temporary Pods

# Create debug pod in same namespace
kubectl run debug-pod -n <namespace> --image=busybox --rm -it -- sh

# Test network connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- \
  curl -v http://<service-name>:<port>

# Test DNS resolution
kubectl run debug-dns -n <namespace> --image=busybox --rm -it -- \
  nslookup <service-name>

# Check pod-to-pod connectivity
kubectl run debug-net -n <namespace> --image=nicolaka/netshoot --rm -it -- \
  curl http://<pod-ip>:<port>

Restart and Recovery

Restart Pod

# Delete pod (deployment will recreate)
kubectl delete pod -n <namespace> <pod-name>

# Restart deployment (all pods)
kubectl rollout restart deployment -n <namespace> <deployment-name>

# Scale to zero and back (forces recreation; restore the original replica count, shown here as 1)
kubectl scale deployment -n <namespace> <deployment-name> --replicas=0
kubectl scale deployment -n <namespace> <deployment-name> --replicas=1
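
Scaling back to a fixed number can silently change a deployment that ran more than one replica. The sketch below reads the current count first and restores it afterwards. The function name restart_by_scaling is hypothetical, and it assumes no autoscaler is managing the deployment.

```shell
# Hypothetical helper (restart_by_scaling is an illustrative name).
# Scales a deployment to zero, then back to its previous replica count.
# $1 = namespace, $2 = deployment name
restart_by_scaling() {
  rs_ns="$1"
  rs_dep="$2"
  rs_replicas="$(kubectl get deployment -n "$rs_ns" "$rs_dep" \
    -o jsonpath='{.spec.replicas}')" || return 1
  kubectl scale deployment -n "$rs_ns" "$rs_dep" --replicas=0 || return 1
  kubectl scale deployment -n "$rs_ns" "$rs_dep" --replicas="$rs_replicas"
}

# Example (needs cluster access):
#   restart_by_scaling <namespace> <deployment-name> > "$LOG_DIR/restart.log" 2>&1
```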

Force Redeploy

# Update deployment to force new pods
kubectl patch deployment -n <namespace> <deployment-name> \
  -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"kubectl.kubernetes.io/restartedAt\":\"$(date +%Y-%m-%dT%H:%M:%S)\"}}}}}"

# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>

Rollback Deployment

# Check deployment history
kubectl rollout history deployment -n <namespace> <deployment-name>

# Rollback to previous version
kubectl rollout undo deployment -n <namespace> <deployment-name>

# Rollback to specific revision
kubectl rollout undo deployment -n <namespace> <deployment-name> --to-revision=2

Resource Adjustments

Increase Memory/CPU

# Edit deployment
kubectl edit deployment -n <namespace> <deployment-name>

# Find resources section and update:
# resources:
#   requests:
#     memory: "256Mi"
#     cpu: "100m"
#   limits:
#     memory: "512Mi"
#     cpu: "500m"

# Or patch directly
kubectl patch deployment -n <namespace> <deployment-name> -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'
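
Hand-editing the patch JSON is error-prone, so the document can be generated from the container name and the new limit. A minimal sketch; the function name memory_limit_patch is hypothetical, and the output has the same shape as the patch above.

```shell
# Hypothetical helper (memory_limit_patch is an illustrative name).
# Prints a strategic-merge patch that sets one container's memory limit.
# $1 = container name, $2 = memory limit (e.g. 1Gi)
memory_limit_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"limits":{"memory":"%s"}}}]}}}}\n' "$1" "$2"
}

# Example (needs cluster access):
#   kubectl patch deployment -n <namespace> <deployment-name> \
#     -p "$(memory_limit_patch <container-name> 1Gi)" > "$LOG_DIR/patch.log" 2>&1
```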

Common Issues and Fixes

Issue: Pod stuck in Pending

Cause: Insufficient resources

Fix:

kubectl top nodes
kubectl describe nodes
# Scale down other pods or add resources to Kind cluster

Issue: CrashLoopBackOff

Cause: Application startup failure

Fix:

kubectl logs -n <namespace> <pod-name> --previous
# Fix configuration, secrets, or application code
# Redeploy

Issue: ImagePullBackOff

Cause: Image not available

Fix:

# Load image into Kind
kind load docker-image <image-name> --name agent-platform

# Or fix image name in deployment
kubectl edit deployment -n <namespace> <deployment-name>

Issue: Service has no endpoints

Cause: Pods not matching service selector

Fix:

# Check service selector
kubectl get svc -n <namespace> <service-name> -o yaml | grep -A5 selector

# Check pod labels
kubectl get pods -n <namespace> --show-labels

# Fix labels in deployment
kubectl edit deployment -n <namespace> <deployment-name>

Issue: Pod can't connect to other services

Cause: Network policy or DNS issues

Fix:

# Test DNS
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>

# Test connectivity
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>

# Check network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy -n <namespace> <policy-name>
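
When a short service name fails to resolve, retrying with the fully qualified name helps separate DNS search-path problems from a missing service. The sketch below builds that name. The function name service_fqdn and the CLUSTER_DOMAIN variable are hypothetical, and cluster.local is assumed as the cluster domain unless overridden.

```shell
# Hypothetical helper (service_fqdn and CLUSTER_DOMAIN are illustrative names).
# Prints <service>.<namespace>.svc.<cluster-domain>.
# $1 = service name, $2 = namespace
service_fqdn() {
  echo "$1.$2.svc.${CLUSTER_DOMAIN:-cluster.local}"
}

# Example (needs cluster access):
#   kubectl exec -n <namespace> <pod-name> -- \
#     nslookup "$(service_fqdn <service-name> <namespace>)" > "$LOG_DIR/dns.log" 2>&1
```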

Pro Tips

Start with describe: kubectl describe pod shows most common issues
Check previous logs: For crashes, --previous is essential
Use debug pods: Temporary pods help test networking
Check events: Recent events often reveal the issue
Verify resources: Memory/CPU limits cause many issues
Test step by step: Isolate each component
Check dependencies: Pods may fail if dependencies aren't ready
Look at sidecars: Istio proxy logs show networking issues
Use exec for testing: Run commands in pods to debug
Restart if stuck: Sometimes restart clears transient issues

Related Skills

k8s:health: Check overall platform health
k8s:logs: Detailed log analysis
kagenti:deploy: Full cluster redeploy if needed
