name: k8s:pods description: Debug and troubleshoot pod issues including crashes, failures, networking, and resource problems
Troubleshoot Pods Skill
This skill provides systematic approaches to debugging pod issues in the Kagenti platform.
Context-Safe Execution (MANDATORY)
All kubectl/oc commands MUST redirect output to files. Commands below are shown in bare form for readability. When executing, always redirect:
export LOG_DIR=/tmp/kagenti/k8s/${CLUSTER:-local}
mkdir -p $LOG_DIR
# Pattern for all kubectl commands:
kubectl <command> > $LOG_DIR/<descriptive-name>.log 2>&1 && echo "OK" || echo "FAIL (see $LOG_DIR/<descriptive-name>.log)"
# Analyze in subagent: Task(subagent_type='Explore') to read log files
# Use subagents for BOTH failure analysis AND verifying expected behavior
When to Use
- Pods are crashlooping or failing
- Pods stuck in Pending, ImagePullBackOff, or other error states
- User reports application not working
- After
kagenti:deployto verify pods are healthy - Investigating resource issues
Quick Pod Status Check
# All pods with status
kubectl get pods -A -o wide
# Only problematic pods
kubectl get pods -A | grep -vE "Running|Completed"
# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20
# Pods with high restart count
kubectl get pods -A | awk '$4 > 3 {print $0}'
# Recent pod events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Common Pod States
Running - Healthy ✓
Pod is running normally. Check if app inside is working correctly.
Pending - Waiting for resources ⏳
# Check why pod is pending
kubectl describe pod <pod-name> -n <namespace>
# Common causes:
# - Insufficient CPU/memory on nodes
# - Unbound PersistentVolumeClaim
# - Node selector not matching any nodes
# - Image pull in progress
# Check node resources
kubectl top nodes
kubectl describe nodes
CrashLoopBackOff - Application crashing ❌
# Check logs before crash
kubectl logs -n <namespace> <pod-name> --previous
# Check current logs
kubectl logs -n <namespace> <pod-name>
# Check pod events
kubectl describe pod -n <namespace> <pod-name>
# Common causes:
# - Application error on startup
# - Missing configuration/secrets
# - Failed liveness probe
# - Dependency not available
ImagePullBackOff - Cannot pull image ❌
# Check pod events for exact error
kubectl describe pod -n <namespace> <pod-name>
# Check if image exists in registry
docker pull <image-name>
# For Kind cluster, load image manually
kind load docker-image <image-name> --name agent-platform
# Check if image is in Kind cluster
docker exec agent-platform-control-plane crictl images | grep <image-name>
# Common causes:
# - Image doesn't exist
# - Wrong image tag
# - No access to registry (auth)
# - Network issues
Error - Container exited with error ❌
# Check exit code and reason
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "State:"
# Check logs
kubectl logs -n <namespace> <pod-name> --previous
# Common exit codes:
# 0 - Success
# 1 - General error
# 137 - SIGKILL (OOM killed)
# 143 - SIGTERM (terminated)
OOMKilled - Out of memory ❌
# Check for OOM in events
kubectl get events -A | grep -i "OOMKilled"
# Check pod memory limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"
# Check actual memory usage
kubectl top pod -n <namespace> <pod-name>
# Fix: Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory and resources.requests.memory
Systematic Troubleshooting
Step 1: Get Pod Details
# Get pod status
kubectl get pod -n <namespace> <pod-name>
# Get full pod description
kubectl describe pod -n <namespace> <pod-name>
# Check pod YAML
kubectl get pod -n <namespace> <pod-name> -o yaml
# Check pod events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Step 2: Check Logs
# Current logs
kubectl logs -n <namespace> <pod-name>
# Previous logs (if crashed)
kubectl logs -n <namespace> <pod-name> --previous
# All containers (including sidecars)
kubectl logs -n <namespace> <pod-name> --all-containers=true
# Specific container
kubectl logs -n <namespace> <pod-name> -c <container-name>
# Follow logs
kubectl logs -n <namespace> <pod-name> -f --tail=20
Step 3: Check Resource Constraints
# Check resource usage
kubectl top pod -n <namespace> <pod-name>
# Check resource limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Requests:"
# Check node resources
kubectl top nodes
Step 4: Check Configuration
# Check environment variables
kubectl describe pod -n <namespace> <pod-name> | grep -A20 "Environment:"
# Check mounted secrets
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Mounts:"
# Verify secret exists
kubectl get secret -n <namespace> <secret-name>
# Check configmap
kubectl get configmap -n <namespace> <configmap-name>
Step 5: Check Networking
# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# Check if pod is in service
kubectl get endpoints -n <namespace> <service-name> -o yaml
# Test connectivity FROM the pod
kubectl exec -n <namespace> <pod-name> -- curl -I http://<service-name>
# Test connectivity TO the pod
kubectl run debug-curl --image=curlimages/curl --rm -it -- \
curl http://<pod-ip>:<port>
# Check network policies
kubectl get networkpolicy -n <namespace>
Component-Specific Troubleshooting
Weather Tool / Weather Service
# Check deployment
kubectl get deployment -n team1 weather-tool
kubectl describe deployment -n team1 weather-tool
# Check pods
kubectl get pods -n team1 -l app=weather-tool
# Check service endpoints
kubectl get endpoints -n team1 weather-tool
# Test MCP endpoint (weather-tool)
kubectl exec -n team1 deployment/weather-tool -- \
curl -I http://localhost:8000/health || echo "Health check failed"
# Check for API errors (weather service)
kubectl logs -n team1 deployment/weather-service | grep -iE "api|error|openai"
Keycloak
# Check if Keycloak is deployment or statefulset
kubectl get deployment -n keycloak keycloak 2>/dev/null || kubectl get statefulset -n keycloak keycloak
# Check pod status
kubectl get pods -n keycloak -l app=keycloak
# Check readiness
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
curl -sf http://localhost:8080/health/ready || echo "Not ready"
# Check PostgreSQL dependency
kubectl get pods -n keycloak -l app=postgresql
kubectl logs -n keycloak deployment/postgresql --tail=50 2>/dev/null
# Common issues:
# - PostgreSQL not ready
# - Database connection failures
# - Memory limits too low (increase to 1Gi)
Platform Operator
# Check operator deployment
kubectl get deployment -n kagenti-system -l control-plane=controller-manager
# Check operator pods
kubectl get pods -n kagenti-system -l control-plane=controller-manager
# Check operator logs for errors
kubectl logs -n kagenti-system -l control-plane=controller-manager | \
grep -iE "error|fail"
# Check Component CRD processing
kubectl get components -A
kubectl describe component -n <namespace> <component-name>
# Check if operator is reconciling
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=50 | \
grep -i "reconcile"
Istio Sidecars
# Check if sidecar is injected
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[*].name}'
# Should show: <app-container> istio-proxy
# Check sidecar status
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[?(@.name=="istio-proxy")].ready}'
# Should show: true
# Check sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy
# Common issues:
# - Sidecar not injected (check namespace label)
# - mTLS errors (check certificates)
# - Connection failures (check virtual services)
Interactive Debugging
Execute Commands in Pod
# Get shell access
kubectl exec -n <namespace> <pod-name> -it -- /bin/sh
# or
kubectl exec -n <namespace> <pod-name> -it -- /bin/bash
# Run specific command
kubectl exec -n <namespace> <pod-name> -- ls -la /app
kubectl exec -n <namespace> <pod-name> -- env
kubectl exec -n <namespace> <pod-name> -- cat /etc/resolv.conf
# Test network connectivity
kubectl exec -n <namespace> <pod-name> -- ping <service-name>
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>
Debug with Temporary Pods
# Create debug pod in same namespace
kubectl run debug-pod -n <namespace> --image=busybox --rm -it -- sh
# Test network connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- \
curl -v http://<service-name>:<port>
# Test DNS resolution
kubectl run debug-dns -n <namespace> --image=busybox --rm -it -- \
nslookup <service-name>
# Check pod-to-pod connectivity
kubectl run debug-net -n <namespace> --image=nicolaka/netshoot --rm -it -- \
curl http://<pod-ip>:<port>
Restart and Recovery
Restart Pod
# Delete pod (deployment will recreate)
kubectl delete pod -n <namespace> <pod-name>
# Restart deployment (all pods)
kubectl rollout restart deployment -n <namespace> <deployment-name>
# Scale to zero and back (forces recreation)
kubectl scale deployment -n <namespace> <deployment-name> --replicas=0
kubectl scale deployment -n <namespace> <deployment-name> --replicas=1
Force Redeploy
# Update deployment to force new pods
kubectl patch deployment -n <namespace> <deployment-name> \
-p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"kubectl.kubernetes.io/restartedAt\":\"$(date +%Y-%m-%dT%H:%M:%S)\"}}}}}"
# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>
Rollback Deployment
# Check deployment history
kubectl rollout history deployment -n <namespace> <deployment-name>
# Rollback to previous version
kubectl rollout undo deployment -n <namespace> <deployment-name>
# Rollback to specific revision
kubectl rollout undo deployment -n <namespace> <deployment-name> --to-revision=2
Resource Adjustments
Increase Memory/CPU
# Edit deployment
kubectl edit deployment -n <namespace> <deployment-name>
# Find resources section and update:
# resources:
# requests:
# memory: "256Mi"
# cpu: "100m"
# limits:
# memory: "512Mi"
# cpu: "500m"
# Or patch directly
kubectl patch deployment -n <namespace> <deployment-name> -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'
Common Issues and Fixes
Issue: Pod stuck in Pending
Cause: Insufficient resources
Fix:
kubectl top nodes
kubectl describe nodes
# Scale down other pods or add resources to Kind cluster
Issue: CrashLoopBackOff
Cause: Application startup failure
Fix:
kubectl logs -n <namespace> <pod-name> --previous
# Fix configuration, secrets, or application code
# Redeploy
Issue: ImagePullBackOff
Cause: Image not available
Fix:
# Load image into Kind
kind load docker-image <image-name> --name agent-platform
# Or fix image name in deployment
kubectl edit deployment -n <namespace> <deployment-name>
Issue: Service has no endpoints
Cause: Pods not matching service selector
Fix:
# Check service selector
kubectl get svc -n <namespace> <service-name> -o yaml | grep -A5 selector
# Check pod labels
kubectl get pods -n <namespace> --show-labels
# Fix labels in deployment
kubectl edit deployment -n <namespace> <deployment-name>
Issue: Pod can't connect to other services
Cause: Network policy or DNS issues
Fix:
# Test DNS
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>
# Test connectivity
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>
# Check network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy -n <namespace> <policy-name>
Pro Tips
-
Start with describe:
kubectl describe podshows most common issues -
Check previous logs: For crashes,
--previousis essential - Use debug pods: Temporary pods help test networking
- Check events: Recent events often reveal the issue
- Verify resources: Memory/CPU limits cause many issues
- Test step by step: Isolate each component
- Check dependencies: Pods may fail if dependencies aren't ready
- Look at sidecars: Istio proxy logs show networking issues
- Use exec for testing: Run commands in pods to debug
- Restart if stuck: Sometimes restart clears transient issues
Related Skills
- k8s:health: Check overall platform health
- k8s:logs: Detailed log analysis
- kagenti:deploy: Full cluster redeploy if needed
chat Comments (0)
Sign in to join the discussion and leave a comment.
Skill Details
GitHub Stars
98
GitHub Forks
56
Created
Mar 2026
Last Updated
3 months ago
tools
tools debugging
Related Skills
Build your own?
Join 12,000+ developers contributing to the Claude ecosystem.
No comments yet. Be the first to share your thoughts!