k8spods | Skill Performance & Reviews | TopRankSkills

TopRank Skills

Home / Skills / tools / k8spods

k8spods

maintained by kagenti

star 98 account_tree 56 verified_user MIT License
bolt View GitHub

name: k8s:pods description: Debug and troubleshoot pod issues including crashes, failures, networking, and resource problems

Troubleshoot Pods Skill

This skill provides systematic approaches to debugging pod issues in the Kagenti platform.

Context-Safe Execution (MANDATORY)

All kubectl/oc commands MUST redirect output to files. Commands below are shown in bare form for readability. When executing, always redirect:

export LOG_DIR=/tmp/kagenti/k8s/${CLUSTER:-local}
mkdir -p $LOG_DIR

# Pattern for all kubectl commands:
kubectl <command> > $LOG_DIR/<descriptive-name>.log 2>&1 && echo "OK" || echo "FAIL (see $LOG_DIR/<descriptive-name>.log)"

# Analyze in subagent: Task(subagent_type='Explore') to read log files
# Use subagents for BOTH failure analysis AND verifying expected behavior

When to Use

  • Pods are crashlooping or failing
  • Pods stuck in Pending, ImagePullBackOff, or other error states
  • User reports application not working
  • After kagenti:deploy to verify pods are healthy
  • Investigating resource issues

Quick Pod Status Check

# All pods with status
kubectl get pods -A -o wide

# Only problematic pods
kubectl get pods -A | grep -vE "Running|Completed"

# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Pods with high restart count
kubectl get pods -A | awk '$4 > 3 {print $0}'

# Recent pod events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

Common Pod States

Running - Healthy ✓

Pod is running normally. Check if app inside is working correctly.

Pending - Waiting for resources ⏳

# Check why pod is pending
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory on nodes
# - Unbound PersistentVolumeClaim
# - Node selector not matching any nodes
# - Image pull in progress

# Check node resources
kubectl top nodes
kubectl describe nodes

CrashLoopBackOff - Application crashing ❌

# Check logs before crash
kubectl logs -n <namespace> <pod-name> --previous

# Check current logs
kubectl logs -n <namespace> <pod-name>

# Check pod events
kubectl describe pod -n <namespace> <pod-name>

# Common causes:
# - Application error on startup
# - Missing configuration/secrets
# - Failed liveness probe
# - Dependency not available

ImagePullBackOff - Cannot pull image ❌

# Check pod events for exact error
kubectl describe pod -n <namespace> <pod-name>

# Check if image exists in registry
docker pull <image-name>

# For Kind cluster, load image manually
kind load docker-image <image-name> --name agent-platform

# Check if image is in Kind cluster
docker exec agent-platform-control-plane crictl images | grep <image-name>

# Common causes:
# - Image doesn't exist
# - Wrong image tag
# - No access to registry (auth)
# - Network issues

Error - Container exited with error ❌

# Check exit code and reason
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "State:"

# Check logs
kubectl logs -n <namespace> <pod-name> --previous

# Common exit codes:
# 0 - Success
# 1 - General error
# 137 - SIGKILL (OOM killed)
# 143 - SIGTERM (terminated)

OOMKilled - Out of memory ❌

# Check for OOM in events
kubectl get events -A | grep -i "OOMKilled"

# Check pod memory limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"

# Check actual memory usage
kubectl top pod -n <namespace> <pod-name>

# Fix: Increase memory limits
kubectl edit deployment -n <namespace> <deployment-name>
# Increase resources.limits.memory and resources.requests.memory

Systematic Troubleshooting

Step 1: Get Pod Details

# Get pod status
kubectl get pod -n <namespace> <pod-name>

# Get full pod description
kubectl describe pod -n <namespace> <pod-name>

# Check pod YAML
kubectl get pod -n <namespace> <pod-name> -o yaml

# Check pod events
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Step 2: Check Logs

# Current logs
kubectl logs -n <namespace> <pod-name>

# Previous logs (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# All containers (including sidecars)
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Specific container
kubectl logs -n <namespace> <pod-name> -c <container-name>

# Follow logs
kubectl logs -n <namespace> <pod-name> -f --tail=20

Step 3: Check Resource Constraints

# Check resource usage
kubectl top pod -n <namespace> <pod-name>

# Check resource limits
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Limits:"
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Requests:"

# Check node resources
kubectl top nodes

Step 4: Check Configuration

# Check environment variables
kubectl describe pod -n <namespace> <pod-name> | grep -A20 "Environment:"

# Check mounted secrets
kubectl describe pod -n <namespace> <pod-name> | grep -A10 "Mounts:"

# Verify secret exists
kubectl get secret -n <namespace> <secret-name>

# Check configmap
kubectl get configmap -n <namespace> <configmap-name>

Step 5: Check Networking

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Check if pod is in service
kubectl get endpoints -n <namespace> <service-name> -o yaml

# Test connectivity FROM the pod
kubectl exec -n <namespace> <pod-name> -- curl -I http://<service-name>

# Test connectivity TO the pod
kubectl run debug-curl --image=curlimages/curl --rm -it -- \
  curl http://<pod-ip>:<port>

# Check network policies
kubectl get networkpolicy -n <namespace>

Component-Specific Troubleshooting

Weather Tool / Weather Service

# Check deployment
kubectl get deployment -n team1 weather-tool
kubectl describe deployment -n team1 weather-tool

# Check pods
kubectl get pods -n team1 -l app=weather-tool

# Check service endpoints
kubectl get endpoints -n team1 weather-tool

# Test MCP endpoint (weather-tool)
kubectl exec -n team1 deployment/weather-tool -- \
  curl -I http://localhost:8000/health || echo "Health check failed"

# Check for API errors (weather service)
kubectl logs -n team1 deployment/weather-service | grep -iE "api|error|openai"

Keycloak

# Check if Keycloak is deployment or statefulset
kubectl get deployment -n keycloak keycloak 2>/dev/null || kubectl get statefulset -n keycloak keycloak

# Check pod status
kubectl get pods -n keycloak -l app=keycloak

# Check readiness
kubectl exec -n keycloak deployment/keycloak -c keycloak -- \
  curl -sf http://localhost:8080/health/ready || echo "Not ready"

# Check PostgreSQL dependency
kubectl get pods -n keycloak -l app=postgresql
kubectl logs -n keycloak deployment/postgresql --tail=50 2>/dev/null

# Common issues:
# - PostgreSQL not ready
# - Database connection failures
# - Memory limits too low (increase to 1Gi)

Platform Operator

# Check operator deployment
kubectl get deployment -n kagenti-system -l control-plane=controller-manager

# Check operator pods
kubectl get pods -n kagenti-system -l control-plane=controller-manager

# Check operator logs for errors
kubectl logs -n kagenti-system -l control-plane=controller-manager | \
  grep -iE "error|fail"

# Check Component CRD processing
kubectl get components -A
kubectl describe component -n <namespace> <component-name>

# Check if operator is reconciling
kubectl logs -n kagenti-system -l control-plane=controller-manager --tail=50 | \
  grep -i "reconcile"

Istio Sidecars

# Check if sidecar is injected
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.spec.containers[*].name}'
# Should show: <app-container> istio-proxy

# Check sidecar status
kubectl get pod -n <namespace> <pod-name> -o jsonpath='{.status.containerStatuses[?(@.name=="istio-proxy")].ready}'
# Should show: true

# Check sidecar logs
kubectl logs -n <namespace> <pod-name> -c istio-proxy

# Common issues:
# - Sidecar not injected (check namespace label)
# - mTLS errors (check certificates)
# - Connection failures (check virtual services)

Interactive Debugging

Execute Commands in Pod

# Get shell access
kubectl exec -n <namespace> <pod-name> -it -- /bin/sh
# or
kubectl exec -n <namespace> <pod-name> -it -- /bin/bash

# Run specific command
kubectl exec -n <namespace> <pod-name> -- ls -la /app
kubectl exec -n <namespace> <pod-name> -- env
kubectl exec -n <namespace> <pod-name> -- cat /etc/resolv.conf

# Test network connectivity
kubectl exec -n <namespace> <pod-name> -- ping <service-name>
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>

Debug with Temporary Pods

# Create debug pod in same namespace
kubectl run debug-pod -n <namespace> --image=busybox --rm -it -- sh

# Test network connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- \
  curl -v http://<service-name>:<port>

# Test DNS resolution
kubectl run debug-dns -n <namespace> --image=busybox --rm -it -- \
  nslookup <service-name>

# Check pod-to-pod connectivity
kubectl run debug-net -n <namespace> --image=nicolaka/netshoot --rm -it -- \
  curl http://<pod-ip>:<port>

Restart and Recovery

Restart Pod

# Delete pod (deployment will recreate)
kubectl delete pod -n <namespace> <pod-name>

# Restart deployment (all pods)
kubectl rollout restart deployment -n <namespace> <deployment-name>

# Scale to zero and back (forces recreation)
kubectl scale deployment -n <namespace> <deployment-name> --replicas=0
kubectl scale deployment -n <namespace> <deployment-name> --replicas=1

Force Redeploy

# Update deployment to force new pods
kubectl patch deployment -n <namespace> <deployment-name> \
  -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"kubectl.kubernetes.io/restartedAt\":\"$(date +%Y-%m-%dT%H:%M:%S)\"}}}}}"

# Check rollout status
kubectl rollout status deployment -n <namespace> <deployment-name>

Rollback Deployment

# Check deployment history
kubectl rollout history deployment -n <namespace> <deployment-name>

# Rollback to previous version
kubectl rollout undo deployment -n <namespace> <deployment-name>

# Rollback to specific revision
kubectl rollout undo deployment -n <namespace> <deployment-name> --to-revision=2

Resource Adjustments

Increase Memory/CPU

# Edit deployment
kubectl edit deployment -n <namespace> <deployment-name>

# Find resources section and update:
# resources:
#   requests:
#     memory: "256Mi"
#     cpu: "100m"
#   limits:
#     memory: "512Mi"
#     cpu: "500m"

# Or patch directly
kubectl patch deployment -n <namespace> <deployment-name> -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'

Common Issues and Fixes

Issue: Pod stuck in Pending

Cause: Insufficient resources

Fix:

kubectl top nodes
kubectl describe nodes
# Scale down other pods or add resources to Kind cluster

Issue: CrashLoopBackOff

Cause: Application startup failure

Fix:

kubectl logs -n <namespace> <pod-name> --previous
# Fix configuration, secrets, or application code
# Redeploy

Issue: ImagePullBackOff

Cause: Image not available

Fix:

# Load image into Kind
kind load docker-image <image-name> --name agent-platform

# Or fix image name in deployment
kubectl edit deployment -n <namespace> <deployment-name>

Issue: Service has no endpoints

Cause: Pods not matching service selector

Fix:

# Check service selector
kubectl get svc -n <namespace> <service-name> -o yaml | grep -A5 selector

# Check pod labels
kubectl get pods -n <namespace> --show-labels

# Fix labels in deployment
kubectl edit deployment -n <namespace> <deployment-name>

Issue: Pod can't connect to other services

Cause: Network policy or DNS issues

Fix:

# Test DNS
kubectl exec -n <namespace> <pod-name> -- nslookup <service-name>

# Test connectivity
kubectl exec -n <namespace> <pod-name> -- curl http://<service-name>:<port>

# Check network policies
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy -n <namespace> <policy-name>

Pro Tips

  1. Start with describe: kubectl describe pod shows most common issues
  2. Check previous logs: For crashes, --previous is essential
  3. Use debug pods: Temporary pods help test networking
  4. Check events: Recent events often reveal the issue
  5. Verify resources: Memory/CPU limits cause many issues
  6. Test step by step: Isolate each component
  7. Check dependencies: Pods may fail if dependencies aren't ready
  8. Look at sidecars: Istio proxy logs show networking issues
  9. Use exec for testing: Run commands in pods to debug
  10. Restart if stuck: Sometimes restart clears transient issues

Related Skills

  • k8s:health: Check overall platform health
  • k8s:logs: Detailed log analysis
  • kagenti:deploy: Full cluster redeploy if needed

chat Comments (0)

chat_bubble_outline

No comments yet. Be the first to share your thoughts!

Skill Details

GitHub Stars 98
GitHub Forks 56
Created Mar 2026
Last Updated 3个月前
tools tools debugging

Related Skills

fabric
chevron_right
typescript-expert
chevron_right
break-loop
chevron_right
burp-suite
chevron_right
page-behavior-audit
chevron_right

Build your own?

Join 12,000+ developers contributing to the Claude ecosystem.