name: On-Call Playbooks description: Comprehensive runbooks and playbooks for on-call incident response

On-Call Playbooks and Runbooks

Overview

Playbooks and runbooks are essential tools for on-call engineers, providing step-by-step guidance for responding to common incidents. Well-written runbooks reduce Mean Time To Recovery (MTTR), enable consistent responses, and help less experienced engineers handle complex situations.

Core Principle: "Document once, execute many times. Make incident response repeatable and reliable."

1. Purpose of Playbooks and Runbooks

Why They Matter

Benefits:
✓ Faster incident response (no need to figure out from scratch)
✓ Consistent response (same steps every time)
✓ Knowledge sharing (junior engineers can handle incidents)
✓ Reduced stress (clear guidance during high-pressure situations)
✓ Continuous improvement (update based on learnings)
✓ Onboarding tool (new engineers learn system architecture)

Without Runbooks:
❌ Every incident is a new adventure
❌ Tribal knowledge (only seniors know what to do)
❌ Inconsistent responses
❌ Longer MTTR
❌ Higher stress

When to Use

Use runbooks for:
✓ Common incidents (happens monthly or more)
✓ Complex procedures (multiple steps)
✓ High-risk operations (database failover)
✓ Time-sensitive incidents (every minute counts)
✓ Knowledge preservation (expert leaving team)

Don't need runbooks for:
✗ One-time incidents (not repeatable)
✗ Trivial issues (restart service)
✗ Well-documented in product docs

2. Playbook vs Runbook Distinction

Playbook

Definition: High-level strategy and decision-making guide

Characteristics:
- Strategic approach
- Decision trees
- When to escalate
- Communication templates
- Multiple possible paths

Example: "Incident Response Playbook"
- How to assess severity
- When to create war room
- Communication cadence
- Escalation criteria

Runbook

Definition: Step-by-step tactical procedures

Characteristics:
- Specific commands
- Exact steps to follow
- Copy-pasteable code
- Expected outcomes
- Troubleshooting steps

Example: "Database Failover Runbook"
1. Check primary database health
2. Verify replica is in sync
3. Promote replica to primary
4. Update application config
5. Verify traffic routing

Relationship

Playbook (Strategy)
  ├─ Runbook 1 (Tactics)
  ├─ Runbook 2 (Tactics)
  └─ Runbook 3 (Tactics)

Example:
"High Error Rate Playbook"
  ├─ Check Recent Deployments Runbook
  ├─ Database Performance Runbook
  └─ Rollback Deployment Runbook

3. Essential Playbooks

Playbook 1: Service is Down

# Playbook: Service is Down

## Symptoms
- Health check failing
- 100% error rate
- No traffic reaching service
- Users reporting "site is down"

## Initial Triage (2 minutes)
1. Verify service is actually down
   ```bash
   curl -i https://api.example.com/health

Check if pods/instances are running
```
kubectl get pods -l app=api-service
```

Check recent changes

kubectl rollout history deployment/api-service

Decision Tree

Is service responding to health checks?
├─ NO → Are pods running?
│   ├─ NO → Pods crashed
│   │   └─ Go to: Pod Crash Runbook
│   └─ YES → Network/routing issue
│       └─ Go to: Network Troubleshooting Runbook
└─ YES → Partial outage
    └─ Go to: High Error Rate Playbook

Common Causes

Recent deployment broke service
Database connection failed
Out of memory/CPU
Configuration error
Network partition

Communication

Severity: SEV0 or SEV1
Status page: "Major outage"
Update frequency: Every 15 minutes


### Playbook 2: Database is Slow

```markdown
# Playbook: Database is Slow

## Symptoms
- Query latency > 2x baseline
- Connection pool exhaustion
- Timeouts on database operations
- Application slow/timing out

## Initial Triage (3 minutes)
1. Check database metrics
   ```sql
   SELECT * FROM pg_stat_activity WHERE state = 'active';

Identify slow queries

SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

Check resource usage

# CPU, memory, disk I/O
kubectl top pods -l app=postgres

Decision Tree

Are there long-running queries?
├─ YES → Are they legitimate?
│   ├─ YES → Optimize or kill
│   └─ NO → Kill immediately
└─ NO → Check resource usage
    ├─ High CPU → Scale up or optimize
    ├─ High I/O → Check disk, add read replicas
    └─ Normal → Check connection pool

Runbooks to Execute

Kill Long-Running Queries
Database Failover
Add Read Replica


### Playbook 3: High Error Rate

```markdown
# Playbook: High Error Rate

## Symptoms
- Error rate > 5% (baseline: <0.1%)
- Alerts firing for 5xx errors
- Users reporting errors

## Initial Triage (3 minutes)
1. Check error rate trend
   - Is it increasing, stable, or decreasing?

2. Identify affected endpoints
   ```bash
   kubectl logs -l app=api --since=5m | grep "HTTP/1.1 5" | awk '{print $7}' | sort | uniq -c | sort -rn

Check recent deployments
```
git log --since="1 hour ago" --oneline
```

Decision Tree

Error rate > 50%?
├─ YES → Recent deployment?
│   ├─ YES → ROLLBACK immediately
│   └─ NO → Check dependencies
└─ NO → Specific endpoint?
    ├─ YES → Disable endpoint, investigate
    └─ NO → Monitor, investigate root cause

Quick Actions

Rollback if recent deployment
Disable failing endpoint
Scale up if resource constrained
Failover if database issue


### Playbook 4: Disk Full

```markdown
# Playbook: Disk Full

## Symptoms
- Disk usage > 95%
- Write operations failing
- Application crashes
- "No space left on device" errors

## Initial Triage (2 minutes)
1. Check disk usage
   ```bash
   df -h
   du -sh /* | sort -rh | head -10

Identify large files/directories

find / -type f -size +1G -exec ls -lh {} \;

Immediate Actions

Clear logs

find /var/log -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d

Clear temp files
```
rm -rf /tmp/*
```
Clear old Docker images
```
docker system prune -af --volumes
```

Long-term Fix

Implement log rotation
Set up disk monitoring
Increase disk size
Archive old data


### Playbook 5: Memory Leak

```markdown
# Playbook: Memory Leak

## Symptoms
- Memory usage gradually increasing
- OOM (Out of Memory) kills
- Slow performance
- Frequent restarts

## Initial Triage (5 minutes)
1. Check memory trend
   ```bash
   kubectl top pods -l app=api --sort-by=memory

Check for OOM kills

kubectl get pods -l app=api | grep OOMKilled
kubectl describe pod <pod-name>

Review recent code changes

git log --since="1 week ago" --grep="cache\|memory"

Immediate Actions

Restart affected pods
```
kubectl rollout restart deployment/api
```

Increase memory limits (temporary)

kubectl set resources deployment/api --limits=memory=4Gi

Investigation

Profile memory usage
Review cache implementation
Check for connection leaks
Analyze heap dumps


### Playbook 6: Certificate Expiration

```markdown
# Playbook: Certificate Expiration

## Symptoms
- SSL/TLS errors
- "Certificate expired" warnings
- HTTPS connections failing

## Prevention (Before Expiration)
1. Check certificate expiration
   ```bash
   echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null | openssl x509 -noout -dates

Set up monitoring
```
# Alert if cert expires in < 30 days
```

Immediate Actions (After Expiration)

Renew certificate
```
certbot renew --force-renewal
```

Update Kubernetes secret

kubectl create secret tls example-tls --cert=cert.pem --key=key.pem --dry-run=client -o yaml | kubectl apply -f -

Restart ingress controller

kubectl rollout restart deployment/nginx-ingress-controller

Long-term Fix

Automate certificate renewal (cert-manager)
Set up expiration alerts (30, 14, 7 days)


### Playbook 7: DDoS Attack

```markdown
# Playbook: DDoS Attack

## Symptoms
- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access service
- High bandwidth usage
- Resource exhaustion

## Initial Triage (3 minutes)
1. Confirm DDoS vs legitimate traffic
   ```bash
   # Check traffic sources
   tail -f /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Identify attack pattern
- Single IP or distributed?
- Specific endpoint targeted?
- Request pattern (same User-Agent, etc.)

Immediate Actions

Enable rate limiting

limit_req_zone $binary_remote_addr zone=ddos:10m rate=10r/s;
limit_req zone=ddos burst=20 nodelay;

Block attacking IPs

iptables -A INPUT -s <attacker-ip> -j DROP

Enable CloudFlare "Under Attack" mode
```
# Or equivalent CDN protection
```

Scale up infrastructure (if needed)

kubectl scale deployment/api --replicas=50

Escalation

Contact CDN provider (CloudFlare, Akamai)
Contact ISP for upstream filtering
Consider AWS Shield / GCP Cloud Armor


### Playbook 8: Data Loss Incident

```markdown
# Playbook: Data Loss Incident

## Symptoms
- Data missing from database
- User reports data disappeared
- Accidental DELETE/DROP executed

## STOP THE BLEEDING (Immediate)
1. **DO NOT** make any more changes
2. Stop all write operations
   ```bash
   # Put database in read-only mode
   ALTER DATABASE mydb SET default_transaction_read_only = on;

Identify scope of data loss

SELECT COUNT(*) FROM users WHERE deleted_at > NOW() - INTERVAL '1 hour';

Assessment (5 minutes)

What data was lost?
When did it happen?
How much data?
Is backup available?

Recovery Options

Option 1: Point-in-Time Recovery

# Restore from backup to specific time
pg_restore --dbname=mydb_recovery backup.dump

Option 2: Replay from WAL

# PostgreSQL WAL replay
recovery_target_time = '2024-01-15 10:00:00'

Option 3: Restore from Replica

# If replica has data, promote it

Communication

Severity: SEV0 (data loss)
Notify: All stakeholders immediately
Transparency: Explain what happened, what we're doing


## 4. Runbook Structure

### Complete Runbook Template

```markdown
# Runbook: [Title]

## Metadata
- **ID**: RB-001
- **Owner**: @team-platform
- **Last Updated**: 2024-01-15
- **Tested**: 2024-01-10
- **Severity**: SEV1
- **Estimated Time**: 15 minutes

## Symptoms
Clear description of what the on-call engineer will observe:
- Specific error messages
- Metric thresholds
- User-reported issues
- Alert names

## Triage Steps
Quick diagnostic steps to confirm this is the right runbook:

1. **Check [specific metric]**
   ```bash
   # Command to run

Expected: [what you should see] Actual: [what you're seeing if this runbook applies]

Verify [condition]
```
# Command to run
```

Common Causes

Ranked by likelihood:

Recent deployment (70%)
Database issue (20%)
External dependency (10%)

Resolution Steps

Step 1: [Action]

Time estimate: 2 minutes

Commands:

# Exact commands to run
kubectl get pods -l app=api-service

Expected outcome:

NAME                          READY   STATUS    RESTARTS   AGE
api-service-7d9f8b6c4-abc12   1/1     Running   0          5m

If this fails:

Try: [alternative approach]
Or: Escalate to [team/person]

Step 2: [Action]

Time estimate: 3 minutes

[Continue with detailed steps...]

Rollback Procedures

If resolution fails, how to undo changes:

# Rollback commands
kubectl rollout undo deployment/api-service

Escalation Path

When to escalate and to whom:

If not resolved in 15 minutes → Escalate to @senior-engineer
If database-related → Escalate to @database-team
If affects payments → Escalate to @payments-team

Related Runbooks

Verification

How to confirm the issue is resolved:

Check error rate
```
# Command to verify
```
Expected: < 0.1%
Check user reports
- Support tickets should stop coming in
Monitor for 10 minutes
- Ensure issue doesn't recur

Post-Incident

Update incident timeline
Document any deviations from runbook
Create postmortem ticket (if SEV0/1)
Update runbook if needed

Notes

Additional context, gotchas, or tips:

Note: This issue often recurs after 24 hours
Tip: Check CloudFlare cache if API seems fine but users report issues
Warning: Don't run this command in production without backup


## 5. Runbook Best Practices

### 1. Step-by-Step Instructions

```markdown
❌ Bad:
"Fix the database connection issue"

✓ Good:
"1. Check database connectivity:
   ```bash
   psql -h db.example.com -U app -c 'SELECT 1'

Expected output: 1

If connection fails, check credentials:

kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d

Verify database is running:
```
kubectl get pods -l app=postgres
```"
```


### 2. Command Examples (Copy-Pasteable)

```markdown
✓ Include full commands with all flags:
```bash
kubectl logs deployment/api-service \
  --tail=100 \
  --since=15m \
  --all-containers=true

✓ Include expected output:

Expected:
2024-01-15 10:00:00 INFO Server started on port 8080

✓ Include error output:

If you see:
Error: connection refused
Then: Database is down, proceed to Step 5


### 3. Decision Points

```markdown
## Step 3: Check Error Rate

```bash
curl "https://grafana.example.com/api/metrics/error-rate"

Decision:

If error rate > 50% → Go to Step 4 (Rollback)
If error rate 10-50% → Go to Step 5 (Investigate)
If error rate < 10% → Go to Step 6 (Monitor)


### 4. Expected Outcomes

```markdown
## Step 2: Restart Service

```bash
kubectl rollout restart deployment/api-service

Expected outcome (within 2 minutes):

All pods show Running status
Error rate drops below 1%
Latency returns to < 200ms

If outcome not achieved:

Wait additional 3 minutes
If still failing, proceed to Step 3 (Rollback)
If pods crash, check logs: kubectl logs <pod-name> --previous


### 5. Time Estimates

```markdown
## Resolution Steps

### Step 1: Identify Issue (2 minutes)
[Commands...]

### Step 2: Apply Fix (5 minutes)
[Commands...]

### Step 3: Verify (3 minutes)
[Commands...]

**Total estimated time**: 10 minutes
**If exceeds 15 minutes**: Escalate to senior engineer

6. Runbook Organization and Discoverability

Directory Structure

runbooks/
├── README.md (index of all runbooks)
├── infrastructure/
│   ├── kubernetes-pod-crashloop.md
│   ├── disk-full.md
│   └── network-partition.md
├── database/
│   ├── postgres-connection-pool.md
│   ├── postgres-slow-queries.md
│   └── postgres-failover.md
├── application/
│   ├── api-high-error-rate.md
│   ├── memory-leak.md
│   └── deployment-rollback.md
└── security/
    ├── ddos-attack.md
    ├── certificate-expiration.md
    └── security-breach.md

Runbook Index

# Runbook Index

## By Symptom
- **Service is down** → [Pod Crashloop](infrastructure/kubernetes-pod-crashloop.md)
- **High error rate** → [API Errors](application/api-high-error-rate.md)
- **Slow database** → [Slow Queries](database/postgres-slow-queries.md)
- **Disk full** → [Disk Full](infrastructure/disk-full.md)

## By Alert Name
- `HighErrorRate` → [API Errors](application/api-high-error-rate.md)
- `PodCrashLooping` → [Pod Crashloop](infrastructure/kubernetes-pod-crashloop.md)
- `DatabaseSlow` → [Slow Queries](database/postgres-slow-queries.md)

## By Severity
- **SEV0**: [Data Loss](database/data-loss.md), [Security Breach](security/security-breach.md)
- **SEV1**: [Service Down](infrastructure/service-down.md), [Database Failover](database/postgres-failover.md)
- **SEV2**: [High Latency](application/high-latency.md), [Disk Full](infrastructure/disk-full.md)

## Recently Updated
- 2024-01-15: [Pod Crashloop](infrastructure/kubernetes-pod-crashloop.md)
- 2024-01-14: [API Errors](application/api-high-error-rate.md)

Search Tags

# Runbook: PostgreSQL Connection Pool Exhausted

**Tags**: #database #postgresql #connection-pool #sev1
**Alerts**: `DatabaseConnectionPoolExhausted`, `HighDatabaseLatency`
**Symptoms**: timeout, connection refused, pool exhausted

7. Runbook Versioning

Version Control

# Runbook: Database Failover

**Version**: 2.1.0
**Last Updated**: 2024-01-15
**Author**: @alice
**Changelog**:
- 2.1.0 (2024-01-15): Added automated failover steps
- 2.0.0 (2024-01-01): Complete rewrite for Kubernetes
- 1.5.0 (2023-12-01): Added rollback procedure
- 1.0.0 (2023-10-01): Initial version

**Breaking Changes in 2.0.0**:
- Commands changed from Docker to Kubernetes
- Failover now automated (manual override available)

Git-Based Versioning

# Track runbooks in Git
runbooks/
├── .git/
├── database/
│   └── failover.md
└── CHANGELOG.md

# View runbook history
git log --follow runbooks/database/failover.md

# Compare versions
git diff v1.0.0 v2.0.0 runbooks/database/failover.md

8. Runbook Testing (DR Drills)

Test Schedule

Monthly: Test critical runbooks (SEV0/1)
Quarterly: Test all runbooks
Annually: Full disaster recovery drill

Example Schedule:
- Week 1: Database failover
- Week 2: Service rollback
- Week 3: DDoS response
- Week 4: Data recovery

Test Procedure

# Runbook Test: Database Failover

**Date**: 2024-01-15
**Tester**: @bob
**Environment**: Staging

## Test Steps
1. ✅ Follow runbook exactly as written
2. ✅ Time each step
3. ✅ Note any deviations
4. ✅ Verify expected outcomes

## Results
- **Total Time**: 12 minutes (target: 15 minutes) ✅
- **Success**: Yes ✅
- **Deviations**: Step 3 command had typo (fixed)

## Issues Found
1. Step 3 command missing `--namespace` flag
2. Expected output in Step 5 was outdated
3. Escalation path unclear

## Actions
- [ ] Fix command in Step 3
- [ ] Update expected output in Step 5
- [ ] Clarify escalation path
- [ ] Retest next month

9. Auto-Remediation vs Manual Runbooks

When to Automate

Automate if:
✓ Happens frequently (weekly or more)
✓ Clear root cause and fix
✓ Low risk of making things worse
✓ Can be fully automated
✓ Saves significant time

Examples:
- Restart crashed pod
- Clear disk space (delete old logs)
- Scale up on high CPU
- Renew expiring certificates

When to Keep Manual

Keep manual if:
✓ Requires judgment/decision-making
✓ High risk (data loss, security)
✓ Rare occurrence
✓ Complex troubleshooting needed
✓ Regulatory/compliance requires human approval

Examples:
- Database failover
- Data recovery
- Security incident response
- Major architecture changes

Hybrid Approach

// Auto-remediation with human oversight
async function autoRemediate(alert: Alert) {
  // Attempt automatic fix
  const result = await attemptAutoFix(alert);

  if (result.success) {
    // Fixed automatically
    await notifySlack(`✅ Auto-remediated: ${alert.name}`);
    await createTicket(alert, 'auto-resolved');
  } else {
    // Failed, escalate to human
    await pageOnCall(alert);
    await notifySlack(`⚠️ Auto-remediation failed: ${alert.name}. Paging on-call.`);
  }
}

10. Runbook Tools

PagerDuty Runbooks

# PagerDuty runbook integration
service:
  name: api-service
  escalation_policy: engineering
  incident_urgency_rule:
    type: constant
    urgency: high
  auto_resolve_timeout: 14400
  acknowledgement_timeout: 600
  runbook_url: https://wiki.example.com/runbooks/api-service

Confluence / Notion

Advantages:
✓ Rich formatting
✓ Easy collaboration
✓ Version history
✓ Search functionality
✓ Comments and discussions

Disadvantages:
✗ Not in version control
✗ Requires login
✗ Can become outdated

GitHub Wikis

Advantages:
✓ Version controlled
✓ Markdown format
✓ Easy to update (PR workflow)
✓ Free and accessible

Disadvantages:
✗ Less rich formatting
✗ Requires Git knowledge

Internal Tools

// Custom runbook platform
interface Runbook {
  id: string;
  title: string;
  severity: string;
  steps: RunbookStep[];
  lastTested: Date;
  owner: string;
}

interface RunbookStep {
  number: number;
  title: string;
  description: string;
  commands: string[];
  expectedOutcome: string;
  timeEstimate: number; // minutes
}

// Runbook execution tracking
async function executeRunbook(runbookId: string, incidentId: string) {
  const runbook = await getRunbook(runbookId);
  const execution = await createExecution(runbookId, incidentId);

  for (const step of runbook.steps) {
    const startTime = Date.now();
    
    // Show step to engineer
    await displayStep(step);
    
    // Wait for confirmation
    await waitForConfirmation();
    
    // Record execution time
    const duration = Date.now() - startTime;
    await recordStepExecution(execution.id, step.number, duration);
  }

  await completeExecution(execution.id);
}

11. Runbook Maintenance (Keep Updated)

Maintenance Schedule

After each incident:
- Update runbook if steps changed
- Add new troubleshooting tips
- Document what worked/didn't work

Monthly:
- Review runbook usage stats
- Update outdated screenshots
- Test critical runbooks

Quarterly:
- Full runbook audit
- Archive unused runbooks
- Create runbooks for new common issues

Runbook Health Metrics

interface RunbookMetrics {
  id: string;
  title: string;
  lastUsed: Date;
  usageCount: number;
  lastTested: Date;
  lastUpdated: Date;
  averageExecutionTime: number;
  successRate: number;
  feedbackScore: number; // 1-5
}

// Identify stale runbooks
function findStaleRunbooks(runbooks: RunbookMetrics[]): RunbookMetrics[] {
  const sixMonthsAgo = new Date();
  sixMonthsAgo.setMonth(sixMonthsAgo.getMonth() - 6);

  return runbooks.filter(rb =>
    rb.lastUpdated < sixMonthsAgo ||
    rb.lastTested < sixMonthsAgo
  );
}

12. Integration with Monitoring (Links from Alerts)

Alert to Runbook Linking

# Prometheus alert with runbook link
groups:
  - name: api-service
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://runbooks.example.com/api-high-error-rate"

PagerDuty Integration

{
  "incident": {
    "title": "High Error Rate on API Service",
    "service": {
      "id": "PSERVICE1"
    },
    "body": {
      "type": "incident_body",
      "details": "Error rate: 15%"
    },
    "incident_key": "api-high-error-rate",
    "custom_details": {
      "runbook": "https://runbooks.example.com/api-high-error-rate",
      "dashboard": "https://grafana.example.com/d/api-service"
    }
  }
}

13. Example Runbooks

Runbook 1: Kubernetes Pod CrashLoop

# Runbook: Kubernetes Pod CrashLoopBackOff

## Metadata
- **ID**: RB-K8S-001
- **Severity**: SEV1
- **Estimated Time**: 10 minutes
- **Owner**: @platform-team

## Symptoms
- Alert: `PodCrashLooping`
- Pod status: `CrashLoopBackOff`
- Application unavailable or degraded

## Triage (2 minutes)

### Step 1: Identify crashing pods
```bash
kubectl get pods -l app=api-service | grep CrashLoopBackOff

Step 2: Check pod logs

# Current logs
kubectl logs <pod-name>

# Previous container logs (if restarted)
kubectl logs <pod-name> --previous

Step 3: Check pod events

kubectl describe pod <pod-name>

Common Causes

1. Application Error (60%)

Symptoms in logs:

Error: Cannot connect to database
Fatal: Configuration file not found

Resolution: Fix application code or configuration

2. Resource Limits (20%)

Symptoms in events:

OOMKilled
Liveness probe failed

Resolution: Increase resource limits

3. Missing Dependencies (15%)

Symptoms in logs:

Error: Secret "db-credentials" not found
Error: ConfigMap "app-config" not found

Resolution: Create missing resources

4. Image Pull Error (5%)

Symptoms in events:

Failed to pull image "api-service:v2.0.0"
ImagePullBackOff

Resolution: Fix image name or registry credentials

Resolution Steps

For Application Errors

Step 1: Identify error from logs

kubectl logs <pod-name> --previous | tail -50

Step 2: Rollback to previous version

kubectl rollout undo deployment/api-service

Step 3: Verify rollback

kubectl rollout status deployment/api-service

Expected: deployment "api-service" successfully rolled out

For Resource Limits

Step 1: Check current limits

kubectl get deployment api-service -o jsonpath='{.spec.template.spec.containers[0].resources}'

Step 2: Increase limits

kubectl set resources deployment/api-service \
  --limits=cpu=2,memory=4Gi \
  --requests=cpu=1,memory=2Gi

Step 3: Verify pods running

kubectl get pods -l app=api-service

For Missing Dependencies

Step 1: Identify missing resource

kubectl describe pod <pod-name> | grep -A 5 "Error"

Step 2: Create missing secret/configmap

# Example: Create database credentials
kubectl create secret generic db-credentials \
  --from-literal=username=app \
  --from-literal=password=secret123

Step 3: Restart deployment

kubectl rollout restart deployment/api-service

Verification (2 minutes)

Check pod status
```
kubectl get pods -l app=api-service
```
Expected: All pods Running with READY 1/1
Check application logs
```
kubectl logs -l app=api-service --tail=20
```
Expected: No error messages
Test endpoint
```
curl -i https://api.example.com/health
```
Expected: 200 OK

Escalation

If not resolved in 10 minutes → Escalate to @platform-team
If application-specific → Escalate to @app-team
If database-related → Escalate to @database-team


### Runbook 2: PostgreSQL Connection Pool Exhausted

```markdown
# Runbook: PostgreSQL Connection Pool Exhausted

## Metadata
- **ID**: RB-DB-001
- **Severity**: SEV1
- **Estimated Time**: 15 minutes
- **Owner**: @database-team

## Symptoms
- Alert: `DatabaseConnectionPoolExhausted`
- Error logs: "Connection pool exhausted"
- Application timeouts
- High error rate

## Triage (3 minutes)

### Step 1: Check connection pool usage
```sql
SELECT count(*) FROM pg_stat_activity;

Compare to max connections:

SHOW max_connections;

Step 2: Identify connection sources

SELECT application_name, count(*) 
FROM pg_stat_activity 
GROUP BY application_name 
ORDER BY count DESC;

Step 3: Check for long-running queries

SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;

Common Causes

1. Connection Leak (50%)

Application not closing connections properly

2. Traffic Spike (30%)

Sudden increase in requests

3. Slow Queries (15%)

Queries holding connections for too long

4. Misconfigured Pool (5%)

Pool size too small for workload

Resolution Steps

Immediate Mitigation (5 minutes)

Step 1: Kill idle connections

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - state_change > interval '5 minutes';

Step 2: Restart application (releases connections)

kubectl rollout restart deployment/api-service

Step 3: Monitor connection count

SELECT count(*) FROM pg_stat_activity;

Expected: Connections drop below 80% of max

Long-term Fix

Step 1: Increase connection pool size (if needed)

# application config
database:
  pool:
    min: 5
    max: 50  # Increased from 20

Step 2: Fix connection leak (if found)

// Bad: Connection leak
async function badQuery() {
  const conn = await pool.getConnection();
  const result = await conn.query('SELECT * FROM users');
  return result; // Connection not released!
}

// Good: Always release
async function goodQuery() {
  const conn = await pool.getConnection();
  try {
    const result = await conn.query('SELECT * FROM users');
    return result;
  } finally {
    conn.release(); // Always released
  }
}

Step 3: Add connection pool monitoring

// Monitor pool usage
setInterval(() => {
  const poolStats = pool.getStats();
  metrics.gauge('db.pool.active', poolStats.active);
  metrics.gauge('db.pool.idle', poolStats.idle);
  metrics.gauge('db.pool.waiting', poolStats.waiting);
}, 10000);

Verification (2 minutes)

Check connection count
```
SELECT count(*) FROM pg_stat_activity;
```
Expected: < 80% of max_connections
Check error rate
```
# Should drop to < 1%
```
Monitor for 10 minutes
- Ensure connections don't grow again

Escalation

If connections don't drop → Escalate to @database-team
If application-specific leak → Escalate to @app-team
If traffic spike → Escalate to @infrastructure-team (scale up)


### Runbook 3: Redis Memory Maxed

```markdown
# Runbook: Redis Memory Maxed Out

## Metadata
- **ID**: RB-CACHE-001
- **Severity**: SEV2
- **Estimated Time**: 10 minutes

## Symptoms
- Alert: `RedisMemoryHigh`
- Redis evicting keys
- Cache misses increasing
- Application slow (cache not working)

## Triage (2 minutes)

### Step 1: Check memory usage
```bash
redis-cli INFO memory | grep used_memory_human

Step 2: Check eviction stats

redis-cli INFO stats | grep evicted_keys

Step 3: Check key count

redis-cli DBSIZE

Resolution Steps

Immediate Mitigation (5 minutes)

Step 1: Clear expired keys

redis-cli --scan --pattern "*" | xargs redis-cli DEL

Step 2: Flush least important data

# If using multiple databases
redis-cli -n 2 FLUSHDB  # Flush non-critical cache

Step 3: Increase memory limit (temporary)

kubectl set resources deployment/redis --limits=memory=4Gi

Long-term Fix

Step 1: Implement eviction policy

redis-cli CONFIG SET maxmemory-policy allkeys-lru

Step 2: Set TTL on keys

// Always set expiration
await redis.set('user:123', userData, 'EX', 3600); // 1 hour TTL

Step 3: Add memory monitoring

setInterval(async () => {
  const info = await redis.info('memory');
  const memoryUsage = parseFloat(info.match(/used_memory:(\d+)/)[1]);
  metrics.gauge('redis.memory.used', memoryUsage);
}, 30000);

Verification

Check memory usage

redis-cli INFO memory | grep used_memory_human

Expected: < 80% of maxmemory

Check cache hit rate
```
redis-cli INFO stats | grep keyspace_hits
```
Expected: Hit rate > 80%


### Runbook 4: API 5xx Spike

```markdown
# Runbook: API 5xx Error Spike

## Metadata
- **ID**: RB-API-001
- **Severity**: SEV1
- **Estimated Time**: 15 minutes

## Symptoms
- Alert: `HighErrorRate`
- Error rate > 5%
- Users reporting errors
- 500/503 responses

## Triage (3 minutes)

### Step 1: Check error rate
```bash
# Prometheus query
rate(http_requests_total{status=~"5.."}[5m])

Step 2: Identify affected endpoints

kubectl logs -l app=api --since=5m | \
  grep "HTTP/1.1 5" | \
  awk '{print $7}' | \
  sort | uniq -c | sort -rn

Step 3: Check recent deployments

kubectl rollout history deployment/api-service

Decision Tree

Error rate > 50%?
├─ YES → Recent deployment (< 1 hour)?
│   ├─ YES → ROLLBACK immediately
│   └─ NO → Check dependencies
│       ├─ Database down? → Escalate to DB team
│       └─ External API down? → Enable fallback
└─ NO (5-50% errors) → Investigate specific endpoint
    ├─ Single endpoint failing? → Disable endpoint
    └─ All endpoints affected? → Check resources

Resolution Steps

If Recent Deployment (Most Common)

Step 1: Rollback deployment

kubectl rollout undo deployment/api-service

Step 2: Monitor error rate

# Should drop within 2 minutes

Step 3: Verify rollback

kubectl rollout status deployment/api-service

If Database Issue

Step 1: Check database connectivity

kubectl run -it --rm debug --image=postgres:14 --restart=Never -- \
  psql -h db.example.com -U app -c "SELECT 1"

Step 2: If database down, escalate

# Page database team
# Check: database/postgres-failover.md runbook

If External API Issue

Step 1: Check vendor status

curl https://status.stripe.com/api/v2/status.json

Step 2: Enable fallback

// Feature flag to disable external API
await launchdarkly.variation('use-backup-payment-processor', user, false);

Verification

Error rate < 1%
Latency back to normal
No user complaints
Monitor for 10 minutes


## 14. Runbook Templates

### Quick Runbook Template

```markdown
# Runbook: [Title]

## Symptoms
- [What you'll see]

## Quick Fix
```bash
# Commands to run

If That Doesn't Work

[Step 1]
[Step 2]
Escalate to [team]

Verification

[How to confirm it's fixed]


### Detailed Runbook Template

See [Section 4: Runbook Structure](#4-runbook-structure)

## 15. Common Runbook Antipatterns

### Antipattern 1: Too Vague

❌ Bad: "Fix the database issue"

✓ Good: "1. Check database connectivity:

psql -h db.example.com -U app -c 'SELECT 1'

If connection fails, check credentials: [specific steps...]"


### Antipattern 2: Outdated Commands

❌ Bad:

docker ps  # We migrated to Kubernetes 6 months ago

✓ Good:

kubectl get pods -l app=api-service

Prevention: Test runbooks quarterly


### Antipattern 3: No Expected Outcomes

❌ Bad: "Run this command:

kubectl get pods
```"

✓ Good:
"Run this command:
```bash
kubectl get pods -l app=api-service

Expected output:

NAME                          READY   STATUS    RESTARTS   AGE
api-service-7d9f8b6c4-abc12   1/1     Running   0          5m

If you see CrashLoopBackOff, proceed to Step 5."


### Antipattern 4: No Escalation Path

❌ Bad: "If this doesn't work, good luck!"

✓ Good: "If not resolved in 15 minutes:

Escalate to @senior-engineer
Join war room: [Zoom link]
Related runbooks: [links]"


### Antipattern 5: Copy-Paste Errors

❌ Bad:

kubectl delete pod api-service-abc123  # Specific pod name

✓ Good:

kubectl delete pod <pod-name>  # Replace with actual pod name
# Or use label selector:
kubectl delete pods -l app=api-service --field-selector=status.phase=Failed


## Summary

Key takeaways for On-Call Playbooks and Runbooks:

1. **Document common incidents** - Don't reinvent the wheel each time
2. **Be specific** - Exact commands, expected outcomes, time estimates
3. **Test regularly** - Runbooks get stale quickly
4. **Keep updated** - Update after each incident
5. **Make discoverable** - Good organization and search
6. **Link from alerts** - Runbook URL in alert annotations
7. **Include escalation paths** - Know when to ask for help
8. **Version control** - Track changes over time
9. **Automate when possible** - But keep manual for complex/risky operations
10. **Continuous improvement** - Runbooks should evolve with system

## Related Skills

- `41-incident-management/incident-triage` - Initial incident assessment
- `41-incident-management/escalation-paths` - When and how to escalate
- `40-system-resilience/disaster-recovery` - DR runbooks and procedures
- `40-system-resilience/postmortem-analysis` - Learning from incidents to improve runbooks

Sign in to Comment

on-call-playbooks

name: On-Call Playbooks description: Comprehensive runbooks and playbooks for on-call incident response

On-Call Playbooks and Runbooks

Overview

1. Purpose of Playbooks and Runbooks

Why They Matter

When to Use

2. Playbook vs Runbook Distinction

Playbook

Runbook

Relationship

3. Essential Playbooks

Playbook 1: Service is Down

Decision Tree

Common Causes

Runbooks to Execute

Communication

Decision Tree

Runbooks to Execute

Decision Tree

Quick Actions

Immediate Actions

Long-term Fix

Immediate Actions

Investigation

Immediate Actions (After Expiration)

Long-term Fix

Immediate Actions

Escalation

Assessment (5 minutes)

Recovery Options

Option 1: Point-in-Time Recovery

Option 2: Replay from WAL

Option 3: Restore from Replica

Communication

Common Causes

Resolution Steps

Step 1: [Action]

Step 2: [Action]

Rollback Procedures

Escalation Path

Related Runbooks

Verification

Post-Incident

Notes

6. Runbook Organization and Discoverability

Directory Structure

Runbook Index

Search Tags

7. Runbook Versioning

Version Control

Git-Based Versioning

8. Runbook Testing (DR Drills)

Test Schedule

Test Procedure

9. Auto-Remediation vs Manual Runbooks

When to Automate

When to Keep Manual

Hybrid Approach

10. Runbook Tools

PagerDuty Runbooks

Confluence / Notion

GitHub Wikis

Internal Tools

11. Runbook Maintenance (Keep Updated)

Maintenance Schedule

Runbook Health Metrics

12. Integration with Monitoring (Links from Alerts)

Alert to Runbook Linking

PagerDuty Integration

13. Example Runbooks

Runbook 1: Kubernetes Pod CrashLoop

Step 2: Check pod logs

Step 3: Check pod events

Common Causes

1. Application Error (60%)

2. Resource Limits (20%)

3. Missing Dependencies (15%)

4. Image Pull Error (5%)