name: observability
description: Observability patterns including logging, metrics, tracing, and alerting. Auto-triggers when implementing monitoring, debugging production issues, or setting up alerts.
Observability Skill
Three Pillars of Observability
1. Logs
- What happened: Discrete events with context
- Use for: Debugging, audit trails, error investigation
- Challenge: Volume and searchability
2. Metrics
- How much/how often: Numeric measurements over time
- Use for: Dashboards, alerting, capacity planning
- Challenge: Cardinality explosion
3. Traces
- Where time was spent: Request flow across services
- Use for: Latency analysis, dependency mapping
- Challenge: Sampling and storage
Structured Logging
Log Format
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user_789",
  "error": {
    "type": "PaymentDeclined",
    "code": "INSUFFICIENT_FUNDS"
  },
  "duration_ms": 234
}
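One way to emit entries in this shape is a custom formatter for Python's standard `logging` module. This is a minimal sketch, not a definitive implementation: the `JsonFormatter` class and the convention of passing context through `extra={"fields": {...}}` are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Assumed convention: callers attach context via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("payment-service"))
    logger = logging.getLogger("payment-service")
    logger.addHandler(handler)
    logger.error(
        "Payment failed",
        extra={"fields": {"trace_id": "abc123", "span_id": "def456", "duration_ms": 234}},
    )
```

Keeping the context under a single `fields` key avoids clashing with the attribute names that `LogRecord` reserves for itself.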
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Unexpected but recoverable |
| INFO | Business events, state changes |
| DEBUG | Development troubleshooting |
| TRACE | Fine-grained diagnostic |
Best Practices
- Use structured JSON format
- Include correlation IDs (trace_id)
- Never log sensitive data (PII, secrets)
- Use consistent field names
- Set appropriate log levels
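The rule about sensitive data can be enforced in code rather than by convention. The sketch below masks known-sensitive keys before a record is serialized. The key list and the `redact` helper are hypothetical, and a real deployment would need a list tailored to its own data.

```python
from typing import Any

# Assumed list of sensitive keys; adjust to the fields your services handle.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "ssn", "email"}


def redact(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of fields with sensitive values masked, recursing into nested dicts."""
    clean: dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

Matching on key names only catches fields that are labeled as sensitive. Secrets embedded in free-text messages need separate handling.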
Metrics Design
Types of Metrics
| Type | Example | Behavior |
|---|---|---|
| Counter | requests_total | Monotonically increasing |
| Gauge | temperature_celsius | Value that goes up/down |
| Histogram | request_duration_seconds | Distribution of values |
| Summary | request_latency_quantiles | Quantile calculations |
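To make the differences concrete, here is a small in-memory sketch of the first three types. It is illustrative only: the class names mirror the table, but this is not the API of any particular metrics library, and Summary is omitted because streaming quantile estimation does not fit a short example.

```python
import bisect
import math


class Counter:
    """Monotonically increasing value, e.g. requests_total."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. temperature_celsius."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets, e.g. request_duration_seconds."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # A final +Inf bucket catches everything above the largest bound.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Pick the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[tuple[float, int]]:
        """Running count of observations <= each upper bound."""
        running, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out
```

A histogram keeps only bucket counts and a sum, so percentiles derived from it are estimates whose accuracy depends on the chosen bucket bounds.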
Naming Convention
<namespace>_<name>_<unit>
Examples:
- http_requests_total
- http_request_duration_seconds
- db_connections_active
- queue_messages_waiting
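A naming convention is easier to keep when it is checked automatically. The following sketch lints a metric name against the pattern above. The `lint_metric_name` helper and its unit table are hypothetical, and the preference for base units such as seconds and bytes follows common Prometheus naming guidance rather than anything stated in this document.

```python
import re

# Lowercase snake_case with at least two parts (namespace plus name).
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")

# Assumed mapping from scaled units to the base unit to use instead.
NON_BASE_UNITS = {"ms": "seconds", "milliseconds": "seconds", "kb": "bytes", "mb": "bytes"}


def lint_metric_name(name: str) -> list[str]:
    """Return a list of problems found in a metric name (empty if none)."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("use lowercase snake_case with a namespace prefix")
    for part in name.lower().split("_"):
        if part in NON_BASE_UNITS:
            problems.append(f"prefer base unit '{NON_BASE_UNITS[part]}' over '{part}'")
    return problems
```

All four example names above pass this check, while a name such as `http_request_duration_ms` is flagged for its unit.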
RED Method (Services)
- Rate: Requests per second
- Errors: Failed requests per second, or as a fraction of the rate
- Duration: Latency distribution
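As a rough illustration, the three RED values can be derived from a batch of request records collected over a fixed window. The `Request` record and `red_summary` function are hypothetical names, and treating HTTP status 500 and above as an error is an assumption.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute rate, error ratio, and duration stats for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error definition
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_s": statistics.median(durations) if durations else 0.0,
        "max_s": max(durations, default=0.0),
    }
```

In production these values usually come from counters and histograms in a metrics backend instead of raw request lists, but the arithmetic is the same.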
USE Method (Resources)
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Golden Signals
- Latency (response time)
- Traffic (requests/sec)
- Errors (error rate)
- Saturation (resource utilization)
Distributed Tracing
Trace Structure
Trace (trace_id: abc123)
├── Span: HTTP Request (span_id: 001, parent: null)
│   ├── Span: Auth Check (span_id: 002, parent: 001)
│   ├── Span: DB Query (span_id: 003, parent: 001)
│   │   └── Span: Connection Pool (span_id: 004, parent: 003)
│   └── Span: External API (span_id: 005, parent: 001)
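Tracing backends typically receive spans as a flat list and rebuild this hierarchy from the parent IDs. The sketch below shows that reconstruction. The `Span` record and `render_tree` function are hypothetical, and real spans also carry timestamps and attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, children listed under their parent."""
    children: dict[Optional[str], list[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.name} (span_id: {span.span_id})")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Spans may arrive in any order, which is why the function indexes by parent first and only then walks from the root.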
Context Propagation
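# HTTP headers (W3C Trace Context), format: version-trace_id-parent_id-flags
# IDs are shortened here; real values are 32 hex chars (trace_id) and 16 hex chars (parent_id)
traceparent: 00-abc123-def456-01
tracestate: vendor=value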
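A service that takes part in a trace has to read the incoming `traceparent` header and write a new one on outgoing calls. The sketch below assumes the W3C Trace Context layout, where real headers carry a 32-hex-character trace ID and a 16-hex-character parent ID. The function names are hypothetical, and the parser accepts only the strict four-field layout, which is a simplification.

```python
import re
from typing import NamedTuple, Optional

TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext, span_id: str) -> str:
    """Build the header for an outgoing call made from the span span_id."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

The trace ID stays the same across every hop, while the parent ID is replaced with the ID of the calling span, which is what lets the backend stitch spans into one tree.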
Sampling Strategies
| Strategy | Use Case |
|---|---|
| Always sample | Development, low traffic |
| Probabilistic | Production (1-10%) |
| Rate limiting | Control volume |
| Tail-based | Capture errors/slow requests |
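Two of these strategies are simple enough to sketch. The probabilistic sampler below derives its decision from the trace ID, so every service reaches the same verdict for a given trace. The tail-based check runs after the request completes. Both function names, the hex trace ID assumption, and the 1000 ms slowness threshold are illustrative choices, not values from this document.

```python
def probabilistic_sample(trace_id: str, rate: float) -> bool:
    """Head sampling: keep roughly `rate` of traces, deterministically per trace ID.

    Assumes trace_id is a hexadecimal string.
    """
    # Map the last 8 hex digits to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x100000000
    return bucket < rate


def tail_sample(has_error: bool, duration_ms: float, slow_ms: float = 1000.0) -> bool:
    """Tail sampling: decide after completion, keeping errors and slow requests."""
    return has_error or duration_ms >= slow_ms
```

Head sampling is cheap but blind to outcomes. Tail sampling catches the interesting traces, at the cost of buffering every span until the request finishes.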
Alerting
Alert Design
# Good alert
name: High Error Rate
# sum() aggregates across all series so the ratio is service-wide, not per label set
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
for: 5m
severity: critical
annotations:
  summary: "Error rate above 1% for 5 minutes"
  runbook: "https://wiki/runbooks/high-error-rate"
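The `for: 5m` clause is what keeps a brief spike from paging anyone: the condition must hold continuously for the whole duration before the alert fires. The sketch below is a simplified model of that behavior. The `alert_state` function and its sample format are hypothetical, and the state names follow the inactive, pending, and firing states used by Prometheus.

```python
def alert_state(
    samples: list[tuple[float, float]],
    threshold: float = 0.01,
    for_s: float = 300.0,
) -> str:
    """Return the alert state after the last sample.

    samples holds (timestamp_seconds, error_ratio) pairs in ascending time order.
    """
    breach_start = None  # time at which the current unbroken breach began
    last_ts = None
    for ts, ratio in samples:
        last_ts = ts
        if ratio > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any recovery resets the timer
    if breach_start is None:
        return "inactive"
    return "firing" if last_ts - breach_start >= for_s else "pending"
```

A single evaluation below the threshold resets the timer, so a flapping condition stays pending and never reaches firing.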
Alert Quality
- Actionable: Clear remediation steps
- Relevant: Indicates real problems
- Timely: Fast enough to matter
- Not noisy: Avoid alert fatigue
SLOs and Error Budgets
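SLI: The measured indicator, e.g. the percentage of requests that complete in < 200ms
SLO: The target for that indicator, e.g. 99.9% availability per month
Error Budget: 100% - SLO = 0.1%, which is 43.2 minutes of downtime in a 30-day month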
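The error budget figure follows directly from the SLO and the length of the window. The sketch below works through that arithmetic and also tracks how much budget is left for a request-based SLO. Both function names are hypothetical, and the 30-day window is an assumption that matches the 43.2 minute figure.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is exhausted)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests observed, so there is no budget to measure")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 0.1% of 30 days * 24 hours * 60 minutes = 43,200 minutes
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

With a 99.9% SLO and one million requests, 1,000 failures are allowed, so 250 failures leave about three quarters of the budget unspent.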
Dashboards
Layout Principles
- Overview first: Key metrics at top
- Then details: Drill-down sections
- Time alignment: Consistent time ranges
- Annotations: Mark deployments/incidents
Essential Panels
- Request rate (traffic)
- Error rate (errors)
- Latency percentiles (P50, P95, P99)
- Resource utilization (CPU, memory)
- Queue depths (saturation)
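The latency percentile panel is worth a short illustration, because averages hide the slow tail that P95 and P99 expose. This sketch uses the nearest-rank method on raw samples. The `percentile` helper is hypothetical, and metrics backends usually estimate percentiles from histogram buckets instead.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 80, 95, 300, 110, 105, 90, 100, 2000, 115]
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

In this sample the mean is about 311 ms, but P50 is 105 ms and P95 is 2000 ms: one slow request dominates the average while the median barely notices it.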
Skill Details
GitHub Stars: 0
GitHub Forks: 0
Created: Jan 2026
Last Updated: 4 months ago