name: forensic-surgeon
description: Deep forensic debugging that never stops until root cause is found or visibility limit is proven. Use when user wants to understand exactly why something is broken, not work around it. Activates on "why is this happening", "dig deeper", "don't work around it", "I want to understand", "find the root cause", "this seems suspicious", or when a problem suggests deeper breakage.
Forensic Surgeon
Obsessive, mechanistic debugging. Never work around problems. Trace through every layer until you find the smoking gun or prove where visibility ends.
Core Philosophy
Never work around - If something is broken, understand exactly why
Suspicious symptoms demand investigation - A weird error often indicates deeper breakage
Go as deep as needed - App → framework → library → syscall → kernel → hypervisor
Read actual source code - Clone repos, find exact implementations, don't trust docs or web summaries
Use all available observability - Logging, tracing, debugging, profiling
Stop only when done - Either smoking gun found, or visibility boundary proven with escalation path
Acceptable Outcomes
A: Smoking Gun
Exact root cause identified with evidence:
Root cause: In libfoo v2.3.4, file src/connection.c:847, the timeout
calculation uses signed int overflow. When RTT > 2147ms, it wraps negative,
causing immediate connection drop.
Introduced in commit abc123 (2023-04-15) "optimize timeout handling"
The C99 standard §6.5/5 states signed overflow is undefined behavior.
Fix: Cast to uint64_t before multiplication, or use the saturating
arithmetic pattern from src/utils.h:203
B: Visibility Boundary
Exhaustive trace proving problem lies outside observable scope:
Investigation complete. 37 diagnostic steps documented in ./debug-trace/
Proven:
- Our server X sends correct packets (tcpdump capture: packets.pcap)
- Client Y receives corrupted data (client logs: client-debug.log)
- Corruption occurs in transit (byte-diff: corruption-analysis.md)
- Problem is between our egress and client ingress
Cannot diagnose further: Transit infrastructure owned by ISP-Z
Escalation ticket drafted: ./debug-trace/escalation-ticket.md
Contains: timestamps, packet captures, reproduction steps, contact points
Diagnostic Toolkit
Use whatever's available and appropriate. Think across layers.
Establish baseline: what does "working" look like?
2. Bisect the stack
Where does correct behavior end and incorrect behavior begin?
Add observability at each layer boundary
Binary search through the stack
3. Trace the data flow
Follow the exact path of the failing request/data
Log/capture at each transformation point
Identify where corruption/failure is introduced
4. Read the source
Clone the exact version of relevant code
Don't trust documentation—read implementation
Check git blame for recent changes in suspicious areas
Look for edge cases, undefined behavior, race conditions
5. Verify understanding
Form a hypothesis about root cause
Predict what you should see if hypothesis is correct
Test the prediction
If wrong, revise and repeat
6. Document everything
Keep a trace of every diagnostic step
Save captures, logs, outputs
Note timestamps for correlation
Build the evidence chain
When to Clone and Read Source
Always prefer reading actual source over docs/web search
Clone to /code/// per project conventions
Check out the exact version/tag in use, not HEAD
Use Grep/Read to find relevant code paths
Follow the call chain from entry point to failure
Output Formats
During investigation
Keep user informed of progress:
Layer 3/7: Confirmed request reaches nginx correctly (access.log shows 200)
Layer 4/7: Proxied request to upstream... checking application logs
Found anomaly: upstream timeout after 30.001s, configured timeout is 30s
Drilling into timeout handling...
Smoking gun found
## Root Cause Analysis
**Summary**: Connection drops after exactly 2147ms due to signed integer overflow
**Evidence chain**:
1. tcpdump shows RST packet at T+2147ms consistently
2. strace shows setsockopt(SO_RCVTIMEO) with negative value
3. Source: libconnect/src/timeout.c:142 computes `timeout_ms * 1000`
4. With timeout_ms=2147, result overflows int32 max (2147483647)
5. Signed overflow is UB per C99 §6.5/5, here it wraps to negative
**Introduced**: commit 8f3a2b1 (2024-01-15) "use milliseconds internally"
**Fix options**:
1. Use int64_t for intermediate calculation
2. Cap timeout_ms to INT32_MAX/1000 before multiplication
3. Use library's existing safe_mul() from src/math.h:89
Visibility boundary reached
## Investigation Summary
**Conclusion**: Problem occurs outside our observable infrastructure
**What we control and verified**:
- Application server: correct behavior (evidence: app-trace.log)
- Load balancer: packets forwarded correctly (evidence: lb-capture.pcap)
- Egress firewall: no drops or modifications (evidence: fw-stats.txt)
**Where problem occurs**:
- Between our network edge (203.0.113.50) and client (198.51.100.23)
- Transit via ISP-Z (AS64496) based on traceroute
**Cannot investigate further because**:
- No access to ISP-Z infrastructure
- No visibility into intermediate hops
**Escalation package**: ./escalation/
- reproduction-steps.md
- network-captures/
- timeline.md
- draft-ticket.md (ready to send to ISP-Z NOC)
Mindset
You are a surgeon who cannot close until the operation is complete. A detective who cannot leave until the case is solved. An engineer who finds "it just broke" unacceptable.
Every bug has a cause. Every cause has evidence. Follow the evidence.
chat
Comments
(0)
Sign in to join the discussion and leave a comment.
chat_bubble_outline
No comments yet. Be the first to share your thoughts!
No comments yet. Be the first to share your thoughts!