piker/.claude/skills/piker-conc-expert/debug-recipes.md

101 lines
3.1 KiB
Markdown

# Debug recipes: actor-system forensics
Field-tested techniques for diagnosing hangs, wedges
and cross-actor state bugs WITHOUT a debugger attached
(or when `py-spy` ain't installed).
## Wedged actor triage (no REPL)
1. find the tree:
`ps -eo pid,etime,args | grep -E 'pytest|tractor._child'`
— long-`etime` `tractor._child` procs w/ a stuck
parent = wedge.
2. kernel state:
`cat /proc/<pid>/wchan` + `status | grep -E
'State|Threads'``do_epoll_wait` + sleeping =
idle event loop, NOT cpu-spin.
3. **the money read** — socket queues:
`ss -tnp | grep <pid>`
- `Recv-Q > 0` on the parent-IPC conn = the actor
STOPPED CONSUMING its msg loop (runtime bug),
parent is waiting on it.
- zero external (api/ws) conns = wedged before/
without provider IO; don't blame the network.
- `CLOSE-WAIT` lingerers = unclean peer teardown.
4. cleanup: `pkill -f tractor._child` (NB: in
compound shell cmds `pkill`'s exit code poisons
`&&` chains — run it standalone).
## Hang-proof test gating
- per-suite, never combined (cross-suite session
state interacts w/ the 2nd-boot wedge):
`timeout -k 5 300 python -m pytest tests/<one>.py -q`
- rc 124/143 = hang-kill -> retry ONCE before
investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate
signatures match the known 2nd-boot wedge (see
gotchas.md).
## Regression vs pre-existing attribution
When a failure appears mid-refactor:
1. `git stash -u` (or checkout the file subset) and
re-run the EXACT failing case at baseline.
2. if baseline can't even run, selectively revert
ONLY the suspect layer:
`git diff <files> > /tmp/x.patch;
git checkout <files>` -> test ->
`git apply /tmp/x.patch`.
3. flake-rate compare (3x runs) beats single-shot
conclusions.
## Off-by-one / stale IPC resp detection
Mismatched query->result content in logs (resp
payload obviously for a prior request) = shared
req/resp channel w/o correlation + a cancelled
caller. Grep the ep for `move_on_after`/`fail_after`
around proxied calls. Fix = req-id (`mid`) tagging,
never "just a lock" (cancellation still orphans).
## Logging-chain audits
When records double-print or go bare (see gotchas.md):
```python
import logging
l = logging.getLogger('piker.brokers.ib.broker')
while l:
print(l.name, l.level, l.handlers, l.propagate)
l = l.parent
```
Exactly ONE stderr handler should exist in the chain,
attached by the actor's daemon fixture.
## Live actor-tree smoke (headless)
Boot against an ALT registry port so a user's running
stack is untouched; script in a REAL file (tractor
children re-exec `__main__` from path — stdin scripts
crash w/ `FileNotFoundError: .../<stdin>`):
```python
async with maybe_open_pikerd(
registry_addrs=[('127.0.0.1', 6979)],
):
async with open_feed(['xbtusdt.kraken']) as feed:
assert await check_for_service('datad.kraken')
assert not await check_for_service(
'brokerd.kraken'
)
```
## In-proc fail-fast unit checks
Spawn-path guards that raise BEFORE touching the
runtime can be tested w/ a bare `trio.run()` (eg
`spawn_brokerd('kucoin')` raising the datad-only
error) — no pikerd needed.