# Debug recipes: actor-system forensics

Field-tested techniques for diagnosing hangs, wedges and cross-actor state bugs WITHOUT a debugger attached (or when `py-spy` ain't installed).

## Wedged actor triage (no REPL)

1. find the tree: `ps -eo pid,etime,args | grep -E 'pytest|tractor._child'`; long-`etime` `tractor._child` procs w/ a stuck parent = wedge.
2. kernel state: `cat /proc//wchan` + `status | grep -E 'State|Threads'`; `do_epoll_wait` + sleeping = idle event loop, NOT cpu-spin.
3. **the money read**, socket queues: `ss -tnp | grep `
   - `Recv-Q > 0` on the parent-IPC conn = the actor STOPPED CONSUMING its msg loop (runtime bug), parent is waiting on it.
   - zero external (api/ws) conns = wedged before/without provider IO; don't blame the network.
   - `CLOSE-WAIT` lingerers = unclean peer teardown.
4. cleanup: `pkill -f tractor._child` (NB: in compound shell cmds `pkill`'s exit code poisons `&&` chains, so run it standalone).

## Hang-proof test gating

- per-suite, never combined (cross-suite session state interacts w/ the 2nd-boot wedge): `timeout -k 5 300 python -m pytest tests/.py -q`
- rc 124/143 = hang-kill -> retry ONCE before investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate signatures match the known 2nd-boot wedge (see gotchas.md).

## Regression vs pre-existing attribution

When a failure appears mid-refactor:

1. `git stash -u` (or checkout the file subset) and re-run the EXACT failing case at baseline.
2. if baseline can't even run, selectively revert ONLY the suspect layer: `git diff > /tmp/x.patch; git checkout ` -> test -> `git apply /tmp/x.patch`.
3. flake-rate compare (3x runs) beats single-shot conclusions.

## Off-by-one / stale IPC resp detection

Mismatched query->result content in logs (resp payload obviously for a prior request) = shared req/resp channel w/o correlation + a cancelled caller. Grep the ep for `move_on_after`/`fail_after` around proxied calls.
Fix = req-id (`mid`) tagging, never "just a lock" (cancellation still orphans).

## Logging-chain audits

When records double-print or go bare (see gotchas.md):

```python
import logging

# walk from the leaf logger up to the root, dumping each
# level's config; the root logger's `.parent` is `None`.
log = logging.getLogger('piker.brokers.ib.broker')
while log is not None:
    print(log.name, log.level, log.handlers, log.propagate)
    log = log.parent
```

Exactly ONE stderr handler should exist in the chain, attached by the actor's daemon fixture.

## Live actor-tree smoke (headless)

Boot against an ALT registry port so a user's running stack is untouched; script in a REAL file (tractor children re-exec `__main__` from path, so stdin scripts crash w/ `FileNotFoundError: .../`):

```python
# fragment: the ALT registry port keeps this tree
# separate from any already-running stack.
async with maybe_open_pikerd(
    registry_addrs=[('127.0.0.1', 6979)],
):
    async with open_feed(['xbtusdt.kraken']) as feed:
        assert await check_for_service('datad.kraken')
        assert not await check_for_service(
            'brokerd.kraken'
        )
```

## In-proc fail-fast unit checks

Spawn-path guards that raise BEFORE touching the runtime can be tested w/ a bare `trio.run()` (eg `spawn_brokerd('kucoin')` raising the datad-only error); no pikerd needed.
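
Back on the stale-resp recipe: a minimal, stdlib-only sketch of why `mid` tagging works where serialization alone does not. Everything here is hypothetical and for illustration only: `Chan` is a toy in-memory stand-in for a shared req/resp channel (not a tractor API), and the `cancelled` flag stands in for a caller bailing out via `move_on_after`/`fail_after` before reading its resp. The calls run strictly one after another, which is the best a lock could give you, and the untagged path still returns the prior request's payload.

```python
from collections import deque
from itertools import count


class Chan:
    '''
    Toy shared req/resp channel (hypothetical): the "far end"
    answers every request in order and echoes back any `mid`.

    '''
    def __init__(self) -> None:
        self._resps: deque[dict] = deque()

    def send(self, req: dict) -> None:
        resp = {'result': f"result-for-{req['query']}"}
        if 'mid' in req:
            resp['mid'] = req['mid']
        self._resps.append(resp)

    def recv(self) -> dict:
        return self._resps.popleft()


_mids = count()  # monotonically increasing request ids


def call_untagged(chan: Chan, query: str, cancelled: bool = False):
    chan.send({'query': query})
    if cancelled:
        # caller gave up; its resp is now orphaned in the queue
        return None
    # blindly trusts that the next resp is ours
    return chan.recv()['result']


def call_tagged(chan: Chan, query: str, cancelled: bool = False):
    mid = next(_mids)
    chan.send({'query': query, 'mid': mid})
    if cancelled:
        return None
    while True:
        resp = chan.recv()
        if resp['mid'] == mid:
            return resp['result']
        # else: orphaned resp from a cancelled caller, drop it


# untagged: the 2nd caller reads the 1st caller's orphaned resp
chan = Chan()
call_untagged(chan, 'a', cancelled=True)
stale = call_untagged(chan, 'b')

# tagged: the orphan is skipped, the 2nd caller gets its own resp
chan = Chan()
call_tagged(chan, 'a', cancelled=True)
fresh = call_tagged(chan, 'b')

print(stale, fresh)
```

Here `stale` comes back as the payload for query `'a'` even though `'b'` was asked, which is the log signature described in that recipe; `fresh` is correct because mismatched ids are discarded on read.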