Debug recipes: actor-system forensics
Field-tested techniques for diagnosing hangs, wedges and cross-actor state bugs WITHOUT a debugger attached (or when py-spy ain’t installed).
Wedged actor triage (no REPL)
- find the tree:
    ps -eo pid,etime,args | grep -E 'pytest|tractor._child'
  long-etime tractor._child procs w/ a stuck parent = wedge.
- kernel state:
    cat /proc/<pid>/wchan
    grep -E 'State|Threads' /proc/<pid>/status
  do_epoll_wait + sleeping = idle event loop, NOT cpu-spin.
- the money read, socket queues:
    ss -tnp | grep <pid>
  - Recv-Q > 0 on the parent-IPC conn = the actor STOPPED CONSUMING its msg loop (runtime bug); the parent is waiting on it.
  - zero external (api/ws) conns = wedged before/without provider IO; don’t blame the network.
  - CLOSE-WAIT lingerers = unclean peer teardown.
- cleanup:
    pkill -f tractor._child
  (NB: in compound shell cmds pkill’s exit code poisons && chains; run it standalone).
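The socket-queue read above can be scripted when there are many conns to eyeball. A minimal sketch, assuming the usual `ss -tn`/`ss -tnp` column order (State, Recv-Q, Send-Q, Local, Peer); the helper name `backed_up_conns` and the sample text are made up for illustration, not captured from a real run:

```python
def backed_up_conns(ss_output):
    """Return (state, recv_q, local, peer) for conns with unread bytes queued."""
    hits = []
    for line in ss_output.splitlines():
        cols = line.split()
        # data rows: State Recv-Q Send-Q Local Peer [Process]
        if len(cols) < 5 or not cols[1].isdigit():
            continue  # header or malformed row
        recv_q = int(cols[1])
        if recv_q > 0:
            hits.append((cols[0], recv_q, cols[3], cols[4]))
    return hits


# hypothetical sample in the shape `ss -tnp` prints
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
ESTAB      0      0      127.0.0.1:6979      127.0.0.1:51000   users:(("python",pid=4242,fd=9))
ESTAB      4096   0      127.0.0.1:51002     127.0.0.1:6979    users:(("python",pid=4242,fd=11))
CLOSE-WAIT 1      0      127.0.0.1:51004     127.0.0.1:6979    users:(("python",pid=4242,fd=12))
"""

for hit in backed_up_conns(SAMPLE):
    print(hit)
```

Against a live box the input would come from something like `subprocess.run(['ss', '-tnp'], capture_output=True, text=True).stdout`, filtered to the suspect pid first.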
Hang-proof test gating
- per-suite, never combined (cross-suite session state interacts w/ the 2nd-boot wedge):
    timeout -k 5 300 python -m pytest tests/<one>.py -q
- rc 124/143 = hang-kill -> retry ONCE before investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate signatures match the known 2nd-boot wedge (see gotchas.md).
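The retry-once rule can be wrapped so it is applied the same way every time. A rough sketch, assuming the rc values listed above; `run_gated` and the injectable `run` parameter are hypothetical names (the fake runner exists only so the logic can be checked without spawning pytest):

```python
import subprocess

HANG_RCS = {124, 143}  # hang-kill exit codes, as listed in the recipe above


def run_gated(cmd, retries=1, run=subprocess.run):
    """Run cmd; re-run at most `retries` times, and only after a hang-kill."""
    rcs = []
    for _ in range(retries + 1):
        rc = run(cmd).returncode
        rcs.append(rc)
        if rc not in HANG_RCS:
            break  # a real pass/fail: retrying would only hide it
    return rcs


class FakeRun:
    """Test double: replays a fixed list of return codes."""

    def __init__(self, rcs):
        self._rcs = iter(rcs)

    def __call__(self, cmd):
        return subprocess.CompletedProcess(cmd, next(self._rcs))


print(run_gated(['pytest'], run=FakeRun([124, 0])))  # hang, then a clean retry
```

For real use the command would be the `timeout -k 5 300 python -m pytest ...` line from the list, passed as an argv list. A second hang-kill in the returned list is the cue to start investigating.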
Regression vs pre-existing attribution
When a failure appears mid-refactor:
1. git stash -u (or checkout the file subset) and re-run the EXACT failing case at baseline.
2. if baseline can’t even run, selectively revert ONLY the suspect layer: git diff <files> > /tmp/x.patch; git checkout <files> -> test -> git apply /tmp/x.patch.
3. flake-rate compare (3x runs) beats single-shot conclusions.
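The flake-rate compare in step 3 is simple arithmetic, sketched here under stated assumptions: `fail_rate` and `attribute` are hypothetical helpers, `run_once` is any callable returning an exit code, and the three labels are an illustrative reading of the rates rather than a fixed rule. With only 3 runs per side the result is a hint, not proof.

```python
def fail_rate(run_once, runs=3):
    """Fraction of `runs` invocations where run_once() returned non-zero."""
    fails = sum(1 for _ in range(runs) if run_once() != 0)
    return fails / runs


def attribute(baseline_rate, patched_rate):
    """Label a failure by comparing baseline vs patched failure rates."""
    if baseline_rate > 0:
        return 'pre-existing'  # it fails without the change too
    if patched_rate > 0:
        return 'regression'    # only fails with the change applied
    return 'clean'


# stand-in runners replaying canned exit codes (no real tests are run)
baseline = iter([0, 0, 0])
patched = iter([0, 1, 1])
b = fail_rate(lambda: next(baseline))
p = fail_rate(lambda: next(patched))
print(b, p, attribute(b, p))
```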
Off-by-one / stale IPC resp detection
Mismatched query->result content in logs (a resp payload that obviously belongs to a prior request) = a shared req/resp channel w/o correlation + a cancelled caller. Grep the endpoint (ep) for move_on_after/fail_after around proxied calls. Fix = req-id (mid) tagging, never “just a lock” (a lock serializes callers, but a cancelled caller’s resp still arrives later and gets read by the next one).
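A toy illustration of why mid tagging fixes the off-by-one: each request carries an id, and a response whose id has no live waiter is dropped instead of being handed to whoever reads next. This is a synchronous sketch with a hypothetical `Correlator` class and placeholder payloads; it is not the runtime's actual IPC code.

```python
import itertools


class Correlator:
    """Match responses to requests by message id (mid)."""

    def __init__(self):
        self._mids = itertools.count(1)
        self._pending = set()  # mids still awaited by a live caller
        self.results = {}      # mid -> delivered response payload
        self.dropped = []      # mids of responses that found no waiter

    def send(self, payload):
        mid = next(self._mids)
        self._pending.add(mid)
        return {'mid': mid, 'payload': payload}

    def cancel(self, mid):
        # caller timed out / was cancelled: stop waiting on this mid
        self._pending.discard(mid)

    def recv(self, msg):
        mid = msg['mid']
        if mid not in self._pending:
            self.dropped.append(mid)  # stale resp: orphaned, not misdelivered
            return False
        self._pending.remove(mid)
        self.results[mid] = msg['payload']
        return True


chan = Correlator()
first = chan.send('query-A')
chan.cancel(first['mid'])                             # caller A gives up
second = chan.send('query-B')
chan.recv({'mid': first['mid'], 'payload': 'resp-A'})   # arrives late
chan.recv({'mid': second['mid'], 'payload': 'resp-B'})
print(chan.results, chan.dropped)
```

Without the mid check, the late `resp-A` would be the first thing caller B reads, which is exactly the log signature described above.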
Logging-chain audits
When records double-print or go bare (see gotchas.md):
    import logging

    l = logging.getLogger('piker.brokers.ib.broker')
    while l:
        print(l.name, l.level, l.handlers, l.propagate)
        l = l.parent

Exactly ONE stderr handler should exist in the chain, attached by the actor’s daemon fixture.
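The "exactly one" rule can be turned into a check rather than an eyeball read. A small sketch using only stdlib `logging`; the helper name `stderr_handlers` and the `demo.*` logger names are made up for illustration. It stops climbing at the first logger with `propagate` off, since records go no further than that.

```python
import logging
import sys


def stderr_handlers(name):
    """Collect (logger_name, handler) pairs writing to stderr along the
    propagation chain that starts at logger `name`."""
    found = []
    log = logging.getLogger(name)
    while log:
        for h in log.handlers:
            if (
                isinstance(h, logging.StreamHandler)
                and getattr(h, 'stream', None) is sys.stderr
            ):
                found.append((log.name, h))
        if not log.propagate:
            break  # records stop climbing here
        log = log.parent
    return found


# hypothetical setup: one handler high in the tree is the healthy case
logging.getLogger('demo').addHandler(logging.StreamHandler(sys.stderr))
healthy = [n for n, _ in stderr_handlers('demo.svc.child')]

# a second handler lower down is the double-print case
logging.getLogger('demo.svc.child').addHandler(logging.StreamHandler(sys.stderr))
doubled = [n for n, _ in stderr_handlers('demo.svc.child')]
print(healthy, doubled)
```

A result of length 0 corresponds to the "bare" records case and length 2 or more to double-printing.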
Live actor-tree smoke (headless)
Boot against an ALT registry port so a user’s running stack is untouched; script in a REAL file (tractor children re-exec __main__ from path — stdin scripts crash w/ FileNotFoundError: .../<stdin>):
    async with maybe_open_pikerd(
        registry_addrs=[('127.0.0.1', 6979)],
    ):
        async with open_feed(['xbtusdt.kraken']) as feed:
            assert await check_for_service('datad.kraken')
            assert not await check_for_service(
                'brokerd.kraken'
            )

In-proc fail-fast unit checks
Spawn-path guards that raise BEFORE touching the runtime can be tested w/ a bare trio.run() (eg spawn_brokerd('kucoin') raising the datad-only error) — no pikerd needed.
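The shape of such a check, as a hedged sketch: `spawn_brokerd_guarded`, the `DATAD_ONLY` set and the `ValueError` type are hypothetical stand-ins, not the real spawn path or its real error. `trio.run(async_fn, *args)` is used when trio is importable; otherwise stdlib `asyncio.run` stands in, which works here only because the guard raises before any runtime call is made.

```python
import asyncio

try:
    import trio
    run = trio.run                 # trio.run(async_fn, *args)
except ImportError:                # stdlib stand-in with the same call shape
    def run(async_fn, *args):
        return asyncio.run(async_fn(*args))

DATAD_ONLY = {'kucoin'}  # hypothetical: backends with no brokerd side


async def spawn_brokerd_guarded(name):
    """Hypothetical spawn path: the guard fires before any runtime use."""
    if name in DATAD_ONLY:
        raise ValueError(f'{name!r} is datad-only')
    return 'spawned'  # placeholder for the real (runtime-touching) work


def guard_error(name):
    """Return the guard's message, or None if the guard let `name` through."""
    try:
        run(spawn_brokerd_guarded, name)
    except ValueError as err:
        return str(err)
    return None


print(guard_error('kucoin'))
```

Because nothing past the guard executes for a rejected name, the check needs no registry, no actor tree and no daemon fixture.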