piker/.claude/skills/piker-conc-expert/debug-recipes.md

Debug recipes: actor-system forensics

Field-tested techniques for diagnosing hangs, wedges and cross-actor state bugs WITHOUT a debugger attached (or when py-spy isn't installed).

Wedged actor triage (no REPL)

  1. find the tree: ps -eo pid,etime,args | grep -E 'pytest|tractor._child' — long-etime tractor._child procs w/ a stuck parent = wedge.
  2. kernel state: cat /proc/<pid>/wchan and grep -E 'State|Threads' /proc/<pid>/status -> do_epoll_wait + sleeping = idle event loop, NOT cpu-spin.
  3. the money read — socket queues: ss -tnp | grep <pid>
    • Recv-Q > 0 on the parent-IPC conn = the actor STOPPED CONSUMING its msg loop (runtime bug), parent is waiting on it.
    • zero external (api/ws) conns = wedged before/without provider IO; don't blame the network.
    • CLOSE-WAIT lingerers = unclean peer teardown.
  4. cleanup: pkill -f tractor._child (NB: in compound shell cmds pkill's exit code poisons && chains, so run it standalone).
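
Steps 2 and 3 can be scripted when the same triage gets repeated across many pids. A minimal sketch, assuming Linux /proc and the usual `ss -tnp` column order (State, Recv-Q, Send-Q, local, peer, process); the helper names `kernel_state` and `triage_sockets` are hypothetical, not part of piker or tractor:

```python
from pathlib import Path


def kernel_state(pid: int) -> dict:
    """Read the wait channel plus State/Threads from /proc (Linux only)."""
    proc = Path("/proc") / str(pid)
    info = {"wchan": (proc / "wchan").read_text().strip()}
    for line in (proc / "status").read_text().splitlines():
        key, _, val = line.partition(":")
        if key in ("State", "Threads"):
            info[key] = val.strip()
    return info


def triage_sockets(ss_output: str, pid: int) -> dict:
    """Classify the `ss -tnp` rows owned by `pid`."""
    report = {"backlogged": [], "close_wait": [], "total": 0}
    for row in ss_output.splitlines():
        # the process column looks like users:(("python",pid=1234,fd=7))
        if f"pid={pid}," not in row:
            continue
        state, recv_q, _send_q, local, peer = row.split()[:5]
        report["total"] += 1
        if int(recv_q) > 0:
            # unread bytes are piling up: the owner stopped consuming
            report["backlogged"].append((local, peer, int(recv_q)))
        if state == "CLOSE-WAIT":
            report["close_wait"].append((local, peer))
    return report
```

Feed it the text of `ss -tnp` (e.g. via `subprocess.run(['ss', '-tnp'], capture_output=True, text=True).stdout`). Deciding which backlogged conn is the parent-IPC one is still a manual read of the peer address.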

Hang-proof test gating

  • per-suite, never combined (cross-suite session state interacts w/ the 2nd-boot wedge): timeout -k 5 300 python -m pytest tests/<one>.py -q
  • rc 124/143 = hang-kill -> retry ONCE before investigating.
  • isolate a flaky test w/ a 3x loop; ~50% hit-rate signatures match the known 2nd-boot wedge (see gotchas.md).
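
The gate-then-retry-once rule above can be wrapped in a small driver. A sketch only: `run_suite` and `gated` are hypothetical names, the hang codes are the ones listed above, and the runner is injectable so the retry logic can be exercised without spawning pytest:

```python
import subprocess
import sys

# exit codes treated as "hang-kill" in the notes above
HANG_RCS = {124, 143}


def run_suite(path: str, limit_s: int = 300, kill_after_s: int = 5) -> int:
    """Run ONE suite under coreutils `timeout`, returning its exit code."""
    cmd = [
        "timeout", "-k", str(kill_after_s), str(limit_s),
        sys.executable, "-m", "pytest", path, "-q",
    ]
    return subprocess.run(cmd).returncode


def gated(path: str, run=run_suite) -> tuple[int, bool]:
    """Return (final rc, whether a hang forced a retry)."""
    rc = run(path)
    if rc in HANG_RCS:
        # retry ONCE; a second hang is worth investigating
        return run(path), True
    return rc, False
```

Call `gated('tests/<one>.py')` once per suite rather than passing several paths, which keeps the per-suite isolation intact.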

Regression vs pre-existing attribution

When a failure appears mid-refactor:

  1. git stash -u (or checkout the file subset) and re-run the EXACT failing case at baseline.
  2. if baseline can't even run, selectively revert ONLY the suspect layer: git diff <files> > /tmp/x.patch; git checkout <files> -> test -> git apply /tmp/x.patch.
  3. flake-rate compare (3x runs) beats single-shot conclusions.
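
The flake-rate comparison can be made mechanical. A rough sketch: `failure_count` and `attribute` are hypothetical helpers, and the three-way verdict is a crude heuristic assumed here for illustration, not a rule from the notes:

```python
def failure_count(run_case, runs: int = 3) -> int:
    """Run the EXACT failing case `runs` times; count non-zero exit codes."""
    return sum(1 for _ in range(runs) if run_case() != 0)


def attribute(baseline_fails: int, patched_fails: int) -> str:
    """Crude verdict from failure counts at baseline vs. with the change."""
    if patched_fails == 0 and baseline_fails == 0:
        return "not reproduced"
    if baseline_fails == 0:
        # only fails with the change applied
        return "regression"
    # baseline fails too, so the change did not introduce it
    return "pre-existing"
```

`run_case` is any zero-arg callable returning an exit code, for instance a lambda around the timeout-gated pytest invocation, run once on the stashed tree and once on the patched one.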

Off-by-one / stale IPC resp detection

Mismatched query->result content in logs (resp payload obviously for a prior request) = shared req/resp channel w/o correlation + a cancelled caller. Grep the endpoint for move_on_after/fail_after around proxied calls. Fix = req-id (mid) tagging, never “just a lock” (cancellation still orphans).
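
The failure mode is easy to reproduce in a toy model. The sketch below is a synchronous simulation, not trio or tractor code: `EchoPeer`, `untagged_call` and `tagged_call` are hypothetical, and the `cancelled` flag stands in for a caller whose move_on_after/fail_after scope fired after sending but before reading:

```python
from collections import deque
from itertools import count


class EchoPeer:
    """Toy peer that answers requests strictly in arrival order."""

    def __init__(self):
        self.outbox = deque()

    def send(self, msg: dict) -> None:
        self.outbox.append(
            {"mid": msg.get("mid"), "result": f"result-for:{msg['query']}"}
        )

    def recv(self) -> dict:
        return self.outbox.popleft()


def untagged_call(chan, query, cancelled=False):
    chan.send({"query": query})
    if cancelled:
        return None  # caller gave up; its resp stays queued (orphaned)
    return chan.recv()["result"]  # may be the orphan from a prior request


_mids = count()


def tagged_call(chan, query, cancelled=False):
    mid = next(_mids)
    chan.send({"mid": mid, "query": query})
    if cancelled:
        return None
    while True:
        resp = chan.recv()
        if resp["mid"] == mid:
            return resp["result"]
        # stale resp for an orphaned request: drop it and keep reading
```

With the untagged variant, a cancelled call for "A" followed by a call for "B" returns the result for "A"; the tagged variant discards the orphan and returns the result for "B". A lock around the untagged call would not change this, since the orphaned resp is still sitting in the queue when the next caller acquires it.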

Logging-chain audits

When records double-print or go bare (see gotchas.md):

import logging
l = logging.getLogger('piker.brokers.ib.broker')
while l:
    print(l.name, l.level, l.handlers, l.propagate)
    l = l.parent

Exactly ONE stderr handler should exist in the chain, attached by the actors daemon fixture.
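
That invariant can be asserted rather than eyeballed. A minimal sketch using only the stdlib: `stderr_handlers_in_chain` is a hypothetical helper, and unlike the print loop above it stops at the first logger with propagate=False, since records do not climb past that point:

```python
import logging
import sys


def stderr_handlers_in_chain(name: str) -> list[tuple[str, logging.Handler]]:
    """Collect (logger name, handler) pairs writing to stderr along the
    path a record emitted on `name` would actually travel."""
    found = []
    log = logging.getLogger(name)
    while log:
        for handler in log.handlers:
            if (
                isinstance(handler, logging.StreamHandler)
                and getattr(handler, "stream", None) is sys.stderr
            ):
                found.append((log.name, handler))
        if not log.propagate:
            break  # records stop climbing here, so stop auditing too
        log = log.parent
    return found
```

A result of length 2 or more explains double-printing; length 0 explains bare records. The identity check against `sys.stderr` assumes nothing has swapped the stream since the handler was created.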

Live actor-tree smoke (headless)

Boot against an ALT registry port so a user's running stack is untouched; put the script in a REAL file (tractor children re-exec __main__ from path, so stdin scripts crash w/ FileNotFoundError: .../<stdin>):

async with maybe_open_pikerd(
    registry_addrs=[('127.0.0.1', 6979)],
):
    async with open_feed(['xbtusdt.kraken']) as feed:
        assert await check_for_service('datad.kraken')
        assert not await check_for_service(
            'brokerd.kraken'
        )

In-proc fail-fast unit checks

Spawn-path guards that raise BEFORE touching the runtime can be tested w/ a bare trio.run() (e.g. spawn_brokerd('kucoin') raising the datad-only error); no pikerd needed.