piker/.claude/skills/piker-conc-expert/debug-recipes.md

# Debug recipes: actor-system forensics

Field-tested techniques for diagnosing hangs, wedges
and cross-actor state bugs WITHOUT a debugger attached
(or when `py-spy` ain't installed).

## Wedged actor triage (no REPL)

1. find the tree:
   `ps -eo pid,etime,args | grep -E 'pytest|tractor._child'`
   — long-`etime` `tractor._child` procs w/ a stuck
   parent = wedge.
2. kernel state:
   `cat /proc/<pid>/wchan` + `status | grep -E
   'State|Threads'` — `do_epoll_wait` + sleeping =
   idle event loop, NOT cpu-spin.
3. **the money read** — socket queues:
   `ss -tnp | grep <pid>`
   - `Recv-Q > 0` on the parent-IPC conn = the actor
     STOPPED CONSUMING its msg loop (runtime bug),
     parent is waiting on it.
   - zero external (api/ws) conns = wedged before/
     without provider IO; don't blame the network.
   - `CLOSE-WAIT` lingerers = unclean peer teardown.
4. cleanup: `pkill -f tractor._child` (NB: in
   compound shell cmds `pkill`'s exit code poisons
   `&&` chains — run it standalone).

## Hang-proof test gating

- per-suite, never combined (cross-suite session
  state interacts w/ the 2nd-boot wedge):
  `timeout -k 5 300 python -m pytest tests/<one>.py -q`
- rc 124/143 = hang-kill -> retry ONCE before
  investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate
  signatures match the known 2nd-boot wedge (see
  gotchas.md).

## Regression vs pre-existing attribution

When a failure appears mid-refactor:
1. `git stash -u` (or checkout the file subset) and
   re-run the EXACT failing case at baseline.
2. if baseline can't even run, selectively revert
   ONLY the suspect layer:
   `git diff <files> > /tmp/x.patch;
    git checkout <files>` -> test ->
   `git apply /tmp/x.patch`.
3. flake-rate compare (3x runs) beats single-shot
   conclusions.

## Off-by-one / stale IPC resp detection

Mismatched query->result content in logs (resp
payload obviously for a prior request) = shared
req/resp channel w/o correlation + a cancelled
caller. Grep the ep for `move_on_after`/`fail_after`
around proxied calls. Fix = req-id (`mid`) tagging,
never "just a lock" (cancellation still orphans).

## Logging-chain audits

When records double-print or go bare (see gotchas.md):

```python
import logging
l = logging.getLogger('piker.brokers.ib.broker')
while l:
    print(l.name, l.level, l.handlers, l.propagate)
    l = l.parent
```

Exactly ONE stderr handler should exist in the chain,
attached by the actor's daemon fixture.

## Live actor-tree smoke (headless)

Boot against an ALT registry port so a user's running
stack is untouched; script in a REAL file (tractor
children re-exec `__main__` from path — stdin scripts
crash w/ `FileNotFoundError: .../<stdin>`):

```python
async with maybe_open_pikerd(
    registry_addrs=[('127.0.0.1', 6979)],
):
    async with open_feed(['xbtusdt.kraken']) as feed:
        assert await check_for_service('datad.kraken')
        assert not await check_for_service(
            'brokerd.kraken'
        )
```

## In-proc fail-fast unit checks

Spawn-path guards that raise BEFORE touching the
runtime can be tested w/ a bare `trio.run()` (eg
`spawn_brokerd('kucoin')` raising the datad-only
error) — no pikerd needed.