# Debug recipes: actor-system forensics

Field-tested techniques for diagnosing hangs, wedges
and cross-actor state bugs WITHOUT a debugger attached
(or when `py-spy` ain't installed).

## Wedged actor triage (no REPL)

1. find the tree:
   `ps -eo pid,etime,args | grep -E 'pytest|tractor._child'`
   (long-`etime` `tractor._child` procs w/ a stuck
   parent = wedge).
2. kernel state:
   `cat /proc/<pid>/wchan` +
   `grep -E 'State|Threads' /proc/<pid>/status`;
   `do_epoll_wait` + sleeping = idle event loop,
   NOT cpu-spin.
3. **the money read**, socket queues:
   `ss -tnp | grep <pid>`
   - `Recv-Q > 0` on the parent-IPC conn = the actor
     STOPPED CONSUMING its msg loop (runtime bug),
     parent is waiting on it.
   - zero external (api/ws) conns = wedged before/
     without provider IO; don't blame the network.
   - `CLOSE-WAIT` lingerers = unclean peer teardown.
4. cleanup: `pkill -f tractor._child` (NB: `pkill`
   exits non-zero when nothing matched, which poisons
   `&&` chains in compound shell cmds, so run it
   standalone).

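The socket-queue read in step 3 can be scripted. Below is a minimal sketch, assuming the usual `ss -tn` column order (`State Recv-Q Send-Q Local Peer ...`), that the IPC conns are on loopback, and that you know the registry/IPC port. `parse_ss()`, `triage()` and the `SAMPLE` rows are hypothetical, hand-written for illustration; they are not captured output and not part of tractor.

```python
from dataclasses import dataclass

# ASSUMPTION: loopback prefixes as printed by `ss -n` (IPv4 and IPv6).
LOOPBACK = ('127.', '[::1]')


@dataclass
class Conn:
    state: str
    recv_q: int
    send_q: int
    local: str
    peer: str


def parse_ss(text: str) -> list[Conn]:
    '''Parse `ss -tn[p]` rows; the header and short lines are skipped.'''
    conns = []
    for line in text.splitlines():
        cols = line.split()
        # the header row has 'Recv-Q' (not a number) in column 2
        if len(cols) < 5 or not cols[1].isdigit():
            continue
        conns.append(
            Conn(cols[0], int(cols[1]), int(cols[2]), cols[3], cols[4])
        )
    return conns


def triage(conns: list[Conn], ipc_port: int) -> list[str]:
    '''Apply the three reads from step 3 and return human-readable findings.'''
    findings = []
    suffix = f':{ipc_port}'
    for c in conns:
        on_ipc = c.local.endswith(suffix) or c.peer.endswith(suffix)
        if on_ipc and c.recv_q > 0:
            findings.append(f'unread IPC bytes ({c.recv_q}): {c.peer} -> {c.local}')
        if c.state == 'CLOSE-WAIT':
            findings.append(f'CLOSE-WAIT lingerer: {c.local} <-> {c.peer}')
    if not [c for c in conns if not c.peer.startswith(LOOPBACK)]:
        findings.append('no external conns: wedged before/without provider IO')
    return findings


# Hand-written sample rows (NOT real output), pid and ports are made up.
SAMPLE = '''\
State      Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB      4096   0      127.0.0.1:51234    127.0.0.1:6979    users:(("python",pid=4242,fd=9))
CLOSE-WAIT 1      0      127.0.0.1:51300    127.0.0.1:6979    users:(("python",pid=4242,fd=11))
'''

if __name__ == '__main__':
    for finding in triage(parse_ss(SAMPLE), ipc_port=6979):
        print(finding)
```

Feed it real data w/ `ss -tnp | grep <pid>` piped into a file; the parser only looks at the first five columns so the trailing `users:(...)` field is ignored.
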
## Hang-proof test gating

- per-suite, never combined (cross-suite session
  state interacts w/ the 2nd-boot wedge):
  `timeout -k 5 300 python -m pytest tests/<one>.py -q`
- rc 124/143 = hang-kill -> retry ONCE before
  investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate
  signatures match the known 2nd-boot wedge (see
  gotchas.md).

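The retry-once rule is easy to wrap. A minimal sketch, assuming GNU `timeout` is on `PATH`; `run_suite()` and `retry_once_on_hang()` are hypothetical helper names, and the hang-kill set is just the two codes named above (GNU `timeout` can also exit 137 when the `-k` SIGKILL fires, add it if you see it).

```python
import subprocess

# rc values the notes above treat as a hang-kill
HANG_RCS = frozenset({124, 143})


def run_suite(test_file: str, limit_s: int = 300, kill_after_s: int = 5) -> int:
    '''Run ONE suite under `timeout` and return its exit code.'''
    cmd = [
        'timeout', '-k', str(kill_after_s), str(limit_s),
        'python', '-m', 'pytest', test_file, '-q',
    ]
    return subprocess.run(cmd).returncode


def retry_once_on_hang(run_once, hang_rcs=HANG_RCS) -> list[int]:
    '''
    Call `run_once()` and, only if its rc signals a hang-kill,
    call it exactly one more time. Returns every rc observed.
    '''
    rcs = [run_once()]
    if rcs[0] in hang_rcs:
        rcs.append(run_once())
    return rcs
```

Usage: `retry_once_on_hang(lambda: run_suite('tests/<one>.py'))`; two hang codes in the returned list means it's time to investigate, a plain non-zero rc is an ordinary test failure and is never retried.
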
## Regression vs pre-existing attribution

When a failure appears mid-refactor:

1. `git stash -u` (or checkout the file subset) and
   re-run the EXACT failing case at baseline.
2. if baseline can't even run, selectively revert
   ONLY the suspect layer:
   `git diff <files> > /tmp/x.patch; git checkout <files>`
   -> test -> `git apply /tmp/x.patch`.
3. flake-rate compare (3x runs) beats single-shot
   conclusions.

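Step 3 can be made mechanical. A minimal sketch w/ hypothetical helpers `flake_rate()` and `attribute()`; the verdict strings and the "any baseline failure = pre-existing" rule are assumptions for illustration, and 3 runs is a smoke signal, not statistics.

```python
def flake_rate(run_once, runs: int = 3) -> float:
    '''Fraction of `runs` invocations of `run_once()` that returned non-zero.'''
    failures = sum(1 for _ in range(runs) if run_once() != 0)
    return failures / runs


def attribute(baseline_rate: float, patched_rate: float) -> str:
    '''Coarse verdict from the two flake rates (ASSUMED rule of thumb).'''
    if baseline_rate > 0:
        # the failure reproduces without the refactor applied
        return 'pre-existing'
    if patched_rate > 0:
        return 'regression candidate'
    return 'not reproduced'
```

`run_once` is any zero-arg callable returning an exit code: run it once against the stashed/reverted tree and once against the patched tree, then pass both rates to `attribute()`.
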
## Off-by-one / stale IPC resp detection

Mismatched query->result content in logs (resp
payload obviously for a prior request) = shared
req/resp channel w/o correlation + a cancelled
caller. Grep the endpoint for `move_on_after`/`fail_after`
around proxied calls. Fix = req-id (`mid`) tagging,
never "just a lock" (a cancelled caller still orphans
its in-flight resp, which the next caller then reads).

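A toy model of the failure and the fix, stdlib only. Everything here (`Chan`, `call_untagged()`, `call_tagged()`) is hypothetical and synchronous; it is NOT tractor's IPC API, just the smallest thing that reproduces the off-by-one and shows why `mid` tagging cures it.

```python
from collections import deque
from itertools import count


class Chan:
    '''Toy shared pipe: the "server" answers every request, in order.'''
    def __init__(self):
        self.resps = deque()

    def send(self, query, mid=None):
        # the far end echoes back the request id it was given
        self.resps.append({'mid': mid, 'result': f'result-for-{query}'})

    def recv(self):
        return self.resps.popleft()


def call_untagged(chan, query, cancelled=False):
    chan.send(query)
    if cancelled:
        # eg. a timeout scope expired: the resp stays queued (orphaned)
        return None
    return chan.recv()['result']


_mids = count()


def call_tagged(chan, query, cancelled=False):
    mid = next(_mids)
    chan.send(query, mid=mid)
    if cancelled:
        return None
    while True:
        resp = chan.recv()
        if resp['mid'] == mid:
            return resp['result']
        # stale resp belonging to a cancelled caller: drop it


if __name__ == '__main__':
    chan = Chan()
    call_untagged(chan, 'a', cancelled=True)
    print(call_untagged(chan, 'b'))  # stale: the resp for 'a'

    chan = Chan()
    call_tagged(chan, 'a', cancelled=True)
    print(call_tagged(chan, 'b'))
```

A lock around `call_untagged()` would not change the first outcome: the cancelled caller releases the lock w/ its resp still sitting in the queue.
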
## Logging-chain audits

When records double-print or go bare (see gotchas.md):

```python
import logging

# walk from the leaf logger up to the root, printing
# each level's handlers and `propagate` flag.
log = logging.getLogger('piker.brokers.ib.broker')
while log:
    print(log.name, log.level, log.handlers, log.propagate)
    log = log.parent
```

Exactly ONE stderr handler should exist in the chain,
attached by the actor's daemon fixture.

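The "exactly ONE" rule can be asserted instead of eyeballed. A minimal sketch; `stderr_handlers()` is a hypothetical helper, and it assumes the handler was built against the current `sys.stderr` object (a handler holding an older/captured stream will not be counted).

```python
import logging
import sys


def stderr_handlers(name: str) -> list[tuple[str, logging.Handler]]:
    '''
    Collect `(logger_name, handler)` for every `StreamHandler` writing
    to `sys.stderr` that a record emitted on `name` would reach.
    '''
    found = []
    log = logging.getLogger(name)
    while log:
        for h in log.handlers:
            if (
                isinstance(h, logging.StreamHandler)
                and getattr(h, 'stream', None) is sys.stderr
            ):
                found.append((log.name, h))
        if not log.propagate:
            break  # records stop climbing here
        log = log.parent
    return found
```

Usage: `assert len(stderr_handlers('piker.brokers.ib.broker')) == 1`; a count of 2+ explains double-printing and 0 explains bare records.
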
## Live actor-tree smoke (headless)

Boot against an ALT registry port so a user's running
stack is untouched; script in a REAL file (tractor
children re-exec `__main__` from path, so stdin scripts
crash w/ `FileNotFoundError: .../<stdin>`):

```python
import trio

# `maybe_open_pikerd`, `open_feed` and `check_for_service`
# are piker helpers; import them from your checkout.
async def main():
    async with maybe_open_pikerd(
        registry_addrs=[('127.0.0.1', 6979)],
    ):
        async with open_feed(['xbtusdt.kraken']) as feed:
            assert await check_for_service('datad.kraken')
            assert not await check_for_service(
                'brokerd.kraken'
            )

if __name__ == '__main__':
    trio.run(main)
```

## In-proc fail-fast unit checks

Spawn-path guards that raise BEFORE touching the
runtime can be tested w/ a bare `trio.run()` (eg
`spawn_brokerd('kucoin')` raising the datad-only
error), no pikerd needed.