diff --git a/.claude/skills/conc-anal/SKILL.md b/.claude/skills/conc-anal/SKILL.md
index 4f498b7c..fa121bb2 100644
--- a/.claude/skills/conc-anal/SKILL.md
+++ b/.claude/skills/conc-anal/SKILL.md
@@ -229,3 +229,69 @@
Unlike asyncio, trio allows checkpoints in cleanup code:
anything that does `await` can itself be cancelled (e.g. by
nursery shutdown). Watch for cleanup code that assumes it
will run to completion.

### Unbounded waits in cleanup paths

Any `await X.wait()` in a teardown path is a latent deadlock
unless the event's setter is GUARANTEED to fire. If the
setter depends on external state (peer disconnects,
child-process exit, another task's completion) that itself
depends on the current task's progress, you have a mutual
wait.

Rule: **bound every `await X.wait()` in cleanup paths with
`trio.move_on_after()`** unless you can prove the setter is
unconditionally reachable from the state at the await site.
Concrete recent example: `ipc_server.wait_for_no_more_peers()`
in `async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`,
"probe iteration 3") — it was unbounded, and when one
peer-handler was stuck the wait-for-no-more-peers event never
fired, deadlocking the whole actor-tree teardown cascade.

### The capture-pipe-fill hang pattern (grep this first)

When investigating any hang in the test suite, **especially
under fork-based backends**, first check whether the hang
reproduces under `pytest -s` (`--capture=no`). If `-s` makes
it go away, you're not looking at a trio concurrency bug —
you're looking at a Linux pipe-buffer fill.

Mechanism: pytest replaces fds 1 and 2 with pipe write-ends.
Fork-child subactors inherit those fds. High-volume error-log
tracebacks (cancel-cascade spew) fill the 64KB pipe buffer.
The child's `write()` blocks. The child can't exit. The
parent's `waitpid`/pidfd wait blocks. The deadlock cascades
up the tree.
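The mechanism above can be demonstrated with a stdlib-only sketch that measures how many bytes fit in a pipe before a blocking `write()` would stall (the exact capacity is kernel-dependent; this is an illustration, not part of the diff):

```python
import os
import fcntl

# Create a pipe and make the write end non-blocking so we can
# count how many bytes fit before a blocking write() would hang.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
chunk = b"x" * 4096
try:
    while True:
        written += os.write(w, chunk)
except BlockingIOError:
    # The pipe is now full: a *blocking* writer (e.g. a subactor
    # spewing tracebacks) would be stuck right here until a reader
    # drains the pipe.
    pass

print(f"pipe capacity before write() blocks: ~{written} bytes")
os.close(r)
os.close(w)
```

On a default Linux kernel this reports roughly 64KB (see `pipe(7)`); one multi-actor cancel-cascade traceback spew clears that easily.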
Pre-existing guards in `tests/conftest.py` encode this
pipe-fill knowledge — grep these BEFORE blaming concurrency:

```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
    # XXX: too much logging will lock up the subproc (smh)
    loglevel: str = 'info'

# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```

Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
for the canonical reproduction. It cost several investigation
sessions before being caught, because the capture-pipe symptom
was masked by deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate pipe-filling log
volume → the capture-pipe hang finally surfaced. Grep-note for
future-self: **if a multi-subproc tractor test hangs,
`pytest -s` first, conc-anal second.**

diff --git a/.claude/skills/run-tests/SKILL.md b/.claude/skills/run-tests/SKILL.md
index 3849ffe1..1c047332 100644
--- a/.claude/skills/run-tests/SKILL.md
+++ b/.claude/skills/run-tests/SKILL.md
@@ -451,3 +451,73 @@
by your changes — note them and move on.

**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky — skip it.

## 9. The pytest-capture hang pattern (CHECK THIS FIRST)

**Symptom:** a tractor test hangs indefinitely under default
`pytest` but passes instantly when you add `-s`
(`--capture=no`).

**Cause:** tractor subactors (especially under fork-based
backends) inherit pytest's stdout/stderr capture pipes via
fds 1 and 2. Under high-volume error logging (e.g. a
multi-level cancel cascade, nested `run_in_actor` failures,
anything triggering `RemoteActorError` + `ExceptionGroup`
traceback spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → the subactor
can't finish exiting → the parent's `waitpid`/pidfd wait
blocks → the deadlock cascades up the tree.
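The exact wait-blocks-while-child-blocks-on-write shape is reproducible with stdlib `subprocess` alone (an illustrative aside, not tractor code): the child writes past the pipe capacity, nobody drains the pipe, and the parent's `wait()` hangs until someone reads.

```python
import subprocess
import sys

# A child that writes far past the ~64KB pipe-buffer capacity.
child_code = "import sys; sys.stdout.write('x' * (1 << 20))"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
)

# Waiting WITHOUT draining the pipe is the deadlock shape: the
# child is blocked in write(), so wait() would block forever.
try:
    proc.wait(timeout=2)
    deadlocked = False
except subprocess.TimeoutExpired:
    deadlocked = True

# Draining the pipe unblocks the child immediately (this is in
# effect what disabling capture with `-s` spares you from).
out, _ = proc.communicate()
print(deadlocked, len(out))  # → True 1048576
```

The stdlib docs warn about this same trap under `Popen.wait`: "This will deadlock when using stdout=PIPE ... and the child process generates enough output to a pipe".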
**Pre-existing guards in the tractor harness** that encode
this same knowledge — grep these FIRST before spelunking:

- `tests/conftest.py:258-260` (in the `daemon` fixture):
  `# XXX: too much logging will lock up the subproc (smh)`
  — downgrades `trace`/`debug` loglevel to `info` to prevent
  the hang.
- `tests/conftest.py:316`: `# can lock up on the
  _io.BufferedReader and hang..` — noted on the
  `proc.stderr.read()` post-SIGINT.

**Debug recipe (in priority order):**

1. **Try `-s` first.** If the hang disappears with
   `pytest -s`, you've confirmed it's capture-pipe fill.
   Skip the spelunking.
2. **Lower the loglevel.** The default is `--ll=error` on
   this project; if you've bumped it to `debug`/`info`, try
   dropping back. Each extra log level multiplies pipe
   pressure under fault cascades.
3. **If you MUST use default capture plus high log volume**,
   redirect subactor stdout/stderr in the child prelude
   (e.g. `tractor.spawn._subint_forkserver._child_target`
   post-`_close_inherited_fds`) to `/dev/null` or a file.

**Signatures that tell you it's THIS bug (vs. a real code
hang):**

- Multi-actor test under a fork-based backend
  (`subint_forkserver`; eventually `trio_proc` too under
  enough log volume).
- Multiple `RemoteActorError`/`ExceptionGroup` tracebacks in
  the error path.
- The test passes with `-s` in the 5-10s range, but hangs
  past pytest-timeout (usually 30+s) without `-s`.
- Subactor processes are visible via `pgrep -af
  subint-forkserv` or similar after the hang — they're alive
  but blocked on `write()` to an inherited stdout fd.

**Historical reference:** this deadlock cost a multi-session
investigation (4 genuine cascade fixes landed along the way)
that only surfaced the capture-pipe issue AFTER the deeper
fixes let the tree actually tear down enough to produce
pipe-filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the workaround
before digging.