Codify capture-pipe hang lesson in skills
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.
Deats:
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards
    (`tests/conftest.py:258`, `:316`), cross-ref
    to the full post-mortem doc, and the
    grep-note: "if a multi-subproc tractor test
    hangs, `pytest -s` first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference
Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent eceed29d4a
commit 4106ba73ea

.claude/skills/conc-anal/SKILL.md
@@ -229,3 +229,69 @@ Unlike asyncio, trio allows checkpoints in
that does `await` can itself be cancelled (e.g.
by nursery shutdown). Watch for cleanup code that
assumes it will run to completion.

### Unbounded waits in cleanup paths

Any `await <event>.wait()` in a teardown path is
a latent deadlock unless the event's setter is
GUARANTEED to fire. If the setter depends on
external state (peer disconnects, child process
exit, subsequent task completion) that itself
depends on the current task's progress, you have
a mutual wait.

Rule: **bound every `await X.wait()` in cleanup
paths with `trio.move_on_after()`** unless you
can prove the setter is unconditionally reachable
from the state at the await site. Concrete recent
example: `ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
"probe iteration 3") — it was unbounded, and when
one peer-handler was stuck the wait-for-no-more-
peers event never fired, deadlocking the whole
actor-tree teardown cascade.

### The capture-pipe-fill hang pattern (grep this first)

When investigating any hang in the test suite,
**especially under fork-based backends**, first
check whether the hang reproduces under `pytest
-s` (`--capture=no`). If `-s` makes it go away,
you're not looking at a trio concurrency bug —
you're looking at a Linux pipe-buffer fill.

Mechanism: pytest replaces fds 1,2 with pipe
write-ends. Fork-child subactors inherit those
fds. High-volume error-log tracebacks (cancel
cascade spew) fill the 64KB pipe buffer. Child
`write()` blocks. Child can't exit. Parent's
`waitpid`/pidfd wait blocks. Deadlock cascades up
the tree.

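The 64KB figure is easy to verify from stdlib alone — make a pipe's write end non-blocking and count how many bytes fit before `write()` would block (numbers assume default Linux pipe capacity):

```python
import os
import fcntl

r, w = os.pipe()
# make the write end non-blocking so this probe can't itself deadlock
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

total = 0
try:
    while True:
        total += os.write(w, b'x' * 4096)
except BlockingIOError:
    # buffer full: this is exactly where a captured subactor wedges,
    # since its inherited fd 1/2 is a *blocking* pipe write-end
    pass
print(total)  # 65536 on default Linux
os.close(r)
os.close(w)
```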
Pre-existing guards in `tests/conftest.py` encode
this knowledge — grep these BEFORE blaming
concurrency:

```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
    # XXX: too much logging will lock up the subproc (smh)
    loglevel: str = 'info'

# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```

Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
for the canonical reproduction. Cost several
investigation sessions before catching it —
because the capture-pipe symptom was masked by
deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume → capture-pipe finally
surfaced. Grep-note for future-self: **if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second.**

.claude/skills/run-tests/SKILL.md
@@ -451,3 +451,73 @@ by your changes — note them and move on.

**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky — skip it.

## 9. The pytest-capture hang pattern (CHECK THIS FIRST)

**Symptom:** a tractor test hangs indefinitely under
default `pytest` but passes instantly when you add
`-s` (`--capture=no`).

**Cause:** tractor subactors (especially under
fork-based backends) inherit pytest's stdout/stderr
capture pipes via fds 1,2. Under high-volume error
logging (e.g. multi-level cancel cascade, nested
`run_in_actor` failures, anything triggering
`RemoteActorError` + `ExceptionGroup` traceback
spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → can't
finish exit → parent's `waitpid`/pidfd wait blocks →
deadlock cascades up the tree.

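The cause above reproduces in a dozen lines of stdlib (Linux/macOS only — uses `os.fork`); the child stays alive but wedged in `write()` exactly as described, until someone drains the pipe:

```python
import os
import time

r, w = os.pipe()           # stands in for pytest's capture pipe
pid = os.fork()
if pid == 0:               # "subactor": inherits the write end
    os.close(r)
    os.write(w, b'x' * (1 << 20))  # blocks once the 64KB buffer fills
    os._exit(0)

os.close(w)
time.sleep(0.5)            # give the child time to wedge
reaped, _ = os.waitpid(pid, os.WNOHANG)
print(reaped == 0)         # True: child alive, blocked on write()

# draining the read end (what `-s` avoids ever needing) frees it
while os.read(r, 65536):
    pass
os.waitpid(pid, 0)
```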
**Pre-existing guards in the tractor harness** that
encode this same knowledge — grep these FIRST
before spelunking:

- `tests/conftest.py:258-260` (in the `daemon`
  fixture): `# XXX: too much logging will lock up
  the subproc (smh)` — downgrades `trace`/`debug`
  loglevel to `info` to prevent the hang.
- `tests/conftest.py:316`: `# can lock up on the
  _io.BufferedReader and hang..` — noted on the
  `proc.stderr.read()` post-SIGINT.

**Debug recipe (in priority order):**

1. **Try `-s` first.** If the hang disappears with
   `pytest -s`, you've confirmed it's capture-pipe
   fill. Skip spelunking.
2. **Lower the loglevel.** Default `--ll=error` on
   this project; if you've bumped it to `debug` /
   `info`, try dropping back. Each log level
   multiplies pipe-pressure under fault cascades.
3. **If you MUST use default capture + high log
   volume**, redirect subactor stdout/stderr in the
   child prelude (e.g.
   `tractor.spawn._subint_forkserver._child_target`
   post-`_close_inherited_fds`) to `/dev/null` or a
   file.

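The step-3 redirect amounts to a couple of `dup2()` calls in the child prelude. A hedged stdlib sketch (the helper name is made up; where to call it is the child-prelude hook named in step 3):

```python
import os

def _silence_child_stdio() -> None:
    # hypothetical helper: point fds 1 and 2 at /dev/null so a
    # chatty subactor can never fill pytest's capture pipes.
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)  # stdout
    os.dup2(devnull, 2)  # stderr
    os.close(devnull)
```

Redirecting to a file instead of `/dev/null` keeps the log spew inspectable after the run without any pipe-pressure.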
**Signature tells you it's THIS bug (vs. a real
code hang):**

- Multi-actor test under fork-based backend
  (`subint_forkserver`, eventually `trio_proc` too
  under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
  tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
  past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via `pgrep -af
  subint-forkserv` or similar after the hang —
  they're alive but blocked on `write()` to an
  inherited stdout fd.

**Historical reference:** this deadlock cost a
multi-session investigation (4 genuine cascade
fixes landed along the way) that only surfaced the
capture-pipe issue AFTER the deeper fixes let the
tree actually tear down enough to produce
pipe-filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the
workaround before digging.