Codify capture-pipe hang lesson in skills

Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.

Deats,
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards
    (`tests/conftest.py:258`, `:316`), cross-ref
    to the full
    post-mortem doc, and the grep-note: "if a
    multi-subproc tractor test hangs, `pytest -s`
    first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference

Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-23 23:22:40 -04:00
parent eceed29d4a
commit 4106ba73ea
2 changed files with 136 additions and 0 deletions


@@ -229,3 +229,69 @@ Unlike asyncio, trio allows checkpoints in
that does `await` can itself be cancelled (e.g.
by nursery shutdown). Watch for cleanup code that
assumes it will run to completion.
### Unbounded waits in cleanup paths
Any `await <event>.wait()` in a teardown path is
a latent deadlock unless the event's setter is
GUARANTEED to fire. If the setter depends on
external state (peer disconnects, child process
exit, subsequent task completion) that itself
depends on the current task's progress, you have
a mutual wait.
Rule: **bound every `await X.wait()` in cleanup
paths with `trio.move_on_after()`** unless you
can prove the setter is unconditionally reachable
from the state at the await site. Concrete recent
example: `ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
"probe iteration 3") — it was unbounded, and when
one peer-handler was stuck the wait-for-no-more-
peers event never fired, deadlocking the whole
actor-tree teardown cascade.
### The capture-pipe-fill hang pattern (grep this first)
When investigating any hang in the test suite
**especially under fork-based backends**, first
check whether the hang reproduces under `pytest
-s` (`--capture=no`). If `-s` makes it go away
you're not looking at a trio concurrency bug —
you're looking at a Linux pipe-buffer fill.
Mechanism: pytest replaces fds 1,2 with pipe
write-ends. Fork-child subactors inherit those
fds. High-volume error-log tracebacks (cancel
cascade spew) fill the 64KB pipe buffer. Child
`write()` blocks. Child can't exit. Parent's
`waitpid`/pidfd wait blocks. Deadlock cascades up
the tree.
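The mechanism can be demonstrated with plain stdlib (illustrative, not tractor code): fill a pipe's buffer with non-blocking writes and observe the point where a blocking writer would hang.

```python
# Stdlib-only demo of the mechanism: a Linux pipe accepts roughly
# 64KB before a write would block. Capacity varies by kernel config;
# this is illustrative, not tractor code.
import os
import fcntl

r, w = os.pipe()
# Non-blocking write end so we can measure capacity instead of
# hanging the way the fork-child does.
fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)

written = 0
try:
    while True:
        written += os.write(w, b'x' * 4096)
except BlockingIOError:
    pass  # buffer full: a blocking writer would now hang right here

print(f'pipe filled after {written} bytes')  # typically 65536 on Linux
os.close(r)
os.close(w)
```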
Pre-existing guards in `tests/conftest.py` encode
this knowledge — grep these BEFORE blaming
concurrency:
```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
    # XXX: too much logging will lock up the subproc (smh)
    loglevel: str = 'info'

# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```
Full post-mortem: see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
for the canonical reproduction. Cost several
investigation sessions before catching it —
because the capture-pipe symptom was masked by
deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume → capture-pipe finally
surfaced. Grep-note for future-self: **if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second.**


@@ -451,3 +451,73 @@ by your changes — note them and move on.
**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky — skip it.
## 9. The pytest-capture hang pattern (CHECK THIS FIRST)
**Symptom:** a tractor test hangs indefinitely under
default `pytest` but passes instantly when you add
`-s` (`--capture=no`).
**Cause:** tractor subactors (especially under fork-
based backends) inherit pytest's stdout/stderr
capture pipes via fds 1,2. Under high-volume error
logging (e.g. multi-level cancel cascade, nested
`run_in_actor` failures, anything triggering
`RemoteActorError` + `ExceptionGroup` traceback
spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → can't
finish exit → parent's `waitpid`/pidfd wait blocks →
deadlock cascades up the tree.
**Pre-existing guards in the tractor harness** that
encode this same knowledge — grep these FIRST
before spelunking:
- `tests/conftest.py:258-260` (in the `daemon`
fixture): `# XXX: too much logging will lock up
the subproc (smh)` — downgrades `trace`/`debug`
loglevel to `info` to prevent the hang.
- `tests/conftest.py:316`: `# can lock up on the
_io.BufferedReader and hang..` — noted on the
`proc.stderr.read()` post-SIGINT.
**Debug recipe (in priority order):**
1. **Try `-s` first.** If the hang disappears with
`pytest -s`, you've confirmed it's capture-pipe
fill. Skip spelunking.
2. **Lower the loglevel.** Default `--ll=error` on
this project; if you've bumped it to `debug` /
`info`, try dropping back. Each log level
multiplies pipe-pressure under fault cascades.
3. **If you MUST use default capture + high log
volume**, redirect subactor stdout/stderr in the
child prelude (e.g.
`tractor.spawn._subint_forkserver._child_target`
post-`_close_inherited_fds`) to `/dev/null` or a
file.
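A generic stdlib sketch of step 3 (exactly where to hook this into tractor's child prelude is an assumption; the fd dance itself is standard POSIX):

```python
# Generic sketch: point fds 1/2 at /dev/null so a noisy child can
# never block on an inherited capture pipe. Plain stdlib; the call
# site in tractor's child prelude is up to you.
import os
import sys

def silence_std_fds() -> None:
    devnull = os.open(os.devnull, os.O_WRONLY)
    # Flush buffered writers before yanking the fds out from under them.
    sys.stdout.flush()
    sys.stderr.flush()
    os.dup2(devnull, 1)  # fd 1: stdout
    os.dup2(devnull, 2)  # fd 2: stderr
    os.close(devnull)
```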
**Signature that tells you it's THIS bug (vs. a real
code hang):**
- Multi-actor test under fork-based backend
(`subint_forkserver`, eventually `trio_proc` too
under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via `pgrep -af
subint-forkserv` or similar after the hang —
they're alive but blocked on `write()` to an
inherited stdout fd.
**Historical reference:** this deadlock cost a
multi-session investigation (4 genuine cascade
fixes landed along the way) that only surfaced the
capture-pipe issue AFTER the deeper fixes let the
tree actually tear down enough to produce pipe-
filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the
workaround before digging.