Codify capture-pipe hang lesson in skills
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.
Deats:
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards
    (`tests/conftest.py:258`, `:316`), cross-ref
    to the full post-mortem doc, and the
    grep-note: "if a multi-subproc tractor test
    hangs, `pytest -s` first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference
Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent eceed29d4a
commit 4106ba73ea

.claude/skills/conc-anal/SKILL.md
@@ -229,3 +229,69 @@ Unlike asyncio, trio allows checkpoints in
that does `await` can itself be cancelled (e.g.
by nursery shutdown). Watch for cleanup code that
assumes it will run to completion.

### Unbounded waits in cleanup paths

Any `await <event>.wait()` in a teardown path is
a latent deadlock unless the event's setter is
GUARANTEED to fire. If the setter depends on
external state (peer disconnects, child process
exit, subsequent task completion) that itself
depends on the current task's progress, you have
a mutual wait.

Rule: **bound every `await X.wait()` in cleanup
paths with `trio.move_on_after()`** unless you
can prove the setter is unconditionally reachable
from the state at the await site. Concrete recent
example: `ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
"probe iteration 3") — it was unbounded, and when
one peer-handler was stuck the wait-for-no-more-
peers event never fired, deadlocking the whole
actor-tree teardown cascade.

### The capture-pipe-fill hang pattern (grep this first)

When investigating any hang in the test suite,
**especially under fork-based backends**, first
check whether the hang reproduces under `pytest
-s` (`--capture=no`). If `-s` makes it go away,
you're not looking at a trio concurrency bug —
you're looking at a Linux pipe-buffer fill.

Mechanism: pytest replaces fds 1,2 with pipe
write-ends. Fork-child subactors inherit those
fds. High-volume error-log tracebacks (cancel
cascade spew) fill the 64KB pipe buffer. Child
`write()` blocks. Child can't exit. Parent's
`waitpid`/pidfd wait blocks. Deadlock cascades up
the tree.

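The 64KB figure is easy to verify from stdlib alone — make a pipe's write end non-blocking and count how many bytes fit before `write()` would block (numbers assume default Linux pipe capacity):

```python
import os
import fcntl

r, w = os.pipe()
# make the write end non-blocking so this probe can't itself deadlock
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

total = 0
try:
    while True:
        total += os.write(w, b'x' * 4096)
except BlockingIOError:
    # buffer full: this is exactly where a captured subactor wedges,
    # since its inherited fd 1/2 is a *blocking* pipe write-end
    pass
print(total)  # 65536 on default Linux
os.close(r)
os.close(w)
```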
Pre-existing guards in `tests/conftest.py` encode
this knowledge — grep these BEFORE blaming
concurrency:

```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
    # XXX: too much logging will lock up the subproc (smh)
    loglevel: str = 'info'

# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```

Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
for the canonical reproduction. Cost several
investigation sessions before catching it —
because the capture-pipe symptom was masked by
deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume → capture-pipe finally
surfaced. Grep-note for future-self: **if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second.**

.claude/skills/run-tests/SKILL.md
@@ -451,3 +451,73 @@ by your changes — note them and move on.

**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky — skip it.

## 9. The pytest-capture hang pattern (CHECK THIS FIRST)

**Symptom:** a tractor test hangs indefinitely under
default `pytest` but passes instantly when you add
`-s` (`--capture=no`).

**Cause:** tractor subactors (especially under
fork-based backends) inherit pytest's stdout/stderr
capture pipes via fds 1,2. Under high-volume error
logging (e.g. multi-level cancel cascade, nested
`run_in_actor` failures, anything triggering
`RemoteActorError` + `ExceptionGroup` traceback
spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → can't
finish exit → parent's `waitpid`/pidfd wait blocks →
deadlock cascades up the tree.

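The cause above reproduces in a dozen lines of stdlib (Linux/macOS only — uses `os.fork`); the child stays alive but wedged in `write()` exactly as described, until someone drains the pipe:

```python
import os
import time

r, w = os.pipe()           # stands in for pytest's capture pipe
pid = os.fork()
if pid == 0:               # "subactor": inherits the write end
    os.close(r)
    os.write(w, b'x' * (1 << 20))  # blocks once the 64KB buffer fills
    os._exit(0)

os.close(w)
time.sleep(0.5)            # give the child time to wedge
reaped, _ = os.waitpid(pid, os.WNOHANG)
print(reaped == 0)         # True: child alive, blocked on write()

# draining the read end (what `-s` avoids ever needing) frees it
while os.read(r, 65536):
    pass
os.waitpid(pid, 0)
```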
**Pre-existing guards in the tractor harness** that
encode this same knowledge — grep these FIRST
before spelunking:

- `tests/conftest.py:258-260` (in the `daemon`
  fixture): `# XXX: too much logging will lock up
  the subproc (smh)` — downgrades `trace`/`debug`
  loglevel to `info` to prevent the hang.
- `tests/conftest.py:316`: `# can lock up on the
  _io.BufferedReader and hang..` — noted on the
  `proc.stderr.read()` post-SIGINT.

**Debug recipe (in priority order):**

1. **Try `-s` first.** If the hang disappears with
   `pytest -s`, you've confirmed it's capture-pipe
   fill. Skip spelunking.
2. **Lower the loglevel.** Default `--ll=error` on
   this project; if you've bumped it to `debug` /
   `info`, try dropping back. Each log level
   multiplies pipe-pressure under fault cascades.
3. **If you MUST use default capture + high log
   volume**, redirect subactor stdout/stderr in the
   child prelude (e.g.
   `tractor.spawn._subint_forkserver._child_target`
   post-`_close_inherited_fds`) to `/dev/null` or a
   file.

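The step-3 redirect amounts to a couple of `dup2()` calls in the child prelude. A hedged stdlib sketch (the helper name is made up; where to call it is the child-prelude hook named in step 3):

```python
import os

def _silence_child_stdio() -> None:
    # hypothetical helper: point fds 1 and 2 at /dev/null so a
    # chatty subactor can never fill pytest's capture pipes.
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)  # stdout
    os.dup2(devnull, 2)  # stderr
    os.close(devnull)
```

Redirecting to a file instead of `/dev/null` keeps the log spew inspectable after the run without any pipe-pressure.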
**Signature tells you it's THIS bug (vs. a real
code hang):**

- Multi-actor test under fork-based backend
  (`subint_forkserver`, eventually `trio_proc` too
  under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
  tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
  past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via `pgrep -af
  subint-forkserv` or similar after the hang —
  they're alive but blocked on `write()` to an
  inherited stdout fd.

**Historical reference:** this deadlock cost a
multi-session investigation (4 genuine cascade
fixes landed along the way) that only surfaced the
capture-pipe issue AFTER the deeper fixes let the
tree actually tear down enough to produce
pipe-filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the
workaround before digging.