Claude-perms: ensure /commit-msg files can be written!

Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging `test_nested_multierrors[subint_forkserver]`: 1. Skip-mark the test via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The reason string cross-refs the conc-anal doc for full context. 2. Update the conc-anal doc (`subint_forkserver_test_cancellation_leak_issue.md`) with the empirical state after the three nested- cancel fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan shield break) landed, narrowing the remaining hang from "everything broken" to "peer-channel loops don't exit on `service_tn` cancel". Deats from the DIAGDEBUG instrumentation pass, - 80 `process_messages` ENTERs, 75 EXITs → 5 stuck - ALL 40 `shield=True` ENTERs matched EXIT — the `_parent_chan_cs.cancel()` wiring from `57935804` works as intended for shielded loops. - the 5 stuck loops are all `shield=False` peer- channel handlers in `handle_stream_from_peer` (inbound connections handled by `stream_handler_tn`, which IS `service_tn` in the current config). - after `_parent_chan_cs.cancel()` fires, NEW shielded loops appear on the session reg_addr port — probably discovery-layer reconnection; doesn't block teardown but indicates the cascade has more moving parts than expected. The remaining unknown: why don't the 5 peer-channel loops exit when `service_tn.cancel_scope.cancel()` fires? They're not shielded, they're inside the service_tn scope, a standard cancel should propagate through. Some fork-config-specific divergence keeps them alive. Doc lists three follow-up experiments (stackscope dump, side-by-side `trio_proc` comparison, audit of the `tractor/ipc/_server.py:448` `except trio.Cancelled:` path). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 16:48:06 -04:00 · 2026-04-23 16:44:15 -04:00 · 2026-04-23 16:27:38 -04:00 · 2026-04-23 16:06:45 -04:00 · 2026-04-23 15:30:39 -04:00 · 2026-04-23 15:21:41 -04:00
10 changed files with 25 additions and 636 deletions
--- a/.claude/skills/conc-anal/SKILL.md
+++ b/.claude/skills/conc-anal/SKILL.md
@ -229,69 +229,3 @@ Unlike asyncio, trio allows checkpoints in
 that does `await` can itself be cancelled (e.g.
 by nursery shutdown). Watch for cleanup code that
 assumes it will run to completion.
-
-### Unbounded waits in cleanup paths
-
-Any `await <event>.wait()` in a teardown path is
-a latent deadlock unless the event's setter is
-GUARANTEED to fire. If the setter depends on
-external state (peer disconnects, child process
-exit, subsequent task completion) that itself
-depends on the current task's progress, you have
-a mutual wait.
-
-Rule: **bound every `await X.wait()` in cleanup
-paths with `trio.move_on_after()`** unless you
-can prove the setter is unconditionally reachable
-from the state at the await site. Concrete recent
-example: `ipc_server.wait_for_no_more_peers()` in
-`async_main`'s finally (see
-`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
-"probe iteration 3") — it was unbounded, and when
-one peer-handler was stuck the wait-for-no-more-
-peers event never fired, deadlocking the whole
-actor-tree teardown cascade.
-
-### The capture-pipe-fill hang pattern (grep this first)
-
-When investigating any hang in the test suite
-**especially under fork-based backends**, first
-check whether the hang reproduces under `pytest
-s` (`--capture=no`). If `-s` makes it go away
-you're not looking at a trio concurrency bug —
-you're looking at a Linux pipe-buffer fill.
-
-Mechanism: pytest replaces fds 1,2 with pipe
-write-ends. Fork-child subactors inherit those
-fds. High-volume error-log tracebacks (cancel
-cascade spew) fill the 64KB pipe buffer. Child
-`write()` blocks. Child can't exit. Parent's
-`waitpid`/pidfd wait blocks. Deadlock cascades up
-the tree.
-
-Pre-existing guards in `tests/conftest.py` encode
-this knowledge — grep these BEFORE blaming
-concurrency:
-
-```python
-# tests/conftest.py:258
-if loglevel in ('trace', 'debug'):
-    # XXX: too much logging will lock up the subproc (smh)
-    loglevel: str = 'info'
-
-# tests/conftest.py:316
-# can lock up on the `_io.BufferedReader` and hang..
-stderr: str = proc.stderr.read().decode()
-```
-
-Full post-mortem +
-`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
-for the canonical reproduction. Cost several
-investigation sessions before catching it —
-because the capture-pipe symptom was masked by
-deeper cascade-deadlocks. Once the cascades were
-fixed, the tree tore down enough to generate
-pipe-filling log volume → capture-pipe finally
-surfaced. Grep-note for future-self: **if a
-multi-subproc tractor test hangs, `pytest -s`
-first, conc-anal second.**
--- a/.claude/skills/run-tests/SKILL.md
+++ b/.claude/skills/run-tests/SKILL.md
@ -451,73 +451,3 @@ by your changes — note them and move on.
 **Rule of thumb**: if a test fails with `TooSlowError`,
 `trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
 touch the relevant code path, it's flaky — skip it.
-
-## 9. The pytest-capture hang pattern (CHECK THIS FIRST)
-
-**Symptom:** a tractor test hangs indefinitely under
-default `pytest` but passes instantly when you add
-`-s` (`--capture=no`).
-
-**Cause:** tractor subactors (especially under fork-
-based backends) inherit pytest's stdout/stderr
-capture pipes via fds 1,2. Under high-volume error
-logging (e.g. multi-level cancel cascade, nested
-`run_in_actor` failures, anything triggering
-`RemoteActorError` + `ExceptionGroup` traceback
-spew), the **64KB Linux pipe buffer fills** faster
-than pytest drains it. Subactor writes block → can't
-finish exit → parent's `waitpid`/pidfd wait blocks →
-deadlock cascades up the tree.
-
-**Pre-existing guards in the tractor harness** that
-encode this same knowledge — grep these FIRST
-before spelunking:
-
- `tests/conftest.py:258-260` (in the `daemon`
-  fixture): `# XXX: too much logging will lock up
-  the subproc (smh)` — downgrades `trace`/`debug`
-  loglevel to `info` to prevent the hang.
- `tests/conftest.py:316`: `# can lock up on the
-  _io.BufferedReader and hang..` — noted on the
-  `proc.stderr.read()` post-SIGINT.
-
-**Debug recipe (in priority order):**
-
-1. **Try `-s` first.** If the hang disappears with
-   `pytest -s`, you've confirmed it's capture-pipe
-   fill. Skip spelunking.
-2. **Lower the loglevel.** Default `--ll=error` on
-   this project; if you've bumped it to `debug` /
-   `info`, try dropping back. Each log level
-   multiplies pipe-pressure under fault cascades.
-3. **If you MUST use default capture + high log
-   volume**, redirect subactor stdout/stderr in the
-   child prelude (e.g.
-   `tractor.spawn._subint_forkserver._child_target`
-   post-`_close_inherited_fds`) to `/dev/null` or a
-   file.
-
-**Signature tells you it's THIS bug (vs. a real
-code hang):**
-
- Multi-actor test under fork-based backend
-  (`subint_forkserver`, eventually `trio_proc` too
-  under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
-  tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
-  past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via `pgrep -af
-  subint-forkserv` or similar after the hang —
-  they're alive but blocked on `write()` to an
-  inherited stdout fd.
-
-**Historical reference:** this deadlock cost a
-multi-session investigation (4 genuine cascade
-fixes landed along the way) that only surfaced the
-capture-pipe issue AFTER the deeper fixes let the
-tree actually tear down enough to produce pipe-
-filling log volume. Full post-mortem in
-`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
-Lesson codified here so future-me grep-finds the
-workaround before digging.
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@ -395,432 +395,6 @@ Candidate follow-up experiments:
  re-raise means it should still exit. Unless
  something higher up swallows it.

-### Attempted fix (DID NOT work) — hypothesis (3)
-
-Tried: in `_serve_ipc_eps` finally, after closing
-listeners, also iterate `server._peers` and
-sync-close each peer channel's underlying stream
-socket fd:
-
-```python
-for _uid, _chans in list(server._peers.items()):
-    for _chan in _chans:
-        try:
-            _stream = _chan._transport.stream if _chan._transport else None
-            if _stream is not None:
-                _stream.socket.close()  # sync fd close
-        except (AttributeError, OSError):
-            pass
-```
-
-Theory: closing the socket fd from outside the stuck
-recv task would make the recv see EBADF /
-ClosedResourceError and unblock.
-
-Result: `test_nested_multierrors[subint_forkserver]`
-still hangs identically. Either:
- The sync `socket.close()` doesn't propagate into
-  trio's in-flight `recv_some()` the way I expected
-  (trio may hold an internal reference that keeps the
-  fd open even after an external close), or
- The stuck recv isn't even the root blocker and the
-  peer handlers never reach the finally for some
-  reason I haven't understood yet.
-
-Either way, the sync-close hypothesis is **ruled
-out**. Reverted the experiment, restored the skip-
-mark on the test.
-
-### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior
-
-Tested explicitly: both with and without `-s`, the
-test hangs identically. So the capture-pipe-fill
-hypothesis is **ruled out** for this test.
-
-The earlier `test_context_stream_semantics.py` `-s`
-observation was most likely caused by a competing
-pytest run in my session (confirmed via process list
-— my leftover pytest was alive at that time and
-could have been holding state on the default
-registry port).
-
-## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER
-
-**New diagnostic run** instrumented
-`handle_stream_from_peer` at ENTER / `except
-trio.Cancelled:` / finally, plus `Actor.cancel()`
-just before `self._parent_chan_cs.cancel()`. Result:
-
- **40 `handle_stream_from_peer` ENTERs**.
- **0 `except trio.Cancelled:` hits** — cancel
-  never fires on any peer-handler.
- **35 finally hits** — those handlers exit via
-  peer-initiated EOF (normal return), NOT cancel.
- **5 handlers never reach finally** — stuck forever.
- **`Actor.cancel()` fired in 12 PIDs** — but the
-  PIDs with peer handlers that DIDN'T fire
-  Actor.cancel are exactly **root + 2 direct
-  spawners**. These 3 actors have peer handlers
-  (for their own subactors) that stay stuck because
-  **`Actor.cancel()` at these levels never runs**.
-
-### The actual deadlock shape
-
-`Actor.cancel()` lives in
-`open_root_actor.__aexit__` / `async_main` teardown.
-That only runs when the enclosing `async with
-tractor.open_nursery()` exits. The nursery's
-`__aexit__` calls the backend `*_proc` spawn target's
-teardown, which does `soft_kill() →
-_ForkedProc.wait()` on its child PID. That wait is
-trio-cancellable via pidfd now (good) — but nothing
-CANCELS it because the outer scope only cancels when
-`Actor.cancel()` runs, which only runs when the
-nursery completes, which waits on the child.
-
-It's a **multi-level mutual wait**:
-
-```
-root              blocks on spawner.wait()
-  spawner         blocks on grandchild.wait()
-    grandchild    blocks on errorer.wait()
-      errorer     Actor.cancel() ran, but process
-                  may not have fully exited yet
-                  (something in root_tn holding on?)
-```
-
-Each level waits for the level below. The bottom
-level (errorer) reaches Actor.cancel(), but its
-process may not fully exit — meaning its pidfd
-doesn't go readable, meaning the grandchild's
-waitpid doesn't return, meaning the grandchild's
-nursery doesn't unwind, etc. all the way up.
-
-### Refined question
-
-**Why does an errorer process not exit after its
-`Actor.cancel()` completes?**
-
-Possibilities:
-1. `_parent_chan_cs.cancel()` fires (shielded
-   parent-chan loop unshielded), but the task is
-   stuck INSIDE the shielded loop's recv in a way
-   that cancel still can't break.
-2. After `Actor.cancel()` returns, `async_main`
-   still has other tasks in `root_tn` waiting for
-   something that never arrives (e.g. outbound
-   IPC reply delivery).
-3. The `os._exit(rc)` in `_worker` (at
-   `_subint_forkserver.py`) doesn't run because
-   `_child_target` never returns.
-
-Next-session candidate probes (in priority order):
-
-1. **Instrument `_worker`'s fork-child branch** to
-   confirm whether `child_target()` returns (and
-   thus `os._exit(rc)` is reached) for errorer
-   PIDs. If yes → process should die; if no →
-   trace back into `_actor_child_main` /
-   `_trio_main` / `async_main` to find the stuck
-   spot.
-2. **Instrument `async_main`'s final unwind** to
-   see which await in the teardown doesn't
-   complete.
-3. **Compare under `trio_proc` backend** at the
-   same `_worker`-equivalent level to see where
-   the flows diverge.
-
-### Rule-out: NOT a stuck peer-chan recv
-
-Earlier hypothesis was that the 5 stuck peer-chan
-loops were blocked on a socket recv that cancel
-couldn't interrupt. This pass revealed the real
-cause: cancel **never reaches those tasks** because
-their owning actor's `Actor.cancel()` never runs.
-The recvs are fine — they're just parked because
-nothing is telling them to stop.
-
-## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`
-
-Yet another instrumentation pass — this time
-printing at:
-
- `_worker` child branch: `pre child_target()` /
-  `child_target RETURNED rc=N` / `about to
-  os._exit(rc)` 
- `_trio_main`: `about to trio.run` /
-  `trio.run RETURNED NORMALLY` / `FINALLY`
-
-**Fresh-run results** (`test_nested_multierrors[
-subint_forkserver]`, depth=1/breadth=2, 1 root + 14
-forked = 15 actors total):
-
- **9 processes completed the full flow** —
-  `trio.run RETURNED NORMALLY` → `child_target
-  RETURNED rc=0` → `about to os._exit(0)`. These
-  are the LEAVES of the tree (errorer actors) plus
-  their direct parents (depth-0 spawners). They
-  actually exit their processes.
- **5 processes are stuck INSIDE `trio.run(trio_main)`**
-  — they hit "about to trio.run" but NEVER see
-  "trio.run RETURNED NORMALLY". These are root +
-  top-level spawners + one intermediate.
-
-**What this means:** `async_main` itself is the
-deadlock holder, not the peer-channel loops.
-Specifically, the outer `async with root_tn:` in
-`async_main` never exits for the 5 stuck actors.
-Their `trio.run` never returns → `_trio_main`
-catch/finally never runs → `_worker` never reaches
-`os._exit(rc)` → the PROCESS never dies → its
-parent's `_ForkedProc.wait()` blocks → parent's
-nursery hangs → parent's `async_main` hangs → ...
-
-### The new precise question
-
-**What task in the 5 stuck actors' `async_main`
-never completes?** Candidates:
-
-1. The shielded parent-chan `process_messages`
-   task in `root_tn` — but we explicitly cancel it
-   via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
-   However, `Actor.cancel()` only runs during
-   `open_root_actor.__aexit__`, which itself runs
-   only after `async_main`'s outer unwind — which
-   doesn't happen. So the shield isn't broken.
-
-2. `await actor_nursery._join_procs.wait()` or
-   similar in the inline backend `*_proc` flow.
-
-3. `_ForkedProc.wait()` on a grandchild that
-   actually DID exit — but the pidfd_open watch
-   didn't fire for some reason (race between
-   pidfd_open and the child exiting?).
-
-The most specific next probe: **add DIAG around
-`_ForkedProc.wait()` enter/exit** to see whether
-the pidfd-based wait returns for every grandchild
-exit. If a stuck parent's `_ForkedProc.wait()`
-NEVER returns despite its child exiting, the
-pidfd mechanism has a race bug under nested
-forkserver.
-
-Alternative probe: instrument `async_main`'s outer
-nursery exits to find which nursery's `__aexit__`
-is stuck, drilling down from `trio.run` to the
-specific `async with` that never completes.
-
-### Cascade summary (updated tree view)
-
-```
-ROOT (pytest)                       STUCK in trio.run
-├── top_0 (spawner, d=1)            STUCK in trio.run
-│   ├── spawner_0_d1_0 (d=0)        exited (os._exit 0)
-│   │   ├── errorer_0_0             exited (os._exit 0)
-│   │   └── errorer_0_1             exited (os._exit 0)
-│   └── spawner_0_d1_1 (d=0)        exited (os._exit 0)
-│       ├── errorer_0_2             exited (os._exit 0)
-│       └── errorer_0_3             exited (os._exit 0)
-└── top_1 (spawner, d=1)            STUCK in trio.run
-    ├── spawner_1_d1_0 (d=0)        STUCK in trio.run (sibling race?)
-    │   ├── errorer_1_0             exited
-    │   └── errorer_1_1             exited
-    └── spawner_1_d1_1 (d=0)        STUCK in trio.run
-        ├── errorer_1_2             exited
-        └── errorer_1_3             exited
-```
-
-Grandchildren (d=0 spawners) exit OR stick —
-asymmetric. Not purely depth-determined. Some race
-condition in nursery teardown when multiple
-siblings error simultaneously.
-
-## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
-
-Further DIAGDEBUG at every milestone in `async_main`
-(runtime UP / EXITED service_tn / EXITED root_tn /
-FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
-ENTER/RETURNED per-pidfd. Result:
-
-**Every stuck actor reaches `async_main: FINALLY
-ENTER` but NOT `async_main: RETURNING`.**
-
-That isolates the hang to a specific await in
-`async_main`'s finally block at
-`tractor/runtime/_runtime.py:1837+`. The suspect:
-
-```python
-# Ensure all peers (actors connected to us as clients) are finished
-if ipc_server := actor.ipc_server and ipc_server.has_peers(check_chans=True):
-    ...
-    await ipc_server.wait_for_no_more_peers()  # ← UNBOUNDED, blocks forever
-```
-
-`_no_more_peers` is an `Event` set only when
-`server._peers` empties (see
-`ipc/_server.py:526-530`). If ANY peer-handler is
-stuck (the 5 unclosed loops from the earlier pass),
-it keeps its channel in `server._peers`, so the
-event never fires, so the wait hangs.
-
-### Applied fix (partial, landed as defensive-in-depth)
-
-`tractor/runtime/_runtime.py:1981` —
-`wait_for_no_more_peers()` call now wrapped in
-`trio.move_on_after(3.0)` + a warning log when the
-timeout fires. Commented with the full rationale.
-
-**Verified:** with this fix, ALL 15 actors reach
-`async_main: RETURNING` cleanly (up from 10/15
-reaching end before).
-
-**Unfortunately:** the test still hangs past 45s
-total — meaning there's YET ANOTHER unbounded wait
-downstream of `async_main`. The bounded
-`wait_for_no_more_peers` unblocks one level, but
-the cascade has another level above it.
-
-### Candidates for the remaining hang
-
-1. `open_root_actor`'s own finally / post-
-   `async_main` flow in `_root.py` — specifically
-   `await actor.cancel(None)` which has its own
-   internal waits.
-2. The `trio.run()` itself doesn't return even
-   after the root task completes because trio's
-   nursery still has background tasks running.
-3. Maybe `_serve_ipc_eps`'s finally has an await
-   that blocks when peers aren't clearing.
-
-### Current stance
-
- Defensive `wait_for_no_more_peers` bound landed
-  (good hygiene regardless). Revealing a real
-  deadlock-avoidance gap in tractor's cleanup.
- Test still hangs → skip-mark restored on
-  `test_nested_multierrors[subint_forkserver]`.
- The full chain of unbounded waits needs another
-  session of drilling, probably at
-  `open_root_actor` / `actor.cancel` level.
-
-### Summary of this investigation's wins
-
-1. **FD hygiene fix** (`_close_inherited_fds`) —
-   correct, closed orphan-SIGINT sibling issue.
-2. **pidfd-based `_ForkedProc.wait`** — cancellable,
-   matches trio_proc pattern.
-3. **`_parent_chan_cs` wiring** —
-   `Actor.cancel()` now breaks the shielded parent-
-   chan `process_messages` loop.
-4. **`wait_for_no_more_peers` bounded** —
-   prevents the actor-level finally hang.
-5. **Ruled-out hypotheses:** tree-kill missing
-   (wrong), stuck socket recv (wrong).
-6. **Pinpointed remaining unknown:** at least one
-   more unbounded wait in the teardown cascade
-   above `async_main`. Concrete candidates
-   enumerated above.
-
-## Update — 2026-04-23 (VERY late): pytest capture pipe IS the final gate
-
-After landing fixes 1-4 and instrumenting every
-layer down to `tractor_test`'s `trio.run(_main)`:
-
-**Empirical result: with `pytest -s` the test PASSES
-in 6.20s.** Without `-s` (default `--capture=fd`) it
-hangs forever.
-
-DIAG timeline for the root pytest PID (with `-s`
-implied from later verification):
-
-```
-tractor_test: about to trio.run(_main)
-open_root_actor: async_main task started, yielding to test body
-_main: about to await wrapped test fn
-_main: wrapped RETURNED cleanly        ← test body completed!
-open_root_actor: about to actor.cancel(None)
-Actor.cancel ENTER req_chan=False
-Actor.cancel RETURN
-open_root_actor: actor.cancel RETURNED
-open_root_actor: outer FINALLY
-open_root_actor: finally END (returning from ctxmgr)
-tractor_test: trio.run FINALLY (returned or raised)  ← trio.run fully returned!
-```
-
-`trio.run()` fully returns. The test body itself
-completes successfully (pytest.raises absorbed the
-expected `BaseExceptionGroup`). What blocks is
-**pytest's own stdout/stderr capture** — under
-`--capture=fd` default, pytest replaces the parent
-process's fd 1,2 with pipe write-ends it's reading
-from. Fork children inherit those pipe fds
-(because `_close_inherited_fds` correctly preserves
-stdio). High-volume subactor error-log tracebacks
-(7+ actors each logging multiple
-`RemoteActorError`/`ExceptionGroup` tracebacks on
-the error-propagation cascade) fill the 64KB Linux
-pipe buffer. Subactor writes block. Subactor can't
-progress. Process doesn't exit. Parent's
-`_ForkedProc.wait` (now pidfd-based and
-cancellable, but nothing's cancelling here since
-the test body already completed) keeps the pipe
-reader alive... but pytest isn't draining its end
-fast enough because test-teardown/fixture-cleanup
-is in progress.
-
-**Actually** the exact mechanism is slightly
-different: pytest's capture fixture MIGHT be
-actively reading, but faster-than-writer subactors
-overflow its internal buffer. Or pytest might be
-blocked itself on the finalization step.
-
-Either way, `-s` conclusively fixes it.
-
-### Why I ruled this out earlier (and shouldn't have)
-
-Earlier in this investigation I tested
-`test_nested_multierrors` with/without `-s` and
-both hung. That's because AT THAT TIME, fixes 1-4
-weren't all in place yet. The test was hanging at
-multiple deeper levels long before reaching the
-"generate lots of error-log output" phase. Once
-the cascade actually tore down cleanly, enough
-output was produced to hit the capture-pipe limit.
-
-**Classic order-of-operations mistake in
-debugging:** ruling something out too early based
-on a test that was actually failing for a
-different reason.
-
-### Fix direction (next session)
-
-Redirect subactor stdout/stderr to `/dev/null` (or
-a session-scoped log file) in the fork-child
-prelude, right after `_close_inherited_fds()`. This
-severs the inherited pytest-capture pipes and lets
-subactor output flow elsewhere. Under normal
-production use (non-pytest), stdout/stderr would
-be the TTY — we'd want to keep that. So the
-redirect should be conditional or opt-in via the
-`child_sigint`/proc_kwargs flag family.
-
-Alternative: document as a gotcha and recommend
-`pytest -s` for any tests using the
-`subint_forkserver` backend with multi-level actor
-trees. Simpler, user-visible, no code change.
-
-### Current state
-
- Skip-mark on `test_nested_multierrors[subint_forkserver]`
-  restored with reason pointing here.
- Test confirmed passing with `-s` after all 4
-  cascade fixes applied.
- The 4 cascade fixes are NOT wasted — they're
-  correct hardening regardless of the capture-pipe
-  issue, AND without them we'd never reach the
-  "actually produces enough output to fill the
-  pipe" state.
-
 ## Stopgap (landed)

 `test_nested_multierrors` skip-marked under
--- a/ai/prompt-io/prompts/subints_spawner.md
+++ b/ai/prompt-io/prompts/subints_spawner.md
@ -24,7 +24,7 @@ Part of this work should include,
    is happening under the hood with how cpython implements subints.

  * default configuration should encourage state isolation as with
-    subprocs, but explicit public escape hatches to enable rigorously
+    with subprocs, but explicit public escape hatches to enable rigorously
    managed shm channels for high performance apps.

 - all tests should be (able to be) parameterized to use the new
--- a/tests/spawn/test_subint_forkserver.py
+++ b/tests/spawn/test_subint_forkserver.py
@ -446,23 +446,15 @@ def _process_alive(pid: int) -> bool:
        return False


-# Regressed back to xfail: previously passed after the
-# fork-child FD-hygiene fix in `_close_inherited_fds()`,
-# but the recent `wait_for_no_more_peers(move_on_after=3.0)`
-# bound in `async_main`'s teardown added up to 3s to the
-# orphan subactor's exit timeline, pushing it past the
-# test's 10s poll window. Real fix requires making the
-# bounded wait faster when the actor is orphaned, or
-# increasing the test's poll window. See tracker doc
-# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`.
-@pytest.mark.xfail(
-    strict=True,
-    reason=(
-        'Regressed to xfail after `wait_for_no_more_peers` '
-        'bound added ~3s teardown latency. Needs either '
-        'faster orphan-side teardown or 15s test poll window.'
-    ),
-)
+# NOTE: was previously `@pytest.mark.xfail(strict=True, ...)`
+# for the orphan-SIGINT hang documented in
+# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
+# — now passes after the fork-child FD-hygiene fix in
+# `tractor.spawn._subint_forkserver._close_inherited_fds()`:
+# closing all inherited FDs (including the parent's IPC
+# listener + trio-epoll + wakeup-pipe FDs) lets the child's
+# trio event loop respond cleanly to external SIGINT.
+# Leaving the test in place as a regression guard.
@pytest.mark.timeout(
    30,
    method='thread',
--- a/tests/test_cancellation.py
+++ b/tests/test_cancellation.py
@ -455,16 +455,14 @@ async def spawn_and_error(
@pytest.mark.skipon_spawn_backend(
    'subint_forkserver',
    reason=(
-        'Passes cleanly with `pytest -s` (no stdout capture) '
-        'but hangs under default `--capture=fd` due to '
-        'pytest-capture-pipe buffer fill from high-volume '
-        'subactor error-log traceback output inherited via fds '
-        '1,2 in fork children. Fix direction: redirect subactor '
-        'stdout/stderr to `/dev/null` in `_child_target` / '
-        '`_actor_child_main` so forkserver children don\'t hold '
-        'pytest\'s capture pipe open. See `ai/conc-anal/'
+        'Multi-level fork-spawn cancel cascade hang — '
+        'peer-channel `process_messages` loops do not '
+        'exit on `service_tn.cancel_scope.cancel()`. '
+        'See `ai/conc-anal/'
        'subint_forkserver_test_cancellation_leak_issue.md` '
-        '"Update — pytest capture pipe is the final gate".'
+        'for the full diagnosis + candidate fix directions. '
+        'Drop this mark once the peer-chan-loop exit issue '
+        'is closed.'
    ),
 )
@pytest.mark.timeout(
--- a/tractor/runtime/_runtime.py
+++ b/tractor/runtime/_runtime.py
@ -1973,25 +1973,7 @@ async def async_main(
                f'   {pformat(ipc_server._peers)}'
            )
            log.runtime(teardown_report)
-            # NOTE: bound the peer-clear wait — otherwise if any
-            # peer-channel handler is stuck (e.g. never got its
-            # cancel propagated due to a runtime bug), this wait
-            # blocks forever and deadlocks the whole actor-tree
-            # teardown cascade. 3s is enough for any graceful
-            # cancel-ack round-trip; beyond that we're in bug
-            # territory and need to proceed with local teardown
-            # so the parent's `_ForkedProc.wait()` can unblock.
-            # See `ai/conc-anal/
-            # subint_forkserver_test_cancellation_leak_issue.md`
-            # for the full diagnosis.
-            with trio.move_on_after(3.0) as _peers_cs:
            await ipc_server.wait_for_no_more_peers()
-            if _peers_cs.cancelled_caught:
-                teardown_report += (
-                    f'-> TIMED OUT waiting for peers to clear '
-                    f'({len(ipc_server._peers)} still connected)\n'
-                )
-                log.warning(teardown_report)

        teardown_report += (
            '-]> all peer channels are complete.\n'
--- a/tractor/spawn/_subint_forkserver.py
+++ b/tractor/spawn/_subint_forkserver.py
@ -252,13 +252,9 @@ def _close_inherited_fds(
            os.close(fd)
            closed += 1
        except OSError:
-            # fd was already closed (race with listdir) or otherwise
-            # unclosable — either is fine.
-            log.exception(
-                f'Failed to close inherited fd in child ??\n'
-                f'{fd!r}\n'
-            )
-
+            # fd was already closed (race with listdir) or
+            # otherwise unclosable — either is fine.
+            pass
    return closed


@ -405,17 +401,11 @@ def fork_from_worker_thread(
        try:
            os.close(rfd)
        except OSError:
-            log.exception(
-                f'Failed to close PID-pipe read-fd in parent ??\n'
-                f'{rfd!r}\n'
-            )
+            pass
        try:
            os.close(wfd)
        except OSError:
-            log.exception(
-                f'Failed to close PID-pipe write-fd in parent ??\n'
-                f'{wfd!r}\n'
-            )
+            pass
        raise RuntimeError(
            f'subint-forkserver worker thread '
            f'{thread_name!r} did not return within '
@ -485,13 +475,6 @@ def run_subint_in_worker_thread(
            _interpreters.exec(interp_id, bootstrap)
        except BaseException as e:
            err = e
-            log.exception(
-                f'Failed to .exec() in subint ??\n'
-                f'_interpreters.exec(\n'
-                f'    interp_id={interp_id!r},\n'
-                f'    bootstrap={bootstrap!r},\n'
-                f') => {err!r}\n'
-            )

    worker: threading.Thread = threading.Thread(
        target=_drive,
--- a/tractor/spawn/_trio.py
+++ b/tractor/spawn/_trio.py
@ -246,11 +246,7 @@ async def trio_proc(
                            await proc.wait()

                await debug.maybe_wait_for_debugger(
-                    # NOTE: use the child's `_runtime_vars`
-                    # (the fn-arg dict shipped via `SpawnSpec`)
-                    # — NOT `get_runtime_vars()` which returns
-                    # the *parent's* live runtime state.
-                    child_in_debug=_runtime_vars.get(
+                    child_in_debug=get_runtime_vars().get(
                        '_debug_mode', False
                    ),
                    header_msg=(