From ab86f7613de2a438cad64ff23829e75c809ea473 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 21:23:11 -0400
Subject: [PATCH] Refine `subint_forkserver` cancel-cascade diag
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Third diagnostic pass on the `test_nested_multierrors[subint_forkserver]`
hang. Two prior hypotheses ruled out + a new, more specific deadlock
shape identified.

Ruled out:
- **capture-pipe fill** (`-s` flag changes test): retested explicitly —
  `test_nested_multierrors` hangs identically with and without `-s`.
  The earlier observation was likely a competing pytest process I had
  running in another session holding registry state.
- **stuck peer-chan recv that cancel can't break**: pivot from the
  prior pass. With `handle_stream_from_peer` instrumented at ENTER /
  `except trio.Cancelled:` / finally: 40 ENTERs, ZERO `trio.Cancelled`
  hits. Cancel never reaches those tasks at all — the recvs are fine,
  nothing is telling them to stop.

Actual deadlock shape: multi-level mutual wait.

  root blocks on spawner.wait()
    spawner blocks on grandchild.wait()
      grandchild blocks on errorer.wait()
        errorer Actor.cancel() ran, but proc never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in root + the 2 direct
spawners. Those 3 have peer handlers stuck because their own
`Actor.cancel()` never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on `_ForkedProc.wait()`
for the child pidfd to signal, which only signals when the child
process fully exits.

Refined question: **why does an errorer process not exit after its
`Actor.cancel()` completes?**

Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the shielded loop's recv is
   stuck in a way cancel still can't break
2. `async_main`'s post-cancel unwind has other tasks in `root_tn`
   awaiting something that never arrives (e.g. an outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because `_child_target`
   never returns

Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch — confirm whether
   `child_target()` returns / `os._exit(rc)` is reached for errorer
   PIDs
2. instrument `async_main`'s final unwind — see which await in
   teardown doesn't complete
3. compare under the `trio_proc` backend at the equivalent level to
   spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 123 +++++++++++++++---
 1 file changed, 106 insertions(+), 17 deletions(-)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index f273a304..66e51bab 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -431,25 +431,114 @@ Either way, the sync-close hypothesis is **ruled out**.
 
 Reverted the experiment, restored the skip-
 mark on the test.
 
-### Aside: `-s` flag changes behavior for peer-intensive tests
+### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior
 
-While exploring, noticed
-`tests/test_context_stream_semantics.py` under
-`--spawn-backend=subint_forkserver` hangs with
-pytest's default `--capture=fd` but passes with
-`-s` (`--capture=no`). Hypothesis (unverified): fork
-children inherit pytest's capture pipe for stdout/
-stderr (fds 1,2 — we preserve these in
-`_close_inherited_fds`). When subactor logging is
-verbose, the capture pipe buffer fills, writes block,
-child can't progress, deadlock.
+Tested explicitly: both with and without `-s`, the
+test hangs identically. So the capture-pipe-fill
+hypothesis is **ruled out** for this test.
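(Editor's aside: the mechanism behind the now-ruled-out hypothesis is still worth seeing in isolation. The sketch below is plain-stdlib illustration, NOT tractor or pytest code: a pipe whose read end is never drained accepts only a finite amount of data, so a fork-child that inherited it as stdout/stderr would block once the kernel buffer fills. Non-blocking writes are used so the demo itself cannot hang.)

```python
import os

# A pipe with no reader has a finite kernel buffer.  Fill it and show
# that writes eventually refuse more data -- the point where a
# *blocking* writer (e.g. a chatty fork-child) would deadlock.
r, w = os.pipe()
os.set_blocking(w, False)  # non-blocking so this demo can't hang

total = 0
try:
    while True:
        total += os.write(w, b"x" * 4096)
except BlockingIOError:
    # Buffer full (64 KiB by default on Linux); with blocking writes
    # and no reader, the writer would be parked here forever.
    pass

print(f"pipe refused further writes after {total} bytes")
os.close(w)
os.close(r)
```

(The exact capacity is platform-tunable; the point is only that it is finite.)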
-If confirmed, fix direction: redirect subactor
-stdout/stderr to `/dev/null` (or a file) in
-`_actor_child_main` so subactors don't hold pytest's
-capture pipe open. Not a blocker on the main
-peer-chan-loop investigation; deserves its own mini-
-tracker.
+The earlier `test_context_stream_semantics.py` `-s`
+observation was most likely caused by a competing
+pytest run in my session (confirmed via process list
+— my leftover pytest was alive at that time and
+could have been holding state on the default
+registry port).
+
+## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER
+
+**New diagnostic run** instrumented
+`handle_stream_from_peer` at ENTER / `except
+trio.Cancelled:` / finally, plus `Actor.cancel()`
+just before `self._parent_chan_cs.cancel()`. Result:
+
+- **40 `handle_stream_from_peer` ENTERs**.
+- **0 `except trio.Cancelled:` hits** — cancel
+  never fires on any peer-handler.
+- **35 finally hits** — those handlers exit via
+  peer-initiated EOF (normal return), NOT cancel.
+- **5 handlers never reach finally** — stuck forever.
+- **`Actor.cancel()` fired in 12 PIDs** — but the
+  PIDs with peer handlers that DIDN'T fire
+  `Actor.cancel()` are exactly **root + 2 direct
+  spawners**. These 3 actors have peer handlers
+  (for their own subactors) that stay stuck because
+  **`Actor.cancel()` at these levels never runs**.
+
+### The actual deadlock shape
+
+`Actor.cancel()` lives in
+`open_root_actor.__aexit__` / `async_main` teardown.
+That only runs when the enclosing `async with
+tractor.open_nursery()` exits. The nursery's
+`__aexit__` calls the backend `*_proc` spawn target's
+teardown, which does `soft_kill()` →
+`_ForkedProc.wait()` on its child PID. That wait is
+trio-cancellable via pidfd now (good) — but nothing
+CANCELS it because the outer scope only cancels when
+`Actor.cancel()` runs, which only runs when the
+nursery completes, which waits on the child.
+
+It's a **multi-level mutual wait**:
+
+```
+root blocks on spawner.wait()
+  spawner blocks on grandchild.wait()
+    grandchild blocks on errorer.wait()
+      errorer Actor.cancel() ran, but process
+      may not have fully exited yet
+      (something in root_tn holding on?)
+```
+
+Each level waits for the level below. The bottom
+level (errorer) reaches `Actor.cancel()`, but its
+process may not fully exit — meaning its pidfd
+doesn't go readable, meaning the grandchild's
+waitpid doesn't return, meaning the grandchild's
+nursery doesn't unwind, etc. all the way up.
+
+### Refined question
+
+**Why does an errorer process not exit after its
+`Actor.cancel()` completes?**
+
+Possibilities:
+1. `_parent_chan_cs.cancel()` fires (shielded
+   parent-chan loop unshielded), but the task is
+   stuck INSIDE the shielded loop's recv in a way
+   that cancel still can't break.
+2. After `Actor.cancel()` returns, `async_main`
+   still has other tasks in `root_tn` waiting for
+   something that never arrives (e.g. outbound
+   IPC reply delivery).
+3. The `os._exit(rc)` in `_worker` (in
+   `_subint_forkserver.py`) doesn't run because
+   `_child_target` never returns.
+
+Next-session candidate probes (in priority order):
+
+1. **Instrument `_worker`'s fork-child branch** to
+   confirm whether `child_target()` returns (and
+   thus `os._exit(rc)` is reached) for errorer
+   PIDs. If yes → the process should die; if no →
+   trace back into `_actor_child_main` /
+   `_trio_main` / `async_main` to find the stuck
+   spot.
+2. **Instrument `async_main`'s final unwind** to
+   see which await in the teardown doesn't
+   complete.
+3. **Compare under the `trio_proc` backend** at the
+   same `_worker`-equivalent level to see where
+   the flows diverge.
+
+### Rule-out: NOT a stuck peer-chan recv
+
+Earlier hypothesis was that the 5 stuck peer-chan
+loops were blocked on a socket recv that cancel
+couldn't interrupt. This pass revealed the real
+cause: cancel **never reaches those tasks** because
+their owning actor's `Actor.cancel()` never runs.
+The recvs are fine — they're just parked because
+nothing is telling them to stop.
 
 ## Stopgap (landed)