Narrow forkserver hang to `async_main` outer tn
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
Fresh-run results:
- **9 processes complete the full flow**:
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `os._exit(0)`. These are tree
LEAVES (errorers) plus their direct parents
(depth-0 spawners) — they actually exit.
- **5 processes stuck INSIDE `trio.run(trio_
main)`**: hit "about to trio.run" but never
see "trio.run RETURNED NORMALLY". These are
root + top-level spawners + one intermediate.
The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:
trio.run never returns
→ _trio_main finally never runs
→ _worker never reaches os._exit(rc)
→ process never dies
→ parent's _ForkedProc.wait() blocks
→ parent's nursery hangs
→ parent's async_main hangs
→ (recurse up)
The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
in `root_tn` — but we cancel it via
`_parent_chan_cs.cancel()` in `Actor.cancel()`,
which only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken in
this path.
2. `actor_nursery._join_procs.wait()` or similar
inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
exit — but pidfd_open watch didn't fire (race
between `pidfd_open` and the child exiting?).
Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent
ab86f7613d
commit
4d0555435b
@@ -540,6 +540,101 @@ their owning actor's `Actor.cancel()` never runs.

The recvs are fine — they're just parked because
nothing is telling them to stop.
## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`

Yet another instrumentation pass — this time printing at:

- `_worker` child branch: `pre child_target()` / `child_target RETURNED rc=N` / `about to os._exit(rc)`
- `_trio_main`: `about to trio.run` / `trio.run RETURNED NORMALLY` / `FINALLY`

**Fresh-run results** (`test_nested_multierrors[subint_forkserver]`, depth=1/breadth=2, 1 root + 14 forked = 15 actors total):

- **9 processes completed the full flow** — `trio.run RETURNED NORMALLY` → `child_target RETURNED rc=0` → `about to os._exit(0)`. These are the LEAVES of the tree (errorer actors) plus their direct parents (depth-0 spawners). They actually exit their processes.
- **5 processes are stuck INSIDE `trio.run(trio_main)`** — they hit "about to trio.run" but NEVER see "trio.run RETURNED NORMALLY". These are root + top-level spawners + one intermediate.

**What this means:** `async_main` itself is the deadlock holder, not the peer-channel loops. Specifically, the outer `async with root_tn:` in `async_main` never exits for the 5 stuck actors. Their `trio.run` never returns → `_trio_main` catch/finally never runs → `_worker` never reaches `os._exit(rc)` → the PROCESS never dies → its parent's `_ForkedProc.wait()` blocks → parent's nursery hangs → parent's `async_main` hangs → ...

### The new precise question

**What task in the 5 stuck actors' `async_main` never completes?** Candidates:
1. The shielded parent-chan `process_messages` task in `root_tn` — but we explicitly cancel it via `_parent_chan_cs.cancel()` in `Actor.cancel()`. However, `Actor.cancel()` only runs during `open_root_actor.__aexit__`, which itself runs only after `async_main`'s outer unwind — which doesn't happen. So the shield isn't broken in this path.
2. `await actor_nursery._join_procs.wait()` or similar in the inline backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that actually DID exit — but the pidfd_open watch didn't fire for some reason (race between pidfd_open and the child exiting?).

The most specific next probe: **add DIAG around `_ForkedProc.wait()` enter/exit** to see whether the pidfd-based wait returns for every grandchild exit. If a stuck parent's `_ForkedProc.wait()` NEVER returns despite its child exiting, the pidfd mechanism has a race bug under nested forkserver.

Alternative probe: instrument `async_main`'s outer nursery exits to find which nursery's `__aexit__` is stuck, drilling down from `trio.run` to the specific `async with` that never completes.

### Cascade summary (updated tree view)

```
ROOT (pytest)                     STUCK in trio.run
├── top_0 (spawner, d=1)          STUCK in trio.run
│   ├── spawner_0_d1_0 (d=0)      exited (os._exit 0)
│   │   ├── errorer_0_0           exited (os._exit 0)
│   │   └── errorer_0_1           exited (os._exit 0)
│   └── spawner_0_d1_1 (d=0)      exited (os._exit 0)
│       ├── errorer_0_2           exited (os._exit 0)
│       └── errorer_0_3           exited (os._exit 0)
└── top_1 (spawner, d=1)          STUCK in trio.run
    ├── spawner_1_d1_0 (d=0)      STUCK in trio.run (sibling race?)
    │   ├── errorer_1_0           exited
    │   └── errorer_1_1           exited
    └── spawner_1_d1_1 (d=0)      STUCK in trio.run
        ├── errorer_1_2           exited
        └── errorer_1_3           exited
```

Grandchildren (d=0 spawners) exit OR stick — asymmetric. Not purely depth-determined. Some race condition in nursery teardown when multiple siblings error simultaneously.

## Stopgap (landed)

`test_nested_multierrors` skip-marked under