From 4d0555435b549dbf15ec8b47650255db9bac29ad Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 21:36:19 -0400
Subject: [PATCH] Narrow forkserver hang to `async_main` outer tn
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fourth diagnostic pass — instrument `_worker`'s fork-child branch
(`pre child_target()` / `child_target RETURNED rc=N` /
`about to os._exit(rc)`) and `_trio_main` boundaries
(`about to trio.run` / `trio.run RETURNED NORMALLY` / `FINALLY`).

Test config: depth=1/breadth=2 = 1 root + 14 forked = 15 actors total.

Fresh-run results:

- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target RETURNED rc=0` →
  `os._exit(0)`. These are the tree LEAVES (errorers) plus their
  direct parents (depth-0 spawners) — they actually exit.
- **5 processes stuck INSIDE `trio.run(trio_main)`**: they hit
  "about to trio.run" but never see "trio.run RETURNED NORMALLY".
  These are the root + top-level spawners + one intermediate.

The deadlock is in `async_main` itself, NOT the peer-channel loops.
Specifically, the outer `async with root_tn:` in `async_main` never
exits for the 5 stuck actors, so the cascade wedges:

  trio.run never returns
  → _trio_main finally never runs
  → _worker never reaches os._exit(rc)
  → process never dies
  → parent's _ForkedProc.wait() blocks
  → parent's nursery hangs
  → parent's async_main hangs
  → (recurse up)

The precise new question: **what task in the 5 stuck actors'
`async_main` never completes?** Candidates:

1. The shielded parent-chan `process_messages` task in `root_tn` —
   but we cancel it via `_parent_chan_cs.cancel()` in
   `Actor.cancel()`, which only runs during
   `open_root_actor.__aexit__`, which itself runs only after
   `async_main`'s outer unwind — which doesn't happen. So the shield
   isn't broken in this path.

2. `actor_nursery._join_procs.wait()` or similar inline in the
   backend `*_proc` flow.

3. 
   `_ForkedProc.wait()` on a grandchild that DID exit — but the
   pidfd_open watch didn't fire (race between `pidfd_open` and the
   child exiting?).

Most specific next probe: add DIAG around `_ForkedProc.wait()`
enter/exit to see whether the pidfd-based wait returns for every
grandchild exit. If a stuck parent's `_ForkedProc.wait()` never
returns despite its child exiting → the pidfd mechanism has a race
bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0 spawners exit
cleanly, others stick, even though they started identically. Not
purely depth-determined — some race condition in nursery teardown
when multiple siblings error simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index 66e51bab..0a1c4f5b 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -540,6 +540,101 @@
 their owning actor's `Actor.cancel()` never runs.
 The recvs are fine — they're just parked because
 nothing is telling them to stop.
+## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`
+
+Yet another instrumentation pass — this time
+printing at:
+
+- `_worker` child branch: `pre child_target()` /
+  `child_target RETURNED rc=N` /
+  `about to os._exit(rc)`
+- `_trio_main`: `about to trio.run` /
+  `trio.run RETURNED NORMALLY` / `FINALLY`
+
+**Fresh-run results**
+(`test_nested_multierrors[subint_forkserver]`,
+depth=1/breadth=2, 1 root + 14 forked = 15 actors
+total):
+
+- **9 processes completed the full flow** —
+  `trio.run RETURNED NORMALLY` →
+  `child_target RETURNED rc=0` →
+  `about to os._exit(0)`. These are the LEAVES of
+  the tree (errorer actors) plus their direct
+  parents (depth-0 spawners). They actually exit
+  their processes.
+- **5 processes are stuck INSIDE
+  `trio.run(trio_main)`** — they hit "about to
+  trio.run" but NEVER see "trio.run RETURNED
+  NORMALLY". These are root + top-level spawners
+  + one intermediate.
+
+**What this means:** `async_main` itself is the
+deadlock holder, not the peer-channel loops.
+Specifically, the outer `async with root_tn:` in
+`async_main` never exits for the 5 stuck actors.
+Their `trio.run` never returns → `_trio_main`
+catch/finally never runs → `_worker` never reaches
+`os._exit(rc)` → the PROCESS never dies → its
+parent's `_ForkedProc.wait()` blocks → parent's
+nursery hangs → parent's `async_main` hangs → ...
+
+### The new precise question
+
+**What task in the 5 stuck actors' `async_main`
+never completes?** Candidates:
+
+1. The shielded parent-chan `process_messages`
+   task in `root_tn` — but we explicitly cancel it
+   via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
+   However, `Actor.cancel()` only runs during
+   `open_root_actor.__aexit__`, which itself runs
+   only after `async_main`'s outer unwind — which
+   doesn't happen. So the shield isn't broken.
+
+2. `await actor_nursery._join_procs.wait()` or
+   similar in the inline backend `*_proc` flow.
+
+3. 
`_ForkedProc.wait()` on a grandchild that
+   actually DID exit — but the pidfd_open watch
+   didn't fire for some reason (race between
+   pidfd_open and the child exiting?).
+
+The most specific next probe: **add DIAG around
+`_ForkedProc.wait()` enter/exit** to see whether
+the pidfd-based wait returns for every grandchild
+exit. If a stuck parent's `_ForkedProc.wait()`
+NEVER returns despite its child exiting, the
+pidfd mechanism has a race bug under nested
+forkserver.
+
+Alternative probe: instrument `async_main`'s outer
+nursery exits to find which nursery's `__aexit__`
+is stuck, drilling down from `trio.run` to the
+specific `async with` that never completes.
+
+### Cascade summary (updated tree view)
+
+```
+ROOT (pytest)                    STUCK in trio.run
+├── top_0 (spawner, d=1)         STUCK in trio.run
+│   ├── spawner_0_d1_0 (d=0)     exited (os._exit 0)
+│   │   ├── errorer_0_0          exited (os._exit 0)
+│   │   └── errorer_0_1          exited (os._exit 0)
+│   └── spawner_0_d1_1 (d=0)     exited (os._exit 0)
+│       ├── errorer_0_2          exited (os._exit 0)
+│       └── errorer_0_3          exited (os._exit 0)
+└── top_1 (spawner, d=1)         STUCK in trio.run
+    ├── spawner_1_d1_0 (d=0)     STUCK in trio.run (sibling race?)
+    │   ├── errorer_1_0          exited
+    │   └── errorer_1_1          exited
+    └── spawner_1_d1_1 (d=0)     STUCK in trio.run
+        ├── errorer_1_2          exited
+        └── errorer_1_3          exited
+```
+
+Grandchildren (d=0 spawners) exit OR stick —
+asymmetric. Not purely depth-determined. Some race
+condition in nursery teardown when multiple
+siblings error simultaneously.
+
 
 ## Stopgap (landed)
 
 `test_nested_multierrors` skip-marked under