Narrow forkserver hang to `async_main` outer tn

Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
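
The 14-forked count can be sanity-checked against the
cascade tree shown below (three levels of breadth-2
fan-out under the root: top-level spawners → d=0
spawners → leaf errorers). A quick sketch:

```python
# Sanity-check the actor count for the depth=1/breadth=2 config,
# assuming the three fan-out levels shown in the cascade tree:
# 2 top-level spawners -> 4 d=0 spawners -> 8 leaf errorers.
breadth = 2
forked = breadth + breadth**2 + breadth**3  # 2 + 4 + 8
print(forked, forked + 1)  # 14 forked, 15 actors including the root
```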

Fresh-run results:

- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `os._exit(0)`. These are tree
  LEAVES (errorers) plus their direct parents
  (depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_
  main)`**: hit "about to trio.run" but never
  see "trio.run RETURNED NORMALLY". These are
  root + top-level spawners + one intermediate

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)
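
The wedge point is visible in a stdlib-only sketch of
the fork-child flow (function names mirror the
instrumented ones but are stand-ins; the real
`child_target` wraps `trio.run(async_main)`, stubbed
out here). Everything below a stuck `trio.run` is
unreachable, so the process never dies:

```python
import os

def diag(msg: str) -> None:
    # matches the style of the DIAG prints added in this pass
    print(f"DIAG[{os.getpid()}] {msg}", flush=True)

def child_target() -> int:
    # stand-in for _trio_main: the real version calls
    # trio.run(async_main) here; for the 5 stuck actors
    # that call simply never returns.
    diag("about to trio.run")
    rc = 0  # trio.run(async_main) stub
    diag("trio.run RETURNED NORMALLY")
    return rc

def worker_child_branch() -> None:
    # mirrors _worker's fork-child branch: if child_target()
    # never returns, os._exit(rc) is never reached.
    diag("pre child_target()")
    rc = child_target()
    diag(f"child_target RETURNED rc={rc}")
    diag(f"about to os._exit({rc})")
    os._exit(rc)

if __name__ == "__main__":
    pid = os.fork()
    if pid == 0:
        worker_child_branch()
    _, status = os.waitpid(pid, 0)
    print("child exit code:", os.waitstatus_to_exitcode(status))
```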

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
   in `root_tn` — but we cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken in
   this path.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-23 21:36:19 -04:00
parent ab86f7613d
commit 4d0555435b
1 changed file with 95 additions and 0 deletions


@@ -540,6 +540,101 @@ their owning actor's `Actor.cancel()` never runs.
The recvs are fine — they're just parked because
nothing is telling them to stop.
## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`
Yet another instrumentation pass — this time
printing at:
- `_worker` child branch: `pre child_target()` /
`child_target RETURNED rc=N` / `about to
os._exit(rc)`
- `_trio_main`: `about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`

**Fresh-run results** (`test_nested_multierrors[subint_forkserver]`,
depth=1/breadth=2, 1 root + 14 forked = 15 actors total):
- **9 processes completed the full flow**
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `about to os._exit(0)`. These
are the LEAVES of the tree (errorer actors) plus
their direct parents (depth-0 spawners). They
actually exit their processes.
- **5 processes are stuck INSIDE `trio.run(trio_main)`**
— they hit "about to trio.run" but NEVER see
"trio.run RETURNED NORMALLY". These are root +
top-level spawners + one intermediate.

**What this means:** `async_main` itself is the
deadlock holder, not the peer-channel loops.
Specifically, the outer `async with root_tn:` in
`async_main` never exits for the 5 stuck actors.
Their `trio.run` never returns → `_trio_main`
catch/finally never runs → `_worker` never reaches
`os._exit(rc)` → the PROCESS never dies → its
parent's `_ForkedProc.wait()` blocks → parent's
nursery hangs → parent's `async_main` hangs → ...
### The new precise question
**What task in the 5 stuck actors' `async_main`
never completes?** Candidates:
1. The shielded parent-chan `process_messages`
task in `root_tn` — but we explicitly cancel it
via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
However, `Actor.cancel()` only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken.
2. `await actor_nursery._join_procs.wait()` or
similar in the inline backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that
actually DID exit — but the pidfd_open watch
didn't fire for some reason (race between
pidfd_open and the child exiting?).

The most specific next probe: **add DIAG around
`_ForkedProc.wait()` enter/exit** to see whether
the pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
NEVER returns despite its child exiting, the
pidfd mechanism has a race bug under nested
forkserver.

Alternative probe: instrument `async_main`'s outer
nursery exits to find which nursery's `__aexit__`
is stuck, drilling down from `trio.run` to the
specific `async with` that never completes.
### Cascade summary (updated tree view)
```
ROOT (pytest)                      STUCK in trio.run
├── top_0 (spawner, d=1)           STUCK in trio.run
│   ├── spawner_0_d1_0 (d=0)       exited (os._exit 0)
│   │   ├── errorer_0_0            exited (os._exit 0)
│   │   └── errorer_0_1            exited (os._exit 0)
│   └── spawner_0_d1_1 (d=0)       exited (os._exit 0)
│       ├── errorer_0_2            exited (os._exit 0)
│       └── errorer_0_3            exited (os._exit 0)
└── top_1 (spawner, d=1)           STUCK in trio.run
    ├── spawner_1_d1_0 (d=0)       STUCK in trio.run (sibling race?)
    │   ├── errorer_1_0            exited
    │   └── errorer_1_1            exited
    └── spawner_1_d1_1 (d=0)       STUCK in trio.run
        ├── errorer_1_2            exited
        └── errorer_1_3            exited
```
Grandchildren (d=0 spawners) exit OR stick —
asymmetric. Not purely depth-determined. Some race
condition in nursery teardown when multiple
siblings error simultaneously.

## Stopgap (landed)

`test_nested_multierrors` skip-marked under