Narrow forkserver hang to `async_main` outer tn
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` /
`child_target RETURNED rc=N` /
`about to os._exit(rc)`) and `_trio_main`
boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`).
Test config: depth=1/breadth=2 = 1 root +
14 forked = 15 actors total.
Fresh-run results:
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `os._exit(0)`. These are tree
  LEAVES (errorers) plus their direct parents
  (depth-0 spawners) — they actually exit.
- **5 processes stuck INSIDE `trio.run(trio_main)`**:
  hit "about to trio.run" but never see "trio.run
  RETURNED NORMALLY". These are root + top-level
  spawners + one intermediate.
The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:
trio.run never returns
→ _trio_main finally never runs
→ _worker never reaches os._exit(rc)
→ process never dies
→ parent's _ForkedProc.wait() blocks
→ parent's nursery hangs
→ parent's async_main hangs
→ (recurse up)
The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
in `root_tn` — but we cancel it via
`_parent_chan_cs.cancel()` in `Actor.cancel()`,
which only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken in
this path.
2. `actor_nursery._join_procs.wait()` or similar
   in the inline backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
exit — but pidfd_open watch didn't fire (race
between `pidfd_open` and the child exiting?).
Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
branch: subint_forkserver_backend
parent: ab86f7613d
commit: 4d0555435b
@@ -540,6 +540,101 @@ their owning actor's `Actor.cancel()` never runs.

The recvs are fine — they're just parked because
nothing is telling them to stop.

## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`

Yet another instrumentation pass — this time
printing at:
- `_worker` child branch: `pre child_target()` /
  `child_target RETURNED rc=N` /
  `about to os._exit(rc)`
- `_trio_main`: `about to trio.run` /
  `trio.run RETURNED NORMALLY` / `FINALLY`
**Fresh-run results**
(`test_nested_multierrors[subint_forkserver]`,
depth=1/breadth=2, 1 root + 14 forked = 15 actors
total):

- **9 processes completed the full flow** —
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `about to os._exit(0)`. These
  are the LEAVES of the tree (errorer actors) plus
  their direct parents (depth-0 spawners). They
  actually exit their processes.
- **5 processes are stuck INSIDE
  `trio.run(trio_main)`** — they hit "about to
  trio.run" but NEVER see "trio.run RETURNED
  NORMALLY". These are root + top-level spawners +
  one intermediate.
**What this means:** `async_main` itself is the
deadlock holder, not the peer-channel loops.
Specifically, the outer `async with root_tn:` in
`async_main` never exits for the 5 stuck actors.
Their `trio.run` never returns → `_trio_main`
catch/finally never runs → `_worker` never reaches
`os._exit(rc)` → the PROCESS never dies → its
parent's `_ForkedProc.wait()` blocks → parent's
nursery hangs → parent's `async_main` hangs → ...
### The new precise question

**What task in the 5 stuck actors' `async_main`
never completes?** Candidates:

1. The shielded parent-chan `process_messages`
   task in `root_tn` — but we explicitly cancel it
   via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
   However, `Actor.cancel()` only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken.

2. `await actor_nursery._join_procs.wait()` or
   similar in the inline backend `*_proc` flow.

3. `_ForkedProc.wait()` on a grandchild that
   actually DID exit — but the pidfd_open watch
   didn't fire for some reason (race between
   pidfd_open and the child exiting?).
The most specific next probe: **add DIAG around
`_ForkedProc.wait()` enter/exit** to see whether
the pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
NEVER returns despite its child exiting, the
pidfd mechanism has a race bug under nested
forkserver.

Alternative probe: instrument `async_main`'s outer
nursery exits to find which nursery's `__aexit__`
is stuck, drilling down from `trio.run` to the
specific `async with` that never completes.
### Cascade summary (updated tree view)

```
ROOT (pytest)                         STUCK in trio.run
├── top_0 (spawner, d=1)              STUCK in trio.run
│   ├── spawner_0_d1_0 (d=0)          exited (os._exit 0)
│   │   ├── errorer_0_0               exited (os._exit 0)
│   │   └── errorer_0_1               exited (os._exit 0)
│   └── spawner_0_d1_1 (d=0)          exited (os._exit 0)
│       ├── errorer_0_2               exited (os._exit 0)
│       └── errorer_0_3               exited (os._exit 0)
└── top_1 (spawner, d=1)              STUCK in trio.run
    ├── spawner_1_d1_0 (d=0)          STUCK in trio.run (sibling race?)
    │   ├── errorer_1_0               exited
    │   └── errorer_1_1               exited
    └── spawner_1_d1_1 (d=0)          STUCK in trio.run
        ├── errorer_1_2               exited
        └── errorer_1_3               exited
```

Grandchildren (d=0 spawners) exit OR stick —
asymmetric. Not purely depth-determined. Some race
condition in nursery teardown when multiple
siblings error simultaneously.
## Stopgap (landed)

`test_nested_multierrors` skip-marked under