tractor


..
conc-anal	Narrow forkserver hang to `async_main` outer tn	2026-04-23 21:36:19 -04:00
prompt-io	Add CPython-level `subint_fork` workaround smoketest	2026-04-23 18:48:34 -04:00

Gud Boi 4d0555435b Narrow forkserver hang to `async_main` outer tn

Fourth diagnostic pass: instrument `_worker`'s
fork-child branch (`pre child_target()` /
`child_target RETURNED rc=N` /
`about to os._exit(rc)`) and `_trio_main`
boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
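
For illustration, the fork-child markers described above could be
emitted roughly like this. This is a minimal stdlib-only sketch, not
the actual `_worker` code: `diag`, `run_child` and `fork_and_run` are
hypothetical names, and only the marker strings come from this
commit message.

```python
import os
import sys
from typing import Callable


def diag(msg: str) -> None:
    # Hypothetical helper: pid-tagged marker, flushed immediately so it
    # survives the later os._exit() (which skips buffer flushing).
    print(f"[DIAG pid={os.getpid()}] {msg}", file=sys.stderr, flush=True)


def run_child(child_target: Callable[[], int]) -> int:
    """Run `child_target` between the two child-side markers."""
    diag("pre child_target()")
    rc = child_target()
    diag(f"child_target RETURNED rc={rc}")
    return rc


def fork_and_run(child_target: Callable[[], int]) -> int:
    """Fork, run `child_target` in the child, return its exit code."""
    pid = os.fork()
    if pid == 0:
        rc = 1  # reported if child_target raises before returning
        try:
            rc = run_child(child_target)
        finally:
            diag(f"about to os._exit({rc})")
            os._exit(rc)
    # Parent side: blocks until the child has really exited.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

A child that never prints `child_target RETURNED` is stuck inside its
target, which is how the stuck set below was identified.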

Fresh-run results:
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` →
  `child_target RETURNED rc=0` → `os._exit(0)`.
  These are tree LEAVES (errorers) plus their
  direct parents (depth-0 spawners); they
  actually exit.
- **5 processes stuck INSIDE
  `trio.run(trio_main)`**: they hit
  "about to trio.run" but never see
  "trio.run RETURNED NORMALLY". These are root +
  top-level spawners + one intermediate.

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
   in `root_tn` — but we cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken in
   this path.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly while others stick, even
though they started identically. The outcome is
therefore not purely depth-determined, which
points to a race in nursery teardown when
multiple siblings error simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

2026-04-23 21:36:19 -04:00
