Gud Boi 4d0555435b Narrow forkserver hang to `async_main` outer tn
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` /
`child_target RETURNED rc=N` /
`about to os._exit(rc)`) and the `_trio_main`
boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`).
Test config: depth=1/breadth=2 = 1 root +
14 forked = 15 actors total.
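
The probes above can be sketched as a tiny helper
(a hypothetical `diag()` and fork-child wrapper;
not the actual `tractor` code):

```python
import os
import sys
import time

def diag(msg: str) -> None:
    # Pid-tagged, immediately-flushed diagnostic line: if the
    # process later wedges, everything logged so far is already
    # out on stderr.
    sys.stderr.write(f"DIAG[{os.getpid()} +{time.monotonic():.3f}] {msg}\n")
    sys.stderr.flush()

def run_forked_child(child_target, *args) -> None:
    # Mirrors the fork-child probes described above: bracket the
    # target call and the final hard exit.
    diag("pre child_target()")
    rc = child_target(*args)
    diag(f"child_target RETURNED rc={rc}")
    diag(f"about to os._exit({rc})")
    os._exit(rc)
```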

Fresh-run results:
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` →
  `child_target RETURNED rc=0` → `os._exit(0)`.
  These are the tree LEAVES (errorers) plus
  their direct parents (depth-0 spawners) —
  they actually exit.
- **5 processes stuck INSIDE `trio.run(trio_main)`**:
  they hit "about to trio.run" but never log
  "trio.run RETURNED NORMALLY". These are the
  root + the top-level spawners + one
  intermediate spawner.

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. the shielded parent-chan `process_messages`
   task in `root_tn` — but we only cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — and
   that unwind never happens here. So the shielded
   task is never cancelled on this path, and it
   keeps `root_tn` open.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
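
The proposed probe could be a thin decorator
wrapped around `_ForkedProc.wait` (hypothetical
names; sketch only) — a wedged wait then shows an
ENTER line with no matching EXIT:

```python
import functools

def trace_wait(wait_fn):
    # Hypothetical DIAG wrapper for an async `wait()` method: an
    # actor stuck inside the wait prints ENTER but never EXIT.
    @functools.wraps(wait_fn)
    async def wrapped(self, *args, **kwargs):
        print(f"DIAG _ForkedProc.wait ENTER pid={self.pid}", flush=True)
        rc = await wait_fn(self, *args, **kwargs)
        print(f"DIAG _ForkedProc.wait EXIT pid={self.pid} rc={rc}", flush=True)
        return rc
    return wrapped
```

It could be armed from a debug branch or conftest
as `_ForkedProc.wait = trace_wait(_ForkedProc.wait)`.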

Asymmetry observed in the cascade tree: some
depth-0 spawners exit cleanly while others stick,
even though they started identically. So the hang
is not purely depth-determined — it looks like a
race in nursery teardown when multiple siblings
error simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 21:36:19 -04:00