tractor/ai
Gud Boi ab86f7613d Refine `subint_forkserver` cancel-cascade diag
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.

Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
  retested explicitly — `test_nested_multierrors`
  hangs identically with and without `-s`. The
  earlier observation was likely a competing
  pytest process I had running in another session
  holding registry state
- **stuck peer-chan recv that cancel can't
  break**: pivot from the prior pass. With
  `handle_stream_from_peer` instrumented at ENTER
  / `except trio.Cancelled:` / finally: 40
  ENTERs, ZERO `trio.Cancelled` hits. Cancel never
  reaches those tasks at all — the recvs are
  fine, nothing is telling them to stop

Actual deadlock shape: multi-level mutual wait.

    root              blocks on spawner.wait()
      spawner         blocks on grandchild.wait()
        grandchild    blocks on errorer.wait()
          errorer     Actor.cancel() ran, but proc
                      never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.

Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
   shielded loop's recv is stuck in a way cancel
   still can't break
2. `async_main`'s post-cancel unwind has other
   tasks in `root_tn` awaiting something that
   never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
   `_child_target` never returns

Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
   confirm whether `child_target()` returns /
   `os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
   which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
   equivalent level to spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 21:23:11 -04:00
..
conc-anal Refine `subint_forkserver` cancel-cascade diag 2026-04-23 21:23:11 -04:00
prompt-io Add CPython-level `subint_fork` workaround smoketest 2026-04-23 18:48:34 -04:00