Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested-cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Details from the DIAGDEBUG instrumentation pass:
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded and
they're inside the `service_tn` scope, so a standard cancel should
propagate through them; some fork-config-specific divergence keeps
them alive. The doc lists three follow-up experiments (stackscope
dump, side-by-side `trio_proc` comparison, audit of the
`tractor/ipc/_server.py:448` `except trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
parent
57935804e2
commit
fe89169f1c
@@ -306,13 +306,103 @@ root's `open_nursery` receives the
`BaseExceptionGroup` containing the `AssertionError`
from the errorer and unwinds cleanly.

## Update — 2026-04-23: partial fix landed, deeper layer surfaced

Three improvements landed as separate commits in the
`subint_forkserver_backend` branch (see `git log`):

1. **`_close_inherited_fds()` in fork-child prelude**
   (`tractor/spawn/_subint_forkserver.py`). POSIX
   close-fds-equivalent enumeration via `/proc/self/fd`
   (or `RLIMIT_NOFILE` fallback), keeping only stdio.
   This is fix-direction (1) from the list above — we
   went with the blunt form rather than the targeted
   enum-via-`actor.ipc_server` form; it turns out the
   aggressive close is safe because every inheritable
   resource the fresh child needs (IPC-channel socket,
   etc.) is opened AFTER the fork anyway.
2. **`_ForkedProc.wait()` via `os.pidfd_open()` +
   `trio.lowlevel.wait_readable()`** — matches the
   `trio.Process.wait` / `mp.Process.sentinel` pattern
   used by `trio_proc` and `proc_waiter`. Gives us a
   fully trio-cancellable child-wait (the prior impl
   blocked a cache thread on a sync `os.waitpid` that
   was NOT trio-cancellable due to
   `abandon_on_cancel=False`).
3. **`_parent_chan_cs` wiring** in
   `tractor/runtime/_runtime.py`: capture the shielded
   `loop_cs` for the parent-channel `process_messages`
   task in `async_main`; explicitly cancel it in
   `Actor.cancel()` teardown. This breaks the shield
   during teardown so the parent-chan loop exits when
   cancel is issued, instead of parking on a
   parent-socket EOF that might never arrive under fork
   semantics.

**Concrete wins from (1):** the sibling
`subint_forkserver_orphan_sigint_hang_issue.md` class
is **now fixed** — `test_orphaned_subactor_sigint_cleanup_DRAFT`
went from strict-xfail to pass. The xfail mark was
removed; the test remains as a regression guard.
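A minimal sketch of the fix-(1) close-fds prelude (hypothetical; the real `_close_inherited_fds()` in `tractor/spawn/_subint_forkserver.py` may differ in detail):

```python
import os
import resource

def _close_inherited_fds(keep=(0, 1, 2)) -> None:
    # Enumerate open fds via /proc/self/fd when available (Linux);
    # else fall back to scanning every fd number up to the
    # RLIMIT_NOFILE soft limit (blunt, but only runs once at fork).
    try:
        fds = [int(name) for name in os.listdir('/proc/self/fd')]
    except FileNotFoundError:
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        fds = list(range(soft))
    for fd in fds:
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            pass  # already closed (e.g. the listdir fd itself)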

**test_nested_multierrors STILL hangs** though.

### Updated diagnosis (narrowed)

DIAGDEBUG instrumentation of `process_messages`
ENTER/EXIT pairs + `_parent_chan_cs.cancel()` call
sites showed (captured during a 20s-timeout repro):

- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck.
- **All 40 `shield=True` ENTERs matched EXIT** — every
  shielded parent-chan loop exits cleanly. The
  `_parent_chan_cs` wiring works as intended.
- **The 5 stuck loops are all `shield=False`** —
  peer-channel handlers (inbound connections handled by
  `handle_stream_from_peer` in `stream_handler_tn`).
- After our `_parent_chan_cs.cancel()` fires, NEW
  shielded `process_messages` loops start (on the
  session reg_addr port — probably discovery-layer
  reconnection attempts). These don't block teardown
  (they all exit) but indicate the cancel cascade has
  more moving parts than expected.

### Remaining unknown

Why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're in
`stream_handler_tn`, which IS `service_tn` in the
current configuration (`open_ipc_server(parent_tn=
service_tn, stream_handler_tn=service_tn)`). A
standard nursery-scope cancel should propagate through
them — no shield, no special handler. Something
specific to the fork-spawned configuration keeps them
alive.

Candidate follow-up experiments:

- Dump the trio task tree at the hang point (via
  `stackscope` or direct trio introspection) to see
  what each stuck loop is awaiting. `chan.__anext__`
  on a socket recv? An inner lock? A shielded sub-task?
- Compare peer-channel handler lifecycle under
  `trio_proc` vs `subint_forkserver` with equivalent
  logging to spot the divergence.
- Investigate whether the peer handler is caught in
  the `except trio.Cancelled:` path at
  `tractor/ipc/_server.py:448` that re-raises — but a
  re-raise means it should still exit, unless
  something higher up swallows it.

## Stopgap (landed)

Until the fix lands, `test_nested_multierrors` +
related multi-level-spawn tests can be skip-marked
under `subint_forkserver` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason='...')`, cross-referenced to this doc. The mark
should be dropped once the peer-channel-loop exit
issue is fixed.

## References

@@ -452,6 +452,19 @@ async def spawn_and_error(
    await nursery.run_in_actor(*args, **kwargs)


@pytest.mark.skipon_spawn_backend(
    'subint_forkserver',
    reason=(
        'Multi-level fork-spawn cancel cascade hang — '
        'peer-channel `process_messages` loops do not '
        'exit on `service_tn.cancel_scope.cancel()`. '
        'See `ai/conc-anal/'
        'subint_forkserver_test_cancellation_leak_issue.md` '
        'for the full diagnosis + candidate fix directions. '
        'Drop this mark once the peer-chan-loop exit issue '
        'is closed.'
    ),
)
@pytest.mark.timeout(
    10,
    method='thread',