From 506617c695698fb4928560b0996d9605fd00fd35 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 16:44:15 -0400
Subject: [PATCH] Skip-mark + narrow `subint_forkserver` cancel hang
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test matrix while
   the remaining bug is being chased. The reason string
   cross-refs the conc-anal doc for full context.
2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`)
   with the empirical state after the three nested-cancel
   fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd
   wait + `57935804` parent-chan shield break) landed,
   narrowing the remaining hang from "everything broken"
   to "peer-channel loops don't exit on `service_tn`
   cancel".

Deats from the DIAGDEBUG instrumentation pass:

- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck.
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804` works
  as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-channel
  handlers in `handle_stream_from_peer` (inbound
  connections handled by `stream_handler_tn`, which IS
  `service_tn` in the current config).
- after `_parent_chan_cs.cancel()` fires, NEW shielded
  loops appear on the session reg_addr port — probably
  discovery-layer reconnection; doesn't block teardown but
  indicates the cascade has more moving parts than
  expected.

The remaining unknown: why don't the 5 peer-channel loops
exit when `service_tn.cancel_scope.cancel()` fires?
They're not shielded and they're inside the `service_tn`
scope, so a standard cancel should propagate through. Some
fork-config-specific divergence keeps them alive.
Doc lists three follow-up experiments (stackscope dump,
side-by-side `trio_proc` comparison, audit of the
`tractor/ipc/_server.py:448` `except trio.Cancelled:`
path).

(this patch was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 98 ++++++++++++++++++-
 tests/test_cancellation.py                    | 13 +++
 2 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index 4ca722cc..8762bba7 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -306,13 +306,103 @@
 root's `open_nursery` receives the `BaseExceptionGroup`
 containing the `AssertionError` from the errorer and
 unwinds cleanly.
 
+## Update — 2026-04-23: partial fix landed, deeper layer surfaced
+
+Three improvements landed as separate commits in the
+`subint_forkserver_backend` branch (see `git log`):
+
+1. **`_close_inherited_fds()` in fork-child prelude**
+   (`tractor/spawn/_subint_forkserver.py`). POSIX
+   close-fds-equivalent enumeration via `/proc/self/fd`
+   (or `RLIMIT_NOFILE` fallback), keep only stdio. This
+   is fix-direction (1) from the list above — went with
+   the blunt form rather than the targeted
+   enum-via-`actor.ipc_server` form; turns out the
+   aggressive close is safe because every inheritable
+   resource the fresh child needs (IPC-channel socket,
+   etc.) is opened AFTER the fork anyway.
+2. **`_ForkedProc.wait()` via `os.pidfd_open()` +
+   `trio.lowlevel.wait_readable()`** — matches the
+   `trio.Process.wait` / `mp.Process.sentinel` pattern
+   used by `trio_proc` and `proc_waiter`. Gives us fully
+   trio-cancellable child-wait (prior impl blocked a
+   cache thread on a sync `os.waitpid` that was NOT
+   trio-cancellable due to `abandon_on_cancel=False`).
+3. **`_parent_chan_cs` wiring** in
+   `tractor/runtime/_runtime.py`: capture the shielded
+   `loop_cs` for the parent-channel `process_messages`
+   task in `async_main`; explicitly cancel it in
+   `Actor.cancel()` teardown. This breaks the shield
+   during teardown so the parent-chan loop exits when
+   cancel is issued, instead of parking on a
+   parent-socket EOF that might never arrive under fork
+   semantics.
+
+**Concrete wins from (1):** the sibling
+`subint_forkserver_orphan_sigint_hang_issue.md` class is
+**now fixed** —
+`test_orphaned_subactor_sigint_cleanup_DRAFT` went from
+strict-xfail to pass. The xfail mark was removed; the
+test remains as a regression guard.
+
+**test_nested_multierrors STILL hangs** though.
+
+### Updated diagnosis (narrowed)
+
+DIAGDEBUG instrumentation of `process_messages`
+ENTER/EXIT pairs + `_parent_chan_cs.cancel()` call sites
+showed (captured during a 20s-timeout repro):
+
+- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck.
+- **All 40 `shield=True` ENTERs matched EXIT** — every
+  shielded parent-chan loop exits cleanly. The
+  `_parent_chan_cs` wiring works as intended.
+- **The 5 stuck loops are all `shield=False`** —
+  peer-channel handlers (inbound connections handled by
+  `handle_stream_from_peer` in `stream_handler_tn`).
+- After our `_parent_chan_cs.cancel()` fires, NEW
+  shielded `process_messages` loops start (on the session
+  reg_addr port — probably discovery-layer reconnection
+  attempts). These don't block teardown (they all exit)
+  but indicate the cancel cascade has more moving parts
+  than expected.
+
+### Remaining unknown
+
+Why don't the 5 peer-channel loops exit when
+`service_tn.cancel_scope.cancel()` fires? They're in
+`stream_handler_tn`, which IS `service_tn` in the current
+configuration (`open_ipc_server(parent_tn=service_tn,
+stream_handler_tn=service_tn)`). A standard
+nursery-scope-cancel should propagate through them — no
+shield, no special handler. Something specific to the
+fork-spawned configuration keeps them alive.
+
+Candidate follow-up experiments:
+
+- Dump the trio task tree at the hang point (via
+  `stackscope` or direct trio introspection) to see what
+  each stuck loop is awaiting. `chan.__anext__` on a
+  socket recv? An inner lock? A shielded sub-task?
+- Compare peer-channel handler lifecycle under
+  `trio_proc` vs `subint_forkserver` with equivalent
+  logging to spot the divergence.
+- Investigate whether the peer handler is caught in the
+  `except trio.Cancelled:` path at
+  `tractor/ipc/_server.py:448` that re-raises — but a
+  re-raise means it should still exit, unless something
+  higher up swallows it.
+
 ## Stopgap (landed)
 
-Until the fix lands, `test_nested_multierrors` +
-related multi-level-spawn tests can be skip-marked
-under `subint_forkserver` via
+`test_nested_multierrors` is skip-marked under
+`subint_forkserver` via
 `@pytest.mark.skipon_spawn_backend('subint_forkserver',
-reason='...')`. Cross-ref this doc.
+reason='...')`, cross-referenced to this doc. Mark
+should be dropped once the peer-channel-loop exit issue
+is fixed.
 
 ## References
 
diff --git a/tests/test_cancellation.py b/tests/test_cancellation.py
index 17f19723..3776b3e3 100644
--- a/tests/test_cancellation.py
+++ b/tests/test_cancellation.py
@@ -452,6 +452,19 @@ async def spawn_and_error(
         await nursery.run_in_actor(*args, **kwargs)
 
 
+@pytest.mark.skipon_spawn_backend(
+    'subint_forkserver',
+    reason=(
+        'Multi-level fork-spawn cancel cascade hang — '
+        'peer-channel `process_messages` loops do not '
+        'exit on `service_tn.cancel_scope.cancel()`. '
+        'See `ai/conc-anal/'
+        'subint_forkserver_test_cancellation_leak_issue.md` '
+        'for the full diagnosis + candidate fix directions. '
+        'Drop this mark once the peer-chan-loop exit issue '
+        'is closed.'
+    ),
+)
 @pytest.mark.timeout(
     10,
     method='thread',
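The fd-scrub strategy from fix (1) in the doc update above
— enumerate `/proc/self/fd`, fall back to a brute
`RLIMIT_NOFILE` sweep, keep only stdio — amounts to
roughly the following. A hedged stdlib approximation, NOT
the landed `_close_inherited_fds()`; the `keep` parameter
is illustrative:

```python
import os
import resource

def close_inherited_fds(
    keep: tuple[int, ...] = (0, 1, 2),  # stdio by default
) -> None:
    # enumerate currently-open fds via /proc/self/fd
    # (Linux); fall back to sweeping every possible fd up
    # to RLIMIT_NOFILE where procfs isn't available.
    try:
        fds = [int(name) for name in os.listdir('/proc/self/fd')]
    except FileNotFoundError:
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        fds = list(range(soft))
    for fd in fds:
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            # already closed — e.g. the transient dirfd
            # opened by the listdir() call itself
            pass
```

The "blunt form" tradeoff described in the patch applies:
this is only safe when everything the child needs is
(re)opened after the fork.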