From 35796ec8ae4a6db3b892adc97fc8733aa8938759 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Mon, 20 Apr 2026 15:28:00 -0400
Subject: [PATCH] Doc `subint` backend hang classes + arm `dump_on_hang`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Classify and write up the two distinct hang modes hit during Phase B
subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread +
  shared GIL → main trio loop starves → signal-wakeup-fd pipe fills →
  `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks
  on an orphaned IPC channel after subint teardown — no clean EOF
  delivered to the waiting receive. Ctrl-C-able (main loop iterates
  fine); OUR bug to fix. Candidate fix: explicit parent-side channel
  abort in `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump.
  Kept un-skipped so the dump file is inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
---
 .../subint_cancel_delivery_hang_issue.md      | 161 ++++++++++++++
 .../subint_sigint_starvation_issue.md         | 205 ++++++++++++++++++
 tests/discovery/test_registrar.py             |  52 ++++-
 tests/test_subint_cancellation.py             |  26 +++
 4 files changed, 443 insertions(+), 1 deletion(-)
 create mode 100644 ai/conc-anal/subint_cancel_delivery_hang_issue.md
 create mode 100644 ai/conc-anal/subint_sigint_starvation_issue.md

diff --git a/ai/conc-anal/subint_cancel_delivery_hang_issue.md b/ai/conc-anal/subint_cancel_delivery_hang_issue.md
new file mode 100644
index 00000000..4c3112ed
--- /dev/null
+++ b/ai/conc-anal/subint_cancel_delivery_hang_issue.md
@@ -0,0 +1,161 @@
+# `subint` backend: parent trio loop parks after subint teardown (Ctrl-C works; not a CPython-level issue)
+
+Follow-up to the Phase B subint spawn-backend PR (see
+`tractor.spawn._subint`, issue #379). Distinct from the
+`subint_sigint_starvation_issue.md` (SIGINT-unresponsive
+starvation hang): this one is **Ctrl-C-able**, which means
+it's *not* the shared-GIL-hostage class and is ours to fix
+from inside tractor rather than waiting on upstream CPython
+/ msgspec progress.
+
+## TL;DR
+
+After a stuck-subint subactor is torn down via the
+hard-kill path, a parent-side trio task parks on an
+*orphaned resource* (most likely a `chan.recv()` /
+`process_messages` loop on the now-dead subint's IPC
+channel) and waits forever for bytes that can't arrive —
+because the channel was torn down without emitting a clean
+EOF/`BrokenResourceError` to the waiting receiver.
+
+Unlike `subint_sigint_starvation_issue.md`, the main trio
+loop **is** iterating normally — SIGINT delivers cleanly
+and the test unhangs.
But absent Ctrl-C, the test suite
+wedges indefinitely.
+
+## Symptom
+
+Running `test_subint_non_checkpointing_child` under
+`--spawn-backend=subint` (in
+`tests/test_subint_cancellation.py`):
+
+1. Test spawns a subactor whose main task runs
+   `threading.Event.wait(1.0)` in a loop — releases the
+   GIL but never inserts a trio checkpoint.
+2. Parent does `an.cancel_scope.cancel()`. Our
+   `subint_proc` cancel path fires: soft-kill sends
+   `Portal.cancel_actor()` over the live IPC channel →
+   subint's trio loop *should* process the cancel msg on
+   its IPC dispatcher task (since the GIL releases are
+   happening).
+3. Expected: subint's `trio.run()` unwinds, driver thread
+   exits naturally, parent returns.
+4. Actual: parent `trio.run()` never completes. Test
+   hangs past its `trio.fail_after()` deadline.
+
+## Evidence
+
+### `strace` on the hung pytest process during SIGINT
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(17, "\2", 1) = 1
+```
+
+Contrast with the SIGINT-starvation hang (see
+`subint_sigint_starvation_issue.md`) where that same
+`write()` returned `EAGAIN`. Here the SIGINT byte is
+written successfully → Python's signal handler pipe is
+being drained → main trio loop **is** iterating → SIGINT
+gets turned into `trio.Cancelled` → the test unhangs (if
+the operator happens to be there to hit Ctrl-C).
+
+### Stack dump (via `tractor.devx.dump_on_hang`)
+
+Single main thread visible, parked in
+`trio._core._io_epoll.get_events` inside `trio.run` at the
+test's `trio.run(...)` call site. No subint driver thread
+(subint was destroyed successfully — this is *after* the
+hard-kill path, not during it).
+
+## Root cause hypothesis
+
+Most consistent with the evidence: a parent-side trio
+task is awaiting a `chan.recv()` / `process_messages` loop
+on the dead subint's IPC channel. The sequence:
+
+1. Soft-kill in `subint_proc` sends `Portal.cancel_actor()`
+   over the channel.
The subint's trio dispatcher *may* or
+   may not have processed the cancel msg before the subint
+   was destroyed — timing-dependent.
+2. Hard-kill timeout fires (because the subint's main
+   task was in `threading.Event.wait()` with no trio
+   checkpoint — cancel-msg processing couldn't race the
+   timeout).
+3. Driver thread abandoned, `_interpreters.destroy()`
+   runs. Subint is gone.
+4. But the parent-side trio task holding a
+   `chan.recv()` / `process_messages` loop against that
+   channel was **not** explicitly cancelled. The channel's
+   underlying socket got torn down, but without a clean
+   EOF delivered to the waiting recv, the task parks
+   forever on `trio.lowlevel.wait_readable` (or similar).
+
+This matches the "main loop fine, task parked on
+orphaned I/O" signature.
+
+## Why this is ours to fix (not CPython's)
+
+- Main trio loop iterates normally → GIL isn't starved.
+- SIGINT is deliverable → not a signal-pipe-full /
+  wakeup-fd contention scenario.
+- The hang is in *our* supervision code, specifically in
+  how `subint_proc` tears down its side of the IPC when
+  the subint is abandoned/destroyed.
+
+## Possible fix directions
+
+1. **Explicit parent-side channel abort on subint
+   abandon.** In `subint_proc`'s teardown block, after the
+   hard-kill timeout fires, explicitly close the parent's
+   end of the IPC channel to the subint. Any waiting
+   `chan.recv()` / `process_messages` task sees
+   `BrokenResourceError` (or `ClosedResourceError`) and
+   unwinds.
+2. **Cancel parent-side RPC tasks tied to the dead
+   subint's channel.** The `Actor._rpc_tasks` / nursery
+   machinery should have a handle on any
+   `process_messages` loops bound to a specific peer
+   channel. Iterate those and cancel explicitly.
+3. **Bound the top-level `await actor_nursery
+   ._join_procs.wait()` shield in `subint_proc`** (same
+   pattern as the other bounded shields the hard-kill
+   patch added).
If the nursery never sets `_join_procs`
+   because a child task is parked, the bound would at
+   least let the teardown proceed.
+
+Of these, (1) is the most surgical and directly addresses
+the root cause. (2) is a defense-in-depth companion. (3)
+is a band-aid but cheap to add.
+
+## Current workaround
+
+None in-tree. The test's `trio.fail_after()` bound
+currently fires and raises `TooSlowError`, so the test
+visibly **fails** rather than hangs — which is
+intentional (an unbounded cancellation-audit test would
+defeat itself). But in interactive test runs the operator
+has to hit Ctrl-C to move past the parked state before
+pytest reports the failure.
+
+## Reproducer
+
+```
+./py314/bin/python -m pytest \
+  tests/test_subint_cancellation.py::test_subint_non_checkpointing_child \
+  --spawn-backend=subint --tb=short --no-header -v
+```
+
+Expected: hangs until `trio.fail_after(15)` fires, or
+Ctrl-C unwedges it manually.
+
+## References
+
+- `tractor.spawn._subint.subint_proc` — current subint
+  teardown code; see the `_HARD_KILL_TIMEOUT` bounded
+  shields + `daemon=True` driver-thread abandonment
+  (commit `b025c982`).
+- `ai/conc-anal/subint_sigint_starvation_issue.md` — the
+  sibling CPython-level hang (GIL-starvation,
+  SIGINT-unresponsive) which is **not** this issue.
+- Phase B tracking: issue #379.
diff --git a/ai/conc-anal/subint_sigint_starvation_issue.md b/ai/conc-anal/subint_sigint_starvation_issue.md
new file mode 100644
index 00000000..438b7b8a
--- /dev/null
+++ b/ai/conc-anal/subint_sigint_starvation_issue.md
@@ -0,0 +1,205 @@
+# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
+
+Follow-up to the Phase B subint spawn-backend PR (see
+`tractor.spawn._subint`, issue #379).
The hard-kill escape
+hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
+`daemon=True` driver-thread abandonment) handles *most*
+stuck-subint scenarios cleanly, but there's one class of
+hang that can't be fully escaped from within tractor: a
+still-running abandoned sub-interpreter can starve the
+**parent's** trio event loop to the point where **SIGINT is
+effectively dropped by the kernel ↔ Python boundary** —
+making the pytest process un-Ctrl-C-able.
+
+## Symptom
+
+Running `test_stale_entry_is_deleted[subint]` under
+`--spawn-backend=subint`:
+
+1. Test spawns a subactor (`transport_fails_actor`) which
+   kills its own IPC server and then
+   `trio.sleep_forever()`.
+2. Parent tries `Portal.cancel_actor()` → channel
+   disconnected → fast return.
+3. Nursery teardown triggers our `subint_proc` cancel path.
+   Portal-cancel fails (dead channel),
+   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
+   (`daemon=True`), `_interpreters.destroy(interp_id)`
+   raises `InterpreterError` (because the subint is still
+   running).
+4. Test appears to hang indefinitely at the *outer*
+   `async with tractor.open_nursery() as an:` exit.
+5. `Ctrl-C` at the terminal does nothing. The pytest
+   process is un-interruptable.
+
+## Evidence
+
+### `strace` on the hung pytest process
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
+rt_sigreturn({mask=[WINCH]}) = 140585542325792
+```
+
+Translated:
+
+- Kernel delivers `SIGINT` to pytest.
+- CPython's C-level signal handler fires and tries to
+  write the signal number byte (`0x02` = SIGINT) to fd 37
+  — the **Python signal-wakeup fd** (set via
+  `signal.set_wakeup_fd()`, which trio uses to wake its
+  event loop on signals).
+- Write returns `EAGAIN` — **the pipe is full**. Nothing
+  is draining it.
+- `rt_sigreturn` with the signal masked off — signal is
+  "handled" from the kernel's perspective but the actual
+  Python-level handler (and therefore trio's
+  `KeyboardInterrupt` delivery) never runs.
+
+### Stack dump (via `tractor.devx.dump_on_hang`)
+
+At 20s into the hang, only the **main thread** is visible:
+
+```
+Thread 0x...7fdca0191780 [python] (most recent call first):
+  File ".../trio/_core/_io_epoll.py", line 245 in get_events
+  File ".../trio/_core/_run.py", line 2415 in run
+  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
+  ...
+```
+
+No driver thread shows up. The abandoned-legacy-subint
+thread still exists from the OS's POV (it's still running
+inside `_interpreters.exec()` driving the subint's
+`trio.run()` on `trio.sleep_forever()`) but the **main
+interp's faulthandler can't see threads currently executing
+inside a sub-interpreter's tstate**. Concretely: the thread
+is alive, holding state we can't introspect from here.
+
+## Root cause analysis
+
+The most consistent explanation for both observations:
+
+1. **Legacy-config subinterpreters share the main GIL.**
+   PEP 734's public `concurrent.interpreters.create()`
+   defaults to `'isolated'` (per-interp GIL), but tractor
+   uses `_interpreters.create('legacy')` as a workaround
+   for C extensions that don't yet support PEP 684
+   (notably `msgspec`, see
+   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
+   Legacy-mode subints share process-global state
+   including the GIL.
+
+2. **Our abandoned subint thread never exits.** After our
+   hard-kill timeout, `driver_thread.join()` is abandoned
+   via `abandon_on_cancel=True` and the thread is
+   `daemon=True` so proc-exit won't block on it — but the
+   thread *itself* is still alive inside
+   `_interpreters.exec()`, driving a `trio.run()` that
+   will never return (the subint actor is in
+   `trio.sleep_forever()`).
+
+3. 
**`_interpreters.destroy()` cannot force-stop a running
+   subint.** It raises `InterpreterError` on any
+   still-running subinterpreter; there is no public
+   CPython API to force-destroy one.
+
+4. **Shared-GIL + non-terminating subint thread → main
+   trio loop starvation.** Under enough load (the subint's
+   trio event loop iterating in the background, IPC-layer
+   tasks still in the subint, etc.) the main trio event
+   loop can fail to iterate frequently enough to drain its
+   wakeup pipe. Once that pipe fills, `SIGINT` writes from
+   the C signal handler return `EAGAIN` and signals are
+   silently dropped — exactly what `strace` shows.
+
+The shielded
+`await actor_nursery._join_procs.wait()` at the top of
+`subint_proc` (inherited unchanged from the `trio_proc`
+pattern) is structurally involved too: if main trio *does*
+get a schedule slice, it'd find the `subint_proc` task
+parked on `_join_procs` under shield — which traps whatever
+`Cancelled` arrives. But that's a second-order effect; the
+signal-pipe-full condition is the primary "Ctrl-C doesn't
+work" cause.
+
+## Why we can't fix this from inside tractor
+
+- **No force-destroy API.** CPython provides neither a
+  `_interpreters.force_destroy()` nor a thread-
+  cancellation primitive (`pthread_cancel` is actively
+  discouraged and unavailable on Windows). A subint stuck
+  in pure-Python loops (or worse, C code that doesn't poll
+  for signals) is structurally unreachable from outside.
+- **Shared GIL is the root scheduling issue.** As long as
+  we're forced into legacy-mode subints for `msgspec`
+  compatibility, the abandoned-thread scenario is
+  fundamentally a process-global GIL-starvation window.
+- **`signal.set_wakeup_fd()` is process-global.** Even if
+  we wanted to put our own drainer on the wakeup pipe,
+  only one party owns it at a time.
+
+## Current workaround
+
+- **Fixture-side SIGINT loop on the `daemon` subproc** (in
+  this test's `daemon: subprocess.Popen` fixture in
+  `tests/conftest.py`).
The daemon dying closes its end of
+  the registry IPC, which unblocks a pending recv in main
+  trio's IPC-server task, which lets the event loop
+  iterate, which drains the wakeup pipe, which finally
+  delivers the test-harness SIGINT.
+- **Module-level skip on py3.13**
+  (`pytest.importorskip('concurrent.interpreters')`) — the
+  private `_interpreters` C module exists on 3.13 but the
+  multi-trio-task interaction hangs silently there
+  independently of this issue.
+
+## Path forward
+
+1. **Primary**: upstream `msgspec` PEP 684 adoption
+   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
+   Unlocks `concurrent.interpreters.create()` isolated
+   mode → per-interp GIL → abandoned subint threads no
+   longer starve the parent's main trio loop. At that
+   point we can flip `_subint.py` back to the public API
+   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
+   and drop the private `_interpreters` path.
+
+2. **Secondary**: watch CPython for a public
+   force-destroy primitive. If something like
+   `Interpreter.close(force=True)` lands, we can use it as
+   a hard-kill final stage and actually tear down
+   abandoned subints.
+
+3. **Harness-level**: document the fixture-side SIGINT
+   loop pattern as the "known workaround" for subint-
+   backend tests that can leave background state holding
+   the main event loop hostage.
+
+## References
+
+- PEP 734 (`concurrent.interpreters`):
+
+- PEP 684 (per-interpreter GIL):
+
+- `msgspec` PEP 684 tracker:
+
+- CPython `_interpretersmodule.c` source:
+
+- `tractor.spawn._subint` module docstring (in-tree
+  explanation of the legacy-mode choice and its
+  tradeoffs).
+
+## Reproducer
+
+```
+./py314/bin/python -m pytest \
+  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
+  --spawn-backend=subint \
+  --tb=short --no-header -v
+```
+
+Hangs indefinitely without the fixture-side SIGINT loop;
+with the loop, the test completes (albeit with the
+abandoned-thread warning in logs).
diff --git a/tests/discovery/test_registrar.py b/tests/discovery/test_registrar.py
index 60b2b10c..6f34b117 100644
--- a/tests/discovery/test_registrar.py
+++ b/tests/discovery/test_registrar.py
@@ -14,6 +14,7 @@ import psutil
 import pytest
 import subprocess
 import tractor
+from tractor.devx import dump_on_hang
 from tractor.trionics import collapse_eg
 from tractor._testing import tractor_test
 from tractor.discovery._addr import wrap_address
@@ -562,4 +563,53 @@ def test_stale_entry_is_deleted(
             await ptl.cancel_actor()
             await an.cancel()
 
-    trio.run(main)
+    # TODO, remove once the `[subint]` variant no longer hangs.
+    #
+    # Status (as of Phase B hard-kill landing):
+    #
+    # - `[trio]`/`[mp_*]` variants: completes normally; `dump_on_hang`
+    #   is a no-op safety net here.
+    #
+    # - `[subint]` variant: hangs indefinitely AND is un-Ctrl-C-able.
+    #   `strace -p ` while in the hang reveals a silently-
+    #   dropped SIGINT — the C signal handler tries to write the
+    #   signum byte to Python's signal-wakeup fd and gets `EAGAIN`,
+    #   meaning the pipe is full (nobody's draining it).
+    #
+    # Root-cause chain: our hard-kill in `spawn._subint` abandoned
+    # the driver OS-thread (which is `daemon=True`) after the soft-
+    # kill timeout, but the *sub-interpreter* inside that thread is
+    # still running `trio.run()` — `_interpreters.destroy()` can't
+    # force-stop a running subint (raises `InterpreterError`), and
+    # legacy-config subints share the main GIL. The abandoned subint
+    # starves the parent's trio event loop from iterating often
+    # enough to drain its wakeup pipe → SIGINT silently drops.
+    #
+    # This is structurally a CPython-level limitation: there's no
+    # public force-destroy primitive for a running subint. We
+    # escape on the harness side via a SIGINT-loop in the `daemon`
+    # fixture teardown (killing the bg registrar subproc closes its
+    # end of the IPC, which eventually unblocks a recv in main trio,
+    # which lets the loop drain the wakeup pipe).
Long-term fix path:
+    # msgspec PEP 684 support (jcrist/msgspec#563) → isolated-mode
+    # subints with per-interp GIL.
+    #
+    # Full analysis:
+    # `ai/conc-anal/subint_sigint_starvation_issue.md`
+    #
+    # See also the *sibling* hang class documented in
+    # `ai/conc-anal/subint_cancel_delivery_hang_issue.md` — same
+    # subint backend, different root cause (Ctrl-C-able hang, main
+    # trio loop iterating fine; ours to fix, not CPython's).
+    # Reproduced by `tests/test_subint_cancellation.py
+    # ::test_subint_non_checkpointing_child`.
+    #
+    # Kept here (and not behind a `pytestmark.skip`) so we can still
+    # inspect the dump file if the hang ever returns after a refactor.
+    # `pytest`'s stderr capture eats `faulthandler` output otherwise,
+    # so we route `dump_on_hang` to a file.
+    with dump_on_hang(
+        seconds=20,
+        path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
+    ):
+        trio.run(main)
diff --git a/tests/test_subint_cancellation.py b/tests/test_subint_cancellation.py
index 67523b8c..04b6cc9e 100644
--- a/tests/test_subint_cancellation.py
+++ b/tests/test_subint_cancellation.py
@@ -182,6 +182,32 @@ def test_subint_non_checkpointing_child(
     - ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait)
     - margin
 
+    KNOWN ISSUE (Ctrl-C-able hang):
+    -------------------------------
+    This test currently hangs past the hard-kill timeout for
+    reasons unrelated to the subint teardown itself — after
+    the subint is destroyed, a parent-side trio task appears
+    to park on an orphaned IPC channel (no clean EOF
+    delivered to a waiting receive). Unlike the
+    SIGINT-starvation sibling case in
+    `test_stale_entry_is_deleted`, this hang IS Ctrl-C-able
+    (`strace` shows SIGINT wakeup-fd `write() = 1`, not
+    `EAGAIN`) — i.e. the main trio loop is still iterating
+    normally. That makes this *our* bug to fix, not a
+    CPython-level limitation.
+
+    See `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
+    for the full analysis + candidate fix directions
+    (explicit parent-side channel abort in `subint_proc`
+    teardown being the most likely surgical fix).
+
+    The sibling `ai/conc-anal/subint_sigint_starvation_issue.md`
+    documents the *other* hang class (abandoned-legacy-subint
+    thread + shared-GIL starvation → signal-wakeup-fd pipe
+    fills → SIGINT silently dropped) — that one is
+    structurally blocked on msgspec PEP 684 adoption and is
+    NOT what this test is hitting.
+
     '''
     deadline: float = 15.0
     with dump_on_hang(