Doc `subint` backend hang classes + arm `dump_on_hang`

Classify and write up the two distinct hang modes hit during Phase B
subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread +
  shared GIL → main trio loop starves → signal-wakeup-fd pipe fills →
  `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown; no clean EOF is
  delivered to the waiting receive. Ctrl-C-able (main loop iterates
  fine); OUR bug to fix. Candidate fix: explicit parent-side channel
  abort in `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un-skipped so the dump file is inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 15:28:00 -04:00 · 4a3254583b
parent 2ed5e6a6e8
commit 4a3254583b
4 changed files with 443 additions and 1 deletions
--- a/ai/conc-anal/subint_cancel_delivery_hang_issue.md
+++ b/ai/conc-anal/subint_cancel_delivery_hang_issue.md
@@ -0,0 +1,161 @@
 # `subint` backend: parent trio loop parks after subint teardown (Ctrl-C works; not a CPython-level issue)
 Follow-up to the Phase B subint spawn-backend PR (see
 `tractor.spawn._subint`, issue #379). Distinct from the
 `subint_sigint_starvation_issue.md` (SIGINT-unresponsive
 starvation hang): this one is **Ctrl-C-able**, which means
 it's *not* the shared-GIL-hostage class and is ours to fix
 from inside tractor rather than waiting on upstream CPython
 / msgspec progress.
 ## TL;DR
 After a stuck-subint subactor is torn down via the
 hard-kill path, a parent-side trio task parks on an
 *orphaned resource* (most likely a `chan.recv()` /
 `process_messages` loop on the now-dead subint's IPC
 channel) and waits forever for bytes that can't arrive —
 because the channel was torn down without emitting a clean
 EOF/`BrokenResourceError` to the waiting receiver.
 Unlike `subint_sigint_starvation_issue.md`, the main trio
 loop **is** iterating normally — SIGINT delivers cleanly
 and the test unhangs. But absent Ctrl-C, the test suite
 wedges indefinitely.
 ## Symptom
 Running `test_subint_non_checkpointing_child` under
 `--spawn-backend=subint` (in
 `tests/test_subint_cancellation.py`):
 1. Test spawns a subactor whose main task runs
   `threading.Event.wait(1.0)` in a loop — releases the
   GIL but never inserts a trio checkpoint.
 2. Parent does `an.cancel_scope.cancel()`. Our
   `subint_proc` cancel path fires: soft-kill sends
   `Portal.cancel_actor()` over the live IPC channel →
   subint's trio loop *should* process the cancel msg on
   its IPC dispatcher task (since the GIL releases are
   happening).
 3. Expected: subint's `trio.run()` unwinds, driver thread
   exits naturally, parent returns.
 4. Actual: parent `trio.run()` never completes. Test
   hangs past its `trio.fail_after()` deadline.
 ## Evidence
 ### `strace` on the hung pytest process during SIGINT
 ```
 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 write(17, "\2", 1)                      = 1
 ```
 Contrast with the SIGINT-starvation hang (see
 `subint_sigint_starvation_issue.md`) where that same
 `write()` returned `EAGAIN`. Here the SIGINT byte is
 written successfully → Python's signal handler pipe is
 being drained → main trio loop **is** iterating → SIGINT
 gets turned into `trio.Cancelled` → the test unhangs (if
 the operator happens to be there to hit Ctrl-C).
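The healthy `write(17, "\2", 1) = 1` path can be reproduced in isolation. Below is a minimal stdlib-only sketch (not tractor or trio code) of the wakeup-fd mechanism: the C-level handler writes the signal number as a single byte to whatever fd was registered via `signal.set_wakeup_fd()`. The helper name `sigint_wakeup_byte` and the use of a `socket.socketpair()` as the wakeup channel are assumptions made for illustration, and the helper has to run on the main thread since that is the only place `signal.set_wakeup_fd()` is allowed.

```python
import signal
import socket


def sigint_wakeup_byte(timeout: float = 2.0) -> bytes:
    """Raise SIGINT at ourselves and return the byte written to the wakeup fd."""
    reader, writer = socket.socketpair()
    reader.settimeout(timeout)
    # The wakeup fd must be non-blocking: the C handler never waits on it.
    writer.setblocking(False)

    # Swap in a no-op Python-level handler so SIGINT does not raise
    # KeyboardInterrupt while we look at the C-level write.
    old_handler = signal.signal(signal.SIGINT, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(writer.fileno())
    try:
        signal.raise_signal(signal.SIGINT)
        # An iterating event loop would drain this byte on each wakeup.
        return reader.recv(1)
    finally:
        # Restore the previous signal configuration before cleaning up.
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGINT, old_handler)
        reader.close()
        writer.close()
```

The returned byte is the signal number, which is the same `"\2"` payload that `strace` shows for the Ctrl-C-able hang.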
 ### Stack dump (via `tractor.devx.dump_on_hang`)
 Single main thread visible, parked in
 `trio._core._io_epoll.get_events` inside `trio.run` at the
 test's `trio.run(...)` call site. No subint driver thread
 (subint was destroyed successfully — this is *after* the
 hard-kill path, not during it).
 ## Root cause hypothesis
 Most consistent with the evidence: a parent-side trio
 task is awaiting a `chan.recv()` / `process_messages` loop
 on the dead subint's IPC channel. The sequence:
 1. Soft-kill in `subint_proc` sends `Portal.cancel_actor()`
   over the channel. The subint's trio dispatcher *may* or
   may not have processed the cancel msg before the subint
   was destroyed — timing-dependent.
 2. Hard-kill timeout fires (because the subint's main
   task was in `threading.Event.wait()` with no trio
   checkpoint — cancel-msg processing couldn't race the
   timeout).
 3. Driver thread abandoned, `_interpreters.destroy()`
   runs. Subint is gone.
 4. But the parent-side trio task holding a
   `chan.recv()` / `process_messages` loop against that
   channel was **not** explicitly cancelled. The channel's
   underlying socket got torn down, but without a clean
   EOF delivered to the waiting recv, the task parks
   forever on `trio.lowlevel.wait_readable` (or similar).
 This matches the "main loop fine, task parked on
 orphaned I/O" signature.
 ## Why this is ours to fix (not CPython's)
 - Main trio loop iterates normally → GIL isn't starved.
 - SIGINT is deliverable → not a signal-pipe-full /
  wakeup-fd contention scenario.
 - The hang is in *our* supervision code, specifically in
  how `subint_proc` tears down its side of the IPC when
  the subint is abandoned/destroyed.
 ## Possible fix directions
 1. **Explicit parent-side channel abort on subint
   abandon.** In `subint_proc`'s teardown block, after the
   hard-kill timeout fires, explicitly close the parent's
   end of the IPC channel to the subint. Any waiting
   `chan.recv()` / `process_messages` task sees
   `BrokenResourceError` (or `ClosedResourceError`) and
   unwinds.
 2. **Cancel parent-side RPC tasks tied to the dead
   subint's channel.** The `Actor._rpc_tasks` / nursery
   machinery should have a handle on any
   `process_messages` loops bound to a specific peer
   channel. Iterate those and cancel explicitly.
 3. **Bound the top-level
   `await actor_nursery._join_procs.wait()` shield in `subint_proc`** (same
   pattern as the other bounded shields the hard-kill
   patch added). If the nursery never sets `_join_procs`
   because a child task is parked, the bound would at
   least let the teardown proceed.
 Of these, (1) is the most surgical and directly addresses
 the root cause. (2) is a defense-in-depth companion. (3)
 is a band-aid but cheap to add.
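The idea behind (1) can be sketched with plain sockets and a thread. This is a stdlib analogy under stated assumptions, not tractor's channel API or trio's stream semantics: the names `parked_receiver` and `abort_unparks_receiver` are hypothetical, the "peer" is simply a socket that never sends and never closes, and the receiver here observes EOF where a trio task would more likely see one of the resource errors named above.

```python
import socket
import threading


def parked_receiver(sock: socket.socket, outcome: list[str]) -> None:
    """Block in recv() like a parent-side task waiting on a dead peer."""
    try:
        data = sock.recv(4096)
    except OSError as exc:
        outcome.append(f"error:{type(exc).__name__}")
    else:
        outcome.append("eof" if data == b"" else "data")


def abort_unparks_receiver(park_time: float = 0.2) -> tuple[bool, list[str]]:
    """Return (was_parked_before_abort, receiver_outcome_after_abort)."""
    parent_end, child_end = socket.socketpair()
    outcome: list[str] = []
    receiver = threading.Thread(
        target=parked_receiver, args=(parent_end, outcome), daemon=True
    )
    receiver.start()

    # The peer never sends and never closes, so no EOF arrives by itself.
    receiver.join(park_time)
    was_parked = receiver.is_alive()

    # Explicit abort of *our* end: this wakes the blocked recv().
    parent_end.shutdown(socket.SHUT_RDWR)
    receiver.join(2.0)

    parent_end.close()
    child_end.close()
    return was_parked, outcome
```

The point of the sketch is only the ordering: the receiver stays parked for as long as nobody touches its end, and it unwinds as soon as the owning side aborts the channel explicitly, which is what the teardown block in `subint_proc` would need to do.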
 ## Current workaround
 None in-tree. The test's `trio.fail_after()` bound
 currently fires and raises `TooSlowError`, so the test
 visibly **fails** rather than hangs — which is
 intentional (an unbounded cancellation-audit test would
 defeat itself). But in interactive test runs the operator
 has to hit Ctrl-C to move past the parked state before
 pytest reports the failure.
 ## Reproducer
 ```
 ./py314/bin/python -m pytest \
  tests/test_subint_cancellation.py::test_subint_non_checkpointing_child \
  --spawn-backend=subint --tb=short --no-header -v
 ```
 Expected: hangs until `trio.fail_after(15)` fires, or
 Ctrl-C unwedges it manually.
 ## References
 - `tractor.spawn._subint.subint_proc` — current subint
  teardown code; see the `_HARD_KILL_TIMEOUT` bounded
  shields + `daemon=True` driver-thread abandonment
  (commit `b025c982`).
 - `ai/conc-anal/subint_sigint_starvation_issue.md` — the
  sibling CPython-level hang (GIL-starvation,
  SIGINT-unresponsive) which is **not** this issue.
 - Phase B tracking: issue #379.
--- a/ai/conc-anal/subint_sigint_starvation_issue.md
+++ b/ai/conc-anal/subint_sigint_starvation_issue.md
@@ -0,0 +1,205 @@
 # `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
 Follow-up to the Phase B subint spawn-backend PR (see
 `tractor.spawn._subint`, issue #379). The hard-kill escape
 hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
 `daemon=True` driver-thread abandonment) handles *most*
 stuck-subint scenarios cleanly, but there's one class of
 hang that can't be fully escaped from within tractor: a
 still-running abandoned sub-interpreter can starve the
 **parent's** trio event loop to the point where **SIGINT is
 effectively dropped at the kernel ↔ Python boundary**,
 making the pytest process un-Ctrl-C-able.
 ## Symptom
 Running `test_stale_entry_is_deleted[subint]` under
 `--spawn-backend=subint`:
 1. Test spawns a subactor (`transport_fails_actor`) which
   kills its own IPC server and then calls
   `trio.sleep_forever()`.
 2. Parent tries `Portal.cancel_actor()` → channel
   disconnected → fast return.
 3. Nursery teardown triggers our `subint_proc` cancel path.
   Portal-cancel fails (dead channel),
   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
   (`daemon=True`), `_interpreters.destroy(interp_id)`
   raises `InterpreterError` (because the subint is still
   running).
 4. Test appears to hang indefinitely at the *outer*
   `async with tractor.open_nursery() as an:` exit.
 5. `Ctrl-C` at the terminal does nothing. The pytest
   process is un-interruptable.
 ## Evidence
 ### `strace` on the hung pytest process
 ```
 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
 write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
 rt_sigreturn({mask=[WINCH]}) = 140585542325792
 ```
 Translated:
 - Kernel delivers `SIGINT` to pytest.
 - CPython's C-level signal handler fires and tries to
  write the signal number byte (`0x02` = SIGINT) to fd 37
  — the **Python signal-wakeup fd** (set via
  `signal.set_wakeup_fd()`, which trio uses to wake its
  event loop on signals).
 - Write returns `EAGAIN` — **the pipe is full**. Nothing
  is draining it.
 - `rt_sigreturn` returns from the C-level handler: the signal
  counts as "handled" from the kernel's perspective, but the
  Python-level handler (and therefore trio's
  `KeyboardInterrupt` delivery) never runs.
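The `EAGAIN` step is easy to reproduce without any signals involved. The following is a minimal sketch, assuming a non-blocking `socket.socketpair()` as a stand-in for the wakeup channel (the function name `wakeup_channel_fill_and_drain` is hypothetical): one-byte writes pile up while nobody reads, the write eventually fails with `BlockingIOError` (Python's mapping of `EAGAIN`), and a single drain pass makes writes land again.

```python
import socket


def wakeup_channel_fill_and_drain() -> dict[str, object]:
    """Fill a non-blocking socketpair until EAGAIN, drain it, write again."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    writer.setblocking(False)
    try:
        # Phase 1: nobody drains, so one-byte "signal" writes pile up.
        queued = 0
        hit_eagain = False
        while not hit_eagain:
            try:
                queued += writer.send(b"\x02")
            except BlockingIOError:  # EAGAIN / EWOULDBLOCK
                hit_eagain = True

        # Phase 2: what an iterating event loop does on every wakeup.
        drained = 0
        while True:
            try:
                chunk = reader.recv(65536)
            except BlockingIOError:
                break  # buffer is empty again
            drained += len(chunk)

        # Phase 3: with the buffer empty, the next write succeeds.
        written_after_drain = writer.send(b"\x02")
        return {
            "queued": queued,
            "hit_eagain": hit_eagain,
            "drained": drained,
            "written_after_drain": written_after_drain,
        }
    finally:
        reader.close()
        writer.close()
```

Phase 2 is the part the starved main loop never gets to run in the hang described here, which is why every later SIGINT write keeps failing.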
 ### Stack dump (via `tractor.devx.dump_on_hang`)
 At 20s into the hang, only the **main thread** is visible:
 ```
 Thread 0x...7fdca0191780 [python] (most recent call first):
  File ".../trio/_core/_io_epoll.py", line 245 in get_events
  File ".../trio/_core/_run.py", line 2415 in run
  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
  ...
 ```
 No driver thread shows up. The abandoned-legacy-subint
 thread still exists from the OS's POV (it's still running
 inside `_interpreters.exec()` driving the subint's
 `trio.run()` on `trio.sleep_forever()`) but the **main
 interp's faulthandler can't see threads currently executing
 inside a sub-interpreter's tstate**. Concretely: the thread
 is alive, holding state we can't introspect from here.
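For readers without the tractor tree at hand, the dump-to-file idea can be approximated with the stdlib alone. This is a sketch built on `faulthandler.dump_traceback_later()`, not the implementation of `tractor.devx.dump_on_hang`; the names `dump_stacks_on_hang` and `demo` and the temp-file path are assumptions for illustration.

```python
import contextlib
import faulthandler
import os
import tempfile
import time


@contextlib.contextmanager
def dump_stacks_on_hang(seconds: float, path: str):
    """Dump all thread stacks to `path` if the block outlives `seconds`."""
    with open(path, "w") as dump_file:
        # Writing to a real file sidesteps pytest's stderr capture.
        faulthandler.dump_traceback_later(seconds, file=dump_file, exit=False)
        try:
            yield path
        finally:
            # Disarm the timer when the block exits, hung or not.
            faulthandler.cancel_dump_traceback_later()


def demo() -> str:
    """Simulate a short hang and return the captured dump text."""
    path = os.path.join(tempfile.mkdtemp(), "hang.dump")
    with dump_stacks_on_hang(seconds=0.2, path=path):
        time.sleep(0.6)  # stand-in for a hung `trio.run(main)`
    with open(path) as dump_file:
        return dump_file.read()
```

The dump lists one traceback per thread that the main interpreter's `faulthandler` can see, so a thread executing inside a sub-interpreter would be missing from it for the reason given above.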
 ## Root cause analysis
 The most consistent explanation for both observations:
 1. **Legacy-config subinterpreters share the main GIL.**
   PEP 734's public `concurrent.interpreters.create()`
   defaults to `'isolated'` (per-interp GIL), but tractor
   uses `_interpreters.create('legacy')` as a workaround
   for C extensions that don't yet support PEP 684
   (notably `msgspec`, see
   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Legacy-mode subints share process-global state
   including the GIL.
 2. **Our abandoned subint thread never exits.** After our
   hard-kill timeout, `driver_thread.join()` is abandoned
   via `abandon_on_cancel=True` and the thread is
   `daemon=True` so proc-exit won't block on it — but the
   thread *itself* is still alive inside
   `_interpreters.exec()`, driving a `trio.run()` that
   will never return (the subint actor is in
   `trio.sleep_forever()`).
 3. **`_interpreters.destroy()` cannot force-stop a running
   subint.** It raises `InterpreterError` on any
   still-running subinterpreter; there is no public
   CPython API to force-destroy one.
 4. **Shared-GIL + non-terminating subint thread → main
   trio loop starvation.** Under enough load (the subint's
   trio event loop iterating in the background, IPC-layer
   tasks still in the subint, etc.) the main trio event
   loop can fail to iterate frequently enough to drain its
   wakeup pipe. Once that pipe fills, `SIGINT` writes from
   the C signal handler return `EAGAIN` and signals are
   silently dropped — exactly what `strace` shows.
 The shielded
 `await actor_nursery._join_procs.wait()` at the top of
 `subint_proc` (inherited unchanged from the `trio_proc`
 pattern) is structurally involved too: if main trio *does*
 get a schedule slice, it'd find the `subint_proc` task
 parked on `_join_procs` under shield — which traps whatever
 `Cancelled` arrives. But that's a second-order effect; the
 signal-pipe-full condition is the primary "Ctrl-C doesn't
 work" cause.
 ## Why we can't fix this from inside tractor
 - **No force-destroy API.** CPython provides neither a
  `_interpreters.force_destroy()` nor a
  thread-cancellation primitive (`pthread_cancel` is actively
  discouraged and unavailable on Windows). A subint stuck
  in pure-Python loops (or worse, C code that doesn't poll
  for signals) is structurally unreachable from outside.
 - **Shared GIL is the root scheduling issue.** As long as
  we're forced into legacy-mode subints for `msgspec`
  compatibility, the abandoned-thread scenario is
  fundamentally a process-global GIL-starvation window.
 - **`signal.set_wakeup_fd()` is process-global.** Even if
  we wanted to put our own drainer on the wakeup pipe,
  only one party owns it at a time.
 ## Current workaround
 - **Fixture-side SIGINT loop on the `daemon` subproc** (in
  this test's `daemon: subprocess.Popen` fixture in
  `tests/conftest.py`). The daemon dying closes its end of
  the registry IPC, which unblocks a pending recv in main
  trio's IPC-server task, which lets the event loop
  iterate, which drains the wakeup pipe, which finally
  delivers the test-harness SIGINT.
 - **Module-level skip on py3.13**
  (`pytest.importorskip('concurrent.interpreters')`) — the
  private `_interpreters` C module exists on 3.13 but the
  multi-trio-task interaction hangs silently there
  independently of this issue.
 ## Path forward
 1. **Primary**: upstream `msgspec` PEP 684 adoption
   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Unlocks `concurrent.interpreters.create()` isolated
   mode → per-interp GIL → abandoned subint threads no
   longer starve the parent's main trio loop. At that
   point we can flip `_subint.py` back to the public API
   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
   and drop the private `_interpreters` path.
 2. **Secondary**: watch CPython for a public
   force-destroy primitive. If something like
   `Interpreter.close(force=True)` lands, we can use it as
   a hard-kill final stage and actually tear down
   abandoned subints.
 3. **Harness-level**: document the fixture-side SIGINT
   loop pattern as the "known workaround" for
   subint-backend tests that can leave background state holding
   the main event loop hostage.
 ## References
 - PEP 734 (`concurrent.interpreters`):
  <https://peps.python.org/pep-0734/>
 - PEP 684 (per-interpreter GIL):
  <https://peps.python.org/pep-0684/>
 - `msgspec` PEP 684 tracker:
  <https://github.com/jcrist/msgspec/issues/563>
 - CPython `_interpretersmodule.c` source:
  <https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
 - `tractor.spawn._subint` module docstring (in-tree
  explanation of the legacy-mode choice and its
  tradeoffs).
 ## Reproducer
 ```
 ./py314/bin/python -m pytest \
  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
  --spawn-backend=subint \
  --tb=short --no-header -v
 ```
 Hangs indefinitely without the fixture-side SIGINT loop;
 with the loop, the test completes (albeit with the
 abandoned-thread warning in logs).
--- a/tests/discovery/test_registrar.py
+++ b/tests/discovery/test_registrar.py
@@ -14,6 +14,7 @@ import psutil
 import pytest
 import subprocess
 import tractor
 from tractor.devx import dump_on_hang
 from tractor.trionics import collapse_eg
 from tractor._testing import tractor_test
 from tractor.discovery._addr import wrap_address
@@ -562,4 +563,53 @@ def test_stale_entry_is_deleted(
                await ptl.cancel_actor()
                await an.cancel()
-    trio.run(main)
+    # TODO, remove once the `[subint]` variant no longer hangs.
    #
    # Status (as of Phase B hard-kill landing):
    #
    # - `[trio]`/`[mp_*]` variants: completes normally; `dump_on_hang`
    #   is a no-op safety net here.
    #
    # - `[subint]` variant: hangs indefinitely AND is un-Ctrl-C-able.
    #   `strace -p <pytest_pid>` while in the hang reveals a
    #   silently-dropped SIGINT: the C signal handler tries to write the
    #   signum byte to Python's signal-wakeup fd and gets `EAGAIN`,
    #   meaning the pipe is full (nobody's draining it).
    #
    #   Root-cause chain: our hard-kill in `spawn._subint` abandoned
    #   the driver OS-thread (which is `daemon=True`) after the
    #   soft-kill timeout, but the *sub-interpreter* inside that thread is
    #   still running `trio.run()` — `_interpreters.destroy()` can't
    #   force-stop a running subint (raises `InterpreterError`), and
    #   legacy-config subints share the main GIL. The abandoned subint
    #   starves the parent's trio event loop from iterating often
    #   enough to drain its wakeup pipe → SIGINT silently drops.
    #
    #   This is structurally a CPython-level limitation: there's no
    #   public force-destroy primitive for a running subint. We
    #   escape on the harness side via a SIGINT-loop in the `daemon`
    #   fixture teardown (killing the bg registrar subproc closes its
    #   end of the IPC, which eventually unblocks a recv in main trio,
    #   which lets the loop drain the wakeup pipe). Long-term fix path:
    #   msgspec PEP 684 support (jcrist/msgspec#563) → isolated-mode
    #   subints with per-interp GIL.
    #
    #   Full analysis:
    #   `ai/conc-anal/subint_sigint_starvation_issue.md`
    #
    #   See also the *sibling* hang class documented in
    #   `ai/conc-anal/subint_cancel_delivery_hang_issue.md` — same
    #   subint backend, different root cause (Ctrl-C-able hang, main
    #   trio loop iterating fine; ours to fix, not CPython's).
    #   Reproduced by `tests/test_subint_cancellation.py
    #   ::test_subint_non_checkpointing_child`.
    #
    # Kept here (and not behind a `pytestmark.skip`) so we can still
    # inspect the dump file if the hang ever returns after a refactor.
    # `pytest`'s stderr capture eats `faulthandler` output otherwise,
    # so we route `dump_on_hang` to a file.
    with dump_on_hang(
        seconds=20,
        path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
    ):
        trio.run(main)
--- a/tests/test_subint_cancellation.py
+++ b/tests/test_subint_cancellation.py
@@ -182,6 +182,32 @@ def test_subint_non_checkpointing_child(
    - ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait)
    - margin
    KNOWN ISSUE (Ctrl-C-able hang):
    -------------------------------
    This test currently hangs past the hard-kill timeout for
    reasons unrelated to the subint teardown itself — after
    the subint is destroyed, a parent-side trio task appears
    to park on an orphaned IPC channel (no clean EOF
    delivered to a waiting receive). Unlike the
    SIGINT-starvation sibling case in
    `test_stale_entry_is_deleted`, this hang IS Ctrl-C-able
    (`strace` shows SIGINT wakeup-fd `write() = 1`, not
    `EAGAIN`) — i.e. the main trio loop is still iterating
    normally. That makes this *our* bug to fix, not a
    CPython-level limitation.
    See `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
    for the full analysis + candidate fix directions
    (explicit parent-side channel abort in `subint_proc`
    teardown being the most likely surgical fix).
    The sibling `ai/conc-anal/subint_sigint_starvation_issue.md`
    documents the *other* hang class (abandoned-legacy-subint
    thread + shared-GIL starvation → signal-wakeup-fd pipe
    fills → SIGINT silently dropped) — that one is
    structurally blocked on msgspec PEP 684 adoption and is
    NOT what this test is hitting.
    '''
    deadline: float = 15.0
    with dump_on_hang(