From 35796ec8ae4a6db3b892adc97fc8733aa8938759 Mon Sep 17 00:00:00 2001
From: goodboy <goodboy_foss@protonmail.com>
Date: Mon, 20 Apr 2026 15:28:00 -0400
Subject: [PATCH] Doc `subint` backend hang classes + arm `dump_on_hang`
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un-skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
---
 .../subint_cancel_delivery_hang_issue.md      | 161 ++++++++++++++
 .../subint_sigint_starvation_issue.md         | 205 ++++++++++++++++++
 tests/discovery/test_registrar.py             |  52 ++++-
 tests/test_subint_cancellation.py             |  26 +++
 4 files changed, 443 insertions(+), 1 deletion(-)
 create mode 100644 ai/conc-anal/subint_cancel_delivery_hang_issue.md
 create mode 100644 ai/conc-anal/subint_sigint_starvation_issue.md
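
Reviewer note (not part of the commit): the `write() = EAGAIN`
signature described in `subint_sigint_starvation_issue.md` can be
illustrated in isolation. Below is a minimal stdlib-only sketch,
assuming a Unix-like host. The non-blocking socketpair only stands in
for the signal-wakeup fd, and `fill_until_eagain` / `drain` are
hypothetical helper names that exist in neither tractor nor trio.

```python
import errno
import socket


def fill_until_eagain(sock: socket.socket) -> int:
    """Send one-byte writes until the kernel buffer refuses more.

    Mimics a C signal handler writing a signum byte to a wakeup fd
    that nobody is draining.
    """
    sent = 0
    while True:
        try:
            sent += sock.send(b"\x02")  # same byte as in the strace output
        except BlockingIOError as err:
            # Non-blocking write on a full buffer: EAGAIN / EWOULDBLOCK.
            assert err.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
            return sent


def drain(sock: socket.socket) -> int:
    """Read everything currently buffered, as an iterating loop would."""
    drained = 0
    while True:
        try:
            chunk = sock.recv(65536)
        except BlockingIOError:
            return drained  # buffer is empty again
        if not chunk:
            return drained  # peer closed its end
        drained += len(chunk)


# Stand-in for the wakeup pipe (assumption: NOT the real wakeup fd).
reader, writer = socket.socketpair()
reader.setblocking(False)
writer.setblocking(False)

written = fill_until_eagain(writer)  # "loop starved": writes now fail
drained = drain(reader)              # "loop iterating": buffer empties
accepted_after_drain = writer.send(b"\x02")  # write is accepted again

reader.close()
writer.close()
```

If the analysis in the doc holds, the starvation class corresponds to
only the first half ever happening: the parent's trio loop never gets
to run its equivalent of `drain`, so every later signal byte is
rejected.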

diff --git a/ai/conc-anal/subint_cancel_delivery_hang_issue.md b/ai/conc-anal/subint_cancel_delivery_hang_issue.md
new file mode 100644
index 00000000..4c3112ed
--- /dev/null
+++ b/ai/conc-anal/subint_cancel_delivery_hang_issue.md
@@ -0,0 +1,161 @@
+# `subint` backend: parent trio loop parks after subint teardown (Ctrl-C works; not a CPython-level issue)
+
+Follow-up to the Phase B subint spawn-backend PR (see
+`tractor.spawn._subint`, issue #379). Distinct from the hang
+in `subint_sigint_starvation_issue.md` (SIGINT-unresponsive
+starvation): this one is **Ctrl-C-able**, which means
+it's *not* the shared-GIL-hostage class and is ours to fix
+from inside tractor rather than waiting on upstream CPython
+/ msgspec progress.
+
+## TL;DR
+
+After a stuck-subint subactor is torn down via the
+hard-kill path, a parent-side trio task parks on an
+*orphaned resource* (most likely a `chan.recv()` /
+`process_messages` loop on the now-dead subint's IPC
+channel) and waits forever for bytes that can't arrive —
+because the channel was torn down without emitting a clean
+EOF/`BrokenResourceError` to the waiting receiver.
+
+Unlike `subint_sigint_starvation_issue.md`, the main trio
+loop **is** iterating normally — SIGINT delivers cleanly
+and the test unhangs. But absent Ctrl-C, the test suite
+wedges indefinitely.
+
+## Symptom
+
+Running `test_subint_non_checkpointing_child` under
+`--spawn-backend=subint` (in
+`tests/test_subint_cancellation.py`):
+
+1. Test spawns a subactor whose main task runs
+   `threading.Event.wait(1.0)` in a loop, which releases the
+   GIL but never reaches a trio checkpoint.
+2. Parent does `an.cancel_scope.cancel()`. Our
+   `subint_proc` cancel path fires: soft-kill sends
+   `Portal.cancel_actor()` over the live IPC channel →
+   subint's trio loop *should* process the cancel msg on
+   its IPC dispatcher task (since the GIL releases are
+   happening).
+3. Expected: subint's `trio.run()` unwinds, driver thread
+   exits naturally, parent returns.
+4. Actual: parent `trio.run()` never completes. Test
+   hangs past its `trio.fail_after()` deadline.
+
+## Evidence
+
+### `strace` on the hung pytest process during SIGINT
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(17, "\2", 1)                      = 1
+```
+
+Contrast with the SIGINT-starvation hang (see
+`subint_sigint_starvation_issue.md`) where that same
+`write()` returned `EAGAIN`. Here the SIGINT byte is
+written successfully → Python's signal handler pipe is
+being drained → main trio loop **is** iterating → SIGINT
+gets turned into `trio.Cancelled` → the test unhangs (if
+the operator happens to be there to hit Ctrl-C).
+
+### Stack dump (via `tractor.devx.dump_on_hang`)
+
+Single main thread visible, parked in
+`trio._core._io_epoll.get_events` inside `trio.run` at the
+test's `trio.run(...)` call site. No subint driver thread
+(subint was destroyed successfully — this is *after* the
+hard-kill path, not during it).
+
+## Root cause hypothesis
+
+Most consistent with the evidence: a parent-side trio
+task is awaiting a `chan.recv()` / `process_messages` loop
+on the dead subint's IPC channel. The sequence:
+
+1. Soft-kill in `subint_proc` sends `Portal.cancel_actor()`
+   over the channel. The subint's trio dispatcher *may* or
+   may not have processed the cancel msg before the subint
+   was destroyed — timing-dependent.
+2. Hard-kill timeout fires (because the subint's main
+   task was in `threading.Event.wait()` with no trio
+   checkpoint, so cancel-msg processing couldn't finish
+   before the timeout).
+3. Driver thread abandoned, `_interpreters.destroy()`
+   runs. Subint is gone.
+4. But the parent-side trio task holding a
+   `chan.recv()` / `process_messages` loop against that
+   channel was **not** explicitly cancelled. The channel's
+   underlying socket got torn down, but without a clean
+   EOF delivered to the waiting recv, the task parks
+   forever on `trio.lowlevel.wait_readable` (or similar).
+
+This matches the "main loop fine, task parked on
+orphaned I/O" signature.
+
+## Why this is ours to fix (not CPython's)
+
+- Main trio loop iterates normally → GIL isn't starved.
+- SIGINT is deliverable → not a signal-pipe-full /
+  wakeup-fd contention scenario.
+- The hang is in *our* supervision code, specifically in
+  how `subint_proc` tears down its side of the IPC when
+  the subint is abandoned/destroyed.
+
+## Possible fix directions
+
+1. **Explicit parent-side channel abort on subint
+   abandon.** In `subint_proc`'s teardown block, after the
+   hard-kill timeout fires, explicitly close the parent's
+   end of the IPC channel to the subint. Any waiting
+   `chan.recv()` / `process_messages` task sees
+   `BrokenResourceError` (or `ClosedResourceError`) and
+   unwinds.
+2. **Cancel parent-side RPC tasks tied to the dead
+   subint's channel.** The `Actor._rpc_tasks` / nursery
+   machinery should have a handle on any
+   `process_messages` loops bound to a specific peer
+   channel. Iterate those and cancel explicitly.
+3. **Bound the top-level shielded
+   `await actor_nursery._join_procs.wait()` in `subint_proc`**
+   (same pattern as the other bounded shields the hard-kill
+   patch added). If the nursery never sets `_join_procs`
+   because a child task is parked, the bound would at
+   least let the teardown proceed.
+
+Of these, (1) is the most surgical and directly addresses
+the root cause. (2) is a defense-in-depth companion. (3)
+is a band-aid but cheap to add.
+
+## Current workaround
+
+None in-tree. The test's `trio.fail_after()` bound
+currently fires and raises `TooSlowError`, so the test
+visibly **fails** rather than hangs — which is
+intentional (an unbounded cancellation-audit test would
+defeat itself). But in interactive test runs the operator
+has to hit Ctrl-C to move past the parked state before
+pytest reports the failure.
+
+## Reproducer
+
+```
+./py314/bin/python -m pytest \
+  tests/test_subint_cancellation.py::test_subint_non_checkpointing_child \
+  --spawn-backend=subint --tb=short --no-header -v
+```
+
+Expected: hangs until `trio.fail_after(15)` fires, or
+Ctrl-C unwedges it manually.
+
+## References
+
+- `tractor.spawn._subint.subint_proc` — current subint
+  teardown code; see the `_HARD_KILL_TIMEOUT` bounded
+  shields + `daemon=True` driver-thread abandonment
+  (commit `b025c982`).
+- `ai/conc-anal/subint_sigint_starvation_issue.md` — the
+  sibling CPython-level hang (GIL-starvation,
+  SIGINT-unresponsive) which is **not** this issue.
+- Phase B tracking: issue #379.
diff --git a/ai/conc-anal/subint_sigint_starvation_issue.md b/ai/conc-anal/subint_sigint_starvation_issue.md
new file mode 100644
index 00000000..438b7b8a
--- /dev/null
+++ b/ai/conc-anal/subint_sigint_starvation_issue.md
@@ -0,0 +1,205 @@
+# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
+
+Follow-up to the Phase B subint spawn-backend PR (see
+`tractor.spawn._subint`, issue #379). The hard-kill escape
+hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
+`daemon=True` driver-thread abandonment) handles *most*
+stuck-subint scenarios cleanly, but there's one class of
+hang that can't be fully escaped from within tractor: a
+still-running abandoned sub-interpreter can starve the
+**parent's** trio event loop to the point where **SIGINT is
+effectively dropped at the kernel ↔ Python boundary**,
+making the pytest process un-Ctrl-C-able.
+
+## Symptom
+
+Running `test_stale_entry_is_deleted[subint]` under
+`--spawn-backend=subint`:
+
+1. Test spawns a subactor (`transport_fails_actor`) which
+   kills its own IPC server and then awaits
+   `trio.sleep_forever()`.
+2. Parent tries `Portal.cancel_actor()` → channel
+   disconnected → fast return.
+3. Nursery teardown triggers our `subint_proc` cancel path.
+   Portal-cancel fails (dead channel),
+   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
+   (`daemon=True`), `_interpreters.destroy(interp_id)`
+   raises `InterpreterError` (because the subint is still
+   running).
+4. Test appears to hang indefinitely at the *outer*
+   `async with tractor.open_nursery() as an:` exit.
+5. `Ctrl-C` at the terminal does nothing. The pytest
+   process is un-interruptable.
+
+## Evidence
+
+### `strace` on the hung pytest process
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
+rt_sigreturn({mask=[WINCH]}) = 140585542325792
+```
+
+Translated:
+
+- Kernel delivers `SIGINT` to pytest.
+- CPython's C-level signal handler fires and tries to
+  write the signal number byte (`0x02` = SIGINT) to fd 37
+  — the **Python signal-wakeup fd** (set via
+  `signal.set_wakeup_fd()`, which trio uses to wake its
+  event loop on signals).
+- Write returns `EAGAIN` — **the pipe is full**. Nothing
+  is draining it.
+- `rt_sigreturn` then returns from the C handler: the
+  signal is "handled" from the kernel's perspective but the
+  Python-level handler (and therefore trio's
+  `KeyboardInterrupt` delivery) never runs.
+
+### Stack dump (via `tractor.devx.dump_on_hang`)
+
+At 20s into the hang, only the **main thread** is visible:
+
+```
+Thread 0x...7fdca0191780 [python] (most recent call first):
+  File ".../trio/_core/_io_epoll.py", line 245 in get_events
+  File ".../trio/_core/_run.py", line 2415 in run
+  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
+  ...
+```
+
+No driver thread shows up. The abandoned-legacy-subint
+thread still exists from the OS's POV (it's still running
+inside `_interpreters.exec()` driving the subint's
+`trio.run()` on `trio.sleep_forever()`) but the **main
+interp's faulthandler can't see threads currently executing
+inside a sub-interpreter's tstate**. Concretely: the thread
+is alive, holding state we can't introspect from here.
+
+## Root cause analysis
+
+The most consistent explanation for both observations:
+
+1. **Legacy-config subinterpreters share the main GIL.**
+   PEP 734's public `concurrent.interpreters.create()`
+   defaults to `'isolated'` (per-interp GIL), but tractor
+   uses `_interpreters.create('legacy')` as a workaround
+   for C extensions that don't yet support PEP 684
+   (notably `msgspec`, see
+   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
+   Legacy-mode subints share process-global state
+   including the GIL.
+
+2. **Our abandoned subint thread never exits.** After our
+   hard-kill timeout, `driver_thread.join()` is abandoned
+   via `abandon_on_cancel=True` and the thread is
+   `daemon=True` so proc-exit won't block on it — but the
+   thread *itself* is still alive inside
+   `_interpreters.exec()`, driving a `trio.run()` that
+   will never return (the subint actor is in
+   `trio.sleep_forever()`).
+
+3. **`_interpreters.destroy()` cannot force-stop a running
+   subint.** It raises `InterpreterError` on any
+   still-running subinterpreter; there is no public
+   CPython API to force-destroy one.
+
+4. **Shared-GIL + non-terminating subint thread → main
+   trio loop starvation.** Under enough load (the subint's
+   trio event loop iterating in the background, IPC-layer
+   tasks still in the subint, etc.) the main trio event
+   loop can fail to iterate frequently enough to drain its
+   wakeup pipe. Once that pipe fills, `SIGINT` writes from
+   the C signal handler return `EAGAIN` and signals are
+   silently dropped — exactly what `strace` shows.
+
+The shielded
+`await actor_nursery._join_procs.wait()` at the top of
+`subint_proc` (inherited unchanged from the `trio_proc`
+pattern) is structurally involved too: if main trio *does*
+get a schedule slice, it'd find the `subint_proc` task
+parked on `_join_procs` under shield — which traps whatever
+`Cancelled` arrives. But that's a second-order effect; the
+signal-pipe-full condition is the primary "Ctrl-C doesn't
+work" cause.
+
+## Why we can't fix this from inside tractor
+
+- **No force-destroy API.** CPython provides neither a
+  `_interpreters.force_destroy()` nor a thread-
+  cancellation primitive (`pthread_cancel` is actively
+  discouraged and unavailable on Windows). A subint stuck
+  in pure-Python loops (or worse, C code that doesn't poll
+  for signals) is structurally unreachable from outside.
+- **Shared GIL is the root scheduling issue.** As long as
+  we're forced into legacy-mode subints for `msgspec`
+  compatibility, the abandoned-thread scenario is
+  fundamentally a process-global GIL-starvation window.
+- **`signal.set_wakeup_fd()` is process-global.** Even if
+  we wanted to put our own drainer on the wakeup pipe,
+  only one party owns it at a time.
+
+## Current workaround
+
+- **Fixture-side SIGINT loop on the `daemon` subproc** (in
+  this test's `daemon: subprocess.Popen` fixture in
+  `tests/conftest.py`). The daemon dying closes its end of
+  the registry IPC, which unblocks a pending recv in main
+  trio's IPC-server task, which lets the event loop
+  iterate, which drains the wakeup pipe, which finally
+  delivers the test-harness SIGINT.
+- **Module-level skip on py3.13**
+  (`pytest.importorskip('concurrent.interpreters')`) — the
+  private `_interpreters` C module exists on 3.13 but the
+  multi-trio-task interaction hangs silently there
+  independently of this issue.
+
+## Path forward
+
+1. **Primary**: upstream `msgspec` PEP 684 adoption
+   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
+   Unlocks `concurrent.interpreters.create()` isolated
+   mode → per-interp GIL → abandoned subint threads no
+   longer starve the parent's main trio loop. At that
+   point we can flip `_subint.py` back to the public API
+   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
+   and drop the private `_interpreters` path.
+
+2. **Secondary**: watch CPython for a public
+   force-destroy primitive. If something like
+   `Interpreter.close(force=True)` lands, we can use it as
+   a hard-kill final stage and actually tear down
+   abandoned subints.
+
+3. **Harness-level**: document the fixture-side SIGINT
+   loop pattern as the "known workaround" for subint-
+   backend tests that can leave background state holding
+   the main event loop hostage.
+
+## References
+
+- PEP 734 (`concurrent.interpreters`):
+  <https://peps.python.org/pep-0734/>
+- PEP 684 (per-interpreter GIL):
+  <https://peps.python.org/pep-0684/>
+- `msgspec` PEP 684 tracker:
+  <https://github.com/jcrist/msgspec/issues/563>
+- CPython `_interpretersmodule.c` source:
+  <https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
+- `tractor.spawn._subint` module docstring (in-tree
+  explanation of the legacy-mode choice and its
+  tradeoffs).
+
+## Reproducer
+
+```
+./py314/bin/python -m pytest \
+  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
+  --spawn-backend=subint \
+  --tb=short --no-header -v
+```
+
+Hangs indefinitely without the fixture-side SIGINT loop;
+with the loop, the test completes (albeit with the
+abandoned-thread warning in logs).
diff --git a/tests/discovery/test_registrar.py b/tests/discovery/test_registrar.py
index 60b2b10c..6f34b117 100644
--- a/tests/discovery/test_registrar.py
+++ b/tests/discovery/test_registrar.py
@@ -14,6 +14,7 @@ import psutil
 import pytest
 import subprocess
 import tractor
+from tractor.devx import dump_on_hang
 from tractor.trionics import collapse_eg
 from tractor._testing import tractor_test
 from tractor.discovery._addr import wrap_address
@@ -562,4 +563,53 @@ def test_stale_entry_is_deleted(
                 await ptl.cancel_actor()
                 await an.cancel()
 
-    trio.run(main)
+    # TODO, remove once the `[subint]` variant no longer hangs.
+    #
+    # Status (as of Phase B hard-kill landing):
+    #
+    # - `[trio]`/`[mp_*]` variants: completes normally; `dump_on_hang`
+    #   is a no-op safety net here.
+    #
+    # - `[subint]` variant: hangs indefinitely AND is un-Ctrl-C-able.
+    #   `strace -p <pytest_pid>` while in the hang reveals a silently-
+    #   dropped SIGINT — the C signal handler tries to write the
+    #   signum byte to Python's signal-wakeup fd and gets `EAGAIN`,
+    #   meaning the pipe is full (nobody's draining it).
+    #
+    #   Root-cause chain: our hard-kill in `spawn._subint` abandoned
+    #   the driver OS-thread (which is `daemon=True`) after the soft-
+    #   kill timeout, but the *sub-interpreter* inside that thread is
+    #   still running `trio.run()` — `_interpreters.destroy()` can't
+    #   force-stop a running subint (raises `InterpreterError`), and
+    #   legacy-config subints share the main GIL. The abandoned subint
+    #   starves the parent's trio event loop from iterating often
+    #   enough to drain its wakeup pipe → SIGINT silently drops.
+    #
+    #   This is structurally a CPython-level limitation: there's no
+    #   public force-destroy primitive for a running subint. We
+    #   escape on the harness side via a SIGINT-loop in the `daemon`
+    #   fixture teardown (killing the bg registrar subproc closes its
+    #   end of the IPC, which eventually unblocks a recv in main trio,
+    #   which lets the loop drain the wakeup pipe). Long-term fix path:
+    #   msgspec PEP 684 support (jcrist/msgspec#563) → isolated-mode
+    #   subints with per-interp GIL.
+    #
+    #   Full analysis:
+    #   `ai/conc-anal/subint_sigint_starvation_issue.md`
+    #
+    #   See also the *sibling* hang class documented in
+    #   `ai/conc-anal/subint_cancel_delivery_hang_issue.md` — same
+    #   subint backend, different root cause (Ctrl-C-able hang, main
+    #   trio loop iterating fine; ours to fix, not CPython's).
+    #   Reproduced by `tests/test_subint_cancellation.py
+    #   ::test_subint_non_checkpointing_child`.
+    #
+    # Kept here (and not behind a `pytestmark.skip`) so we can still
+    # inspect the dump file if the hang ever returns after a refactor.
+    # `pytest`'s stderr capture eats `faulthandler` output otherwise,
+    # so we route `dump_on_hang` to a file.
+    with dump_on_hang(
+        seconds=20,
+        path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
+    ):
+        trio.run(main)
diff --git a/tests/test_subint_cancellation.py b/tests/test_subint_cancellation.py
index 67523b8c..04b6cc9e 100644
--- a/tests/test_subint_cancellation.py
+++ b/tests/test_subint_cancellation.py
@@ -182,6 +182,32 @@ def test_subint_non_checkpointing_child(
     - ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait)
     - margin
 
+    KNOWN ISSUE (Ctrl-C-able hang):
+    -------------------------------
+    This test currently hangs past the hard-kill timeout for
+    reasons unrelated to the subint teardown itself — after
+    the subint is destroyed, a parent-side trio task appears
+    to park on an orphaned IPC channel (no clean EOF
+    delivered to a waiting receive). Unlike the
+    SIGINT-starvation sibling case in
+    `test_stale_entry_is_deleted`, this hang IS Ctrl-C-able
+    (`strace` shows SIGINT wakeup-fd `write() = 1`, not
+    `EAGAIN`) — i.e. the main trio loop is still iterating
+    normally. That makes this *our* bug to fix, not a
+    CPython-level limitation.
+
+    See `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
+    for the full analysis + candidate fix directions
+    (explicit parent-side channel abort in `subint_proc`
+    teardown being the most likely surgical fix).
+
+    The sibling `ai/conc-anal/subint_sigint_starvation_issue.md`
+    documents the *other* hang class (abandoned-legacy-subint
+    thread + shared-GIL starvation → signal-wakeup-fd pipe
+    fills → SIGINT silently dropped) — that one is
+    structurally blocked on msgspec PEP 684 adoption and is
+    NOT what this test is hitting.
+
     '''
     deadline: float = 15.0
     with dump_on_hang(