Doc `subint` backend hang classes + arm `dump_on_hang`

Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un- skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-20 15:28:00 -04:00
parent 2ed5e6a6e8
commit 4a3254583b
4 changed files with 443 additions and 1 deletions

View File

@ -0,0 +1,161 @@
# `subint` backend: parent trio loop parks after subint teardown (Ctrl-C works; not a CPython-level issue)
Follow-up to the Phase B subint spawn-backend PR (see
`tractor.spawn._subint`, issue #379). Distinct from the
`subint_sigint_starvation_issue.md` (SIGINT-unresponsive
starvation hang): this one is **Ctrl-C-able**, which means
it's *not* the shared-GIL-hostage class and is ours to fix
from inside tractor rather than waiting on upstream CPython
/ msgspec progress.
## TL;DR
After a stuck-subint subactor is torn down via the
hard-kill path, a parent-side trio task parks on an
*orphaned resource* (most likely a `chan.recv()` /
`process_messages` loop on the now-dead subint's IPC
channel) and waits forever for bytes that can't arrive —
because the channel was torn down without emitting a clean
EOF/`BrokenResourceError` to the waiting receiver.
Unlike `subint_sigint_starvation_issue.md`, the main trio
loop **is** iterating normally — SIGINT delivers cleanly
and the test unhangs. But absent Ctrl-C, the test suite
wedges indefinitely.
## Symptom
Running `test_subint_non_checkpointing_child` under
`--spawn-backend=subint` (in
`tests/test_subint_cancellation.py`):
1. Test spawns a subactor whose main task runs
`threading.Event.wait(1.0)` in a loop — releases the
GIL but never inserts a trio checkpoint.
2. Parent does `an.cancel_scope.cancel()`. Our
`subint_proc` cancel path fires: soft-kill sends
`Portal.cancel_actor()` over the live IPC channel →
subint's trio loop *should* process the cancel msg on
its IPC dispatcher task (since the GIL releases are
happening).
3. Expected: subint's `trio.run()` unwinds, driver thread
exits naturally, parent returns.
4. Actual: parent `trio.run()` never completes. Test
hangs past its `trio.fail_after()` deadline.
## Evidence
### `strace` on the hung pytest process during SIGINT
```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(17, "\2", 1) = 1
```
Contrast with the SIGINT-starvation hang (see
`subint_sigint_starvation_issue.md`) where that same
`write()` returned `EAGAIN`. Here the SIGINT byte is
written successfully → Python's signal handler pipe is
being drained → main trio loop **is** iterating → SIGINT
gets turned into `trio.Cancelled` → the test unhangs (if
the operator happens to be there to hit Ctrl-C).
### Stack dump (via `tractor.devx.dump_on_hang`)
Single main thread visible, parked in
`trio._core._io_epoll.get_events` inside `trio.run` at the
test's `trio.run(...)` call site. No subint driver thread
(subint was destroyed successfully — this is *after* the
hard-kill path, not during it).
## Root cause hypothesis
Most consistent with the evidence: a parent-side trio
task is awaiting a `chan.recv()` / `process_messages` loop
on the dead subint's IPC channel. The sequence:
1. Soft-kill in `subint_proc` sends `Portal.cancel_actor()`
over the channel. The subint's trio dispatcher *may* or
may not have processed the cancel msg before the subint
was destroyed — timing-dependent.
2. Hard-kill timeout fires (because the subint's main
task was in `threading.Event.wait()` with no trio
checkpoint — cancel-msg processing couldn't race the
timeout).
3. Driver thread abandoned, `_interpreters.destroy()`
runs. Subint is gone.
4. But the parent-side trio task holding a
`chan.recv()` / `process_messages` loop against that
channel was **not** explicitly cancelled. The channel's
underlying socket got torn down, but without a clean
EOF delivered to the waiting recv, the task parks
forever on `trio.lowlevel.wait_readable` (or similar).
This matches the "main loop fine, task parked on
orphaned I/O" signature.
## Why this is ours to fix (not CPython's)
- Main trio loop iterates normally → GIL isn't starved.
- SIGINT is deliverable → not a signal-pipe-full /
wakeup-fd contention scenario.
- The hang is in *our* supervision code, specifically in
how `subint_proc` tears down its side of the IPC when
the subint is abandoned/destroyed.
## Possible fix directions
1. **Explicit parent-side channel abort on subint
abandon.** In `subint_proc`'s teardown block, after the
hard-kill timeout fires, explicitly close the parent's
end of the IPC channel to the subint. Any waiting
`chan.recv()` / `process_messages` task sees
`BrokenResourceError` (or `ClosedResourceError`) and
unwinds.
2. **Cancel parent-side RPC tasks tied to the dead
subint's channel.** The `Actor._rpc_tasks` / nursery
machinery should have a handle on any
`process_messages` loops bound to a specific peer
channel. Iterate those and cancel explicitly.
3. **Bound the top-level `await actor_nursery
._join_procs.wait()` shield in `subint_proc`** (same
pattern as the other bounded shields the hard-kill
patch added). If the nursery never sets `_join_procs`
because a child task is parked, the bound would at
least let the teardown proceed.
Of these, (1) is the most surgical and directly addresses
the root cause. (2) is a defense-in-depth companion. (3)
is a band-aid but cheap to add.
## Current workaround
None in-tree. The test's `trio.fail_after()` bound
currently fires and raises `TooSlowError`, so the test
visibly **fails** rather than hangs — which is
intentional (an unbounded cancellation-audit test would
defeat itself). But in interactive test runs the operator
has to hit Ctrl-C to move past the parked state before
pytest reports the failure.
## Reproducer
```
./py314/bin/python -m pytest \
tests/test_subint_cancellation.py::test_subint_non_checkpointing_child \
--spawn-backend=subint --tb=short --no-header -v
```
Expected: hangs until `trio.fail_after(15)` fires, or
Ctrl-C unwedges it manually.
## References
- `tractor.spawn._subint.subint_proc` — current subint
teardown code; see the `_HARD_KILL_TIMEOUT` bounded
shields + `daemon=True` driver-thread abandonment
(commit `b025c982`).
- `ai/conc-anal/subint_sigint_starvation_issue.md` — the
sibling CPython-level hang (GIL-starvation,
SIGINT-unresponsive) which is **not** this issue.
- Phase B tracking: issue #379.

View File

@ -0,0 +1,205 @@
# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
Follow-up to the Phase B subint spawn-backend PR (see
`tractor.spawn._subint`, issue #379). The hard-kill escape
hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
`daemon=True` driver-thread abandonment) handles *most*
stuck-subint scenarios cleanly, but there's one class of
hang that can't be fully escaped from within tractor: a
still-running abandoned sub-interpreter can starve the
**parent's** trio event loop to the point where **SIGINT is
effectively dropped by the kernel ↔ Python boundary** —
making the pytest process un-Ctrl-C-able.
## Symptom
Running `test_stale_entry_is_deleted[subint]` under
`--spawn-backend=subint`:
1. Test spawns a subactor (`transport_fails_actor`) which
kills its own IPC server and then
`trio.sleep_forever()`.
2. Parent tries `Portal.cancel_actor()` → channel
disconnected → fast return.
3. Nursery teardown triggers our `subint_proc` cancel path.
Portal-cancel fails (dead channel),
`_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
(`daemon=True`), `_interpreters.destroy(interp_id)`
raises `InterpreterError` (because the subint is still
running).
4. Test appears to hang indefinitely at the *outer*
`async with tractor.open_nursery() as an:` exit.
5. `Ctrl-C` at the terminal does nothing. The pytest
process is un-interruptable.
## Evidence
### `strace` on the hung pytest process
```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140585542325792
```
Translated:
- Kernel delivers `SIGINT` to pytest.
- CPython's C-level signal handler fires and tries to
write the signal number byte (`0x02` = SIGINT) to fd 37
— the **Python signal-wakeup fd** (set via
`signal.set_wakeup_fd()`, which trio uses to wake its
event loop on signals).
- Write returns `EAGAIN`**the pipe is full**. Nothing
is draining it.
- `rt_sigreturn` with the signal masked off — signal is
"handled" from the kernel's perspective but the actual
Python-level handler (and therefore trio's
`KeyboardInterrupt` delivery) never runs.
### Stack dump (via `tractor.devx.dump_on_hang`)
At 20s into the hang, only the **main thread** is visible:
```
Thread 0x...7fdca0191780 [python] (most recent call first):
File ".../trio/_core/_io_epoll.py", line 245 in get_events
File ".../trio/_core/_run.py", line 2415 in run
File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
...
```
No driver thread shows up. The abandoned-legacy-subint
thread still exists from the OS's POV (it's still running
inside `_interpreters.exec()` driving the subint's
`trio.run()` on `trio.sleep_forever()`) but the **main
interp's faulthandler can't see threads currently executing
inside a sub-interpreter's tstate**. Concretely: the thread
is alive, holding state we can't introspect from here.
## Root cause analysis
The most consistent explanation for both observations:
1. **Legacy-config subinterpreters share the main GIL.**
PEP 734's public `concurrent.interpreters.create()`
defaults to `'isolated'` (per-interp GIL), but tractor
uses `_interpreters.create('legacy')` as a workaround
for C extensions that don't yet support PEP 684
(notably `msgspec`, see
[jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
Legacy-mode subints share process-global state
including the GIL.
2. **Our abandoned subint thread never exits.** After our
hard-kill timeout, `driver_thread.join()` is abandoned
via `abandon_on_cancel=True` and the thread is
`daemon=True` so proc-exit won't block on it — but the
thread *itself* is still alive inside
`_interpreters.exec()`, driving a `trio.run()` that
will never return (the subint actor is in
`trio.sleep_forever()`).
3. **`_interpreters.destroy()` cannot force-stop a running
subint.** It raises `InterpreterError` on any
still-running subinterpreter; there is no public
CPython API to force-destroy one.
4. **Shared-GIL + non-terminating subint thread → main
trio loop starvation.** Under enough load (the subint's
trio event loop iterating in the background, IPC-layer
tasks still in the subint, etc.) the main trio event
loop can fail to iterate frequently enough to drain its
wakeup pipe. Once that pipe fills, `SIGINT` writes from
the C signal handler return `EAGAIN` and signals are
silently dropped — exactly what `strace` shows.
The shielded
`await actor_nursery._join_procs.wait()` at the top of
`subint_proc` (inherited unchanged from the `trio_proc`
pattern) is structurally involved too: if main trio *does*
get a schedule slice, it'd find the `subint_proc` task
parked on `_join_procs` under shield — which traps whatever
`Cancelled` arrives. But that's a second-order effect; the
signal-pipe-full condition is the primary "Ctrl-C doesn't
work" cause.
## Why we can't fix this from inside tractor
- **No force-destroy API.** CPython provides neither a
`_interpreters.force_destroy()` nor a thread-
cancellation primitive (`pthread_cancel` is actively
discouraged and unavailable on Windows). A subint stuck
in pure-Python loops (or worse, C code that doesn't poll
for signals) is structurally unreachable from outside.
- **Shared GIL is the root scheduling issue.** As long as
we're forced into legacy-mode subints for `msgspec`
compatibility, the abandoned-thread scenario is
fundamentally a process-global GIL-starvation window.
- **`signal.set_wakeup_fd()` is process-global.** Even if
we wanted to put our own drainer on the wakeup pipe,
only one party owns it at a time.
## Current workaround
- **Fixture-side SIGINT loop on the `daemon` subproc** (in
this test's `daemon: subprocess.Popen` fixture in
`tests/conftest.py`). The daemon dying closes its end of
the registry IPC, which unblocks a pending recv in main
trio's IPC-server task, which lets the event loop
iterate, which drains the wakeup pipe, which finally
delivers the test-harness SIGINT.
- **Module-level skip on py3.13**
(`pytest.importorskip('concurrent.interpreters')`) — the
private `_interpreters` C module exists on 3.13 but the
multi-trio-task interaction hangs silently there
independently of this issue.
## Path forward
1. **Primary**: upstream `msgspec` PEP 684 adoption
([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
Unlocks `concurrent.interpreters.create()` isolated
mode → per-interp GIL → abandoned subint threads no
longer starve the parent's main trio loop. At that
point we can flip `_subint.py` back to the public API
(`create()` / `Interpreter.exec()` / `Interpreter.close()`)
and drop the private `_interpreters` path.
2. **Secondary**: watch CPython for a public
force-destroy primitive. If something like
`Interpreter.close(force=True)` lands, we can use it as
a hard-kill final stage and actually tear down
abandoned subints.
3. **Harness-level**: document the fixture-side SIGINT
loop pattern as the "known workaround" for subint-
backend tests that can leave background state holding
the main event loop hostage.
## References
- PEP 734 (`concurrent.interpreters`):
<https://peps.python.org/pep-0734/>
- PEP 684 (per-interpreter GIL):
<https://peps.python.org/pep-0684/>
- `msgspec` PEP 684 tracker:
<https://github.com/jcrist/msgspec/issues/563>
- CPython `_interpretersmodule.c` source:
<https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
- `tractor.spawn._subint` module docstring (in-tree
explanation of the legacy-mode choice and its
tradeoffs).
## Reproducer
```
./py314/bin/python -m pytest \
tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
--spawn-backend=subint \
--tb=short --no-header -v
```
Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the
abandoned-thread warning in logs).

View File

@ -14,6 +14,7 @@ import psutil
import pytest import pytest
import subprocess import subprocess
import tractor import tractor
from tractor.devx import dump_on_hang
from tractor.trionics import collapse_eg from tractor.trionics import collapse_eg
from tractor._testing import tractor_test from tractor._testing import tractor_test
from tractor.discovery._addr import wrap_address from tractor.discovery._addr import wrap_address
@ -562,4 +563,53 @@ def test_stale_entry_is_deleted(
await ptl.cancel_actor() await ptl.cancel_actor()
await an.cancel() await an.cancel()
trio.run(main) # TODO, remove once the `[subint]` variant no longer hangs.
#
# Status (as of Phase B hard-kill landing):
#
# - `[trio]`/`[mp_*]` variants: completes normally; `dump_on_hang`
# is a no-op safety net here.
#
# - `[subint]` variant: hangs indefinitely AND is un-Ctrl-C-able.
# `strace -p <pytest_pid>` while in the hang reveals a silently-
# dropped SIGINT — the C signal handler tries to write the
# signum byte to Python's signal-wakeup fd and gets `EAGAIN`,
# meaning the pipe is full (nobody's draining it).
#
# Root-cause chain: our hard-kill in `spawn._subint` abandoned
# the driver OS-thread (which is `daemon=True`) after the soft-
# kill timeout, but the *sub-interpreter* inside that thread is
# still running `trio.run()` — `_interpreters.destroy()` can't
# force-stop a running subint (raises `InterpreterError`), and
# legacy-config subints share the main GIL. The abandoned subint
# starves the parent's trio event loop from iterating often
# enough to drain its wakeup pipe → SIGINT silently drops.
#
# This is structurally a CPython-level limitation: there's no
# public force-destroy primitive for a running subint. We
# escape on the harness side via a SIGINT-loop in the `daemon`
# fixture teardown (killing the bg registrar subproc closes its
# end of the IPC, which eventually unblocks a recv in main trio,
# which lets the loop drain the wakeup pipe). Long-term fix path:
# msgspec PEP 684 support (jcrist/msgspec#563) → isolated-mode
# subints with per-interp GIL.
#
# Full analysis:
# `ai/conc-anal/subint_sigint_starvation_issue.md`
#
# See also the *sibling* hang class documented in
# `ai/conc-anal/subint_cancel_delivery_hang_issue.md` — same
# subint backend, different root cause (Ctrl-C-able hang, main
# trio loop iterating fine; ours to fix, not CPython's).
# Reproduced by `tests/test_subint_cancellation.py
# ::test_subint_non_checkpointing_child`.
#
# Kept here (and not behind a `pytestmark.skip`) so we can still
# inspect the dump file if the hang ever returns after a refactor.
# `pytest`'s stderr capture eats `faulthandler` output otherwise,
# so we route `dump_on_hang` to a file.
with dump_on_hang(
seconds=20,
path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
):
trio.run(main)

View File

@ -182,6 +182,32 @@ def test_subint_non_checkpointing_child(
- ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait) - ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait)
- margin - margin
KNOWN ISSUE (Ctrl-C-able hang):
-------------------------------
This test currently hangs past the hard-kill timeout for
reasons unrelated to the subint teardown itself after
the subint is destroyed, a parent-side trio task appears
to park on an orphaned IPC channel (no clean EOF
delivered to a waiting receive). Unlike the
SIGINT-starvation sibling case in
`test_stale_entry_is_deleted`, this hang IS Ctrl-C-able
(`strace` shows SIGINT wakeup-fd `write() = 1`, not
`EAGAIN`) i.e. the main trio loop is still iterating
normally. That makes this *our* bug to fix, not a
CPython-level limitation.
See `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
for the full analysis + candidate fix directions
(explicit parent-side channel abort in `subint_proc`
teardown being the most likely surgical fix).
The sibling `ai/conc-anal/subint_sigint_starvation_issue.md`
documents the *other* hang class (abandoned-legacy-subint
thread + shared-GIL starvation signal-wakeup-fd pipe
fills SIGINT silently dropped) that one is
structurally blocked on msgspec PEP 684 adoption and is
NOT what this test is hitting.
''' '''
deadline: float = 15.0 deadline: float = 15.0
with dump_on_hang( with dump_on_hang(