Doc `subint` backend hang classes + arm `dump_on_hang`

Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown; no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un-skipped so the dump file is inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

subint_spawner_backend
parent baf7ec54ac
commit 35796ec8ae
@ -0,0 +1,161 @@
# `subint` backend: parent trio loop parks after subint teardown (Ctrl-C works; not a CPython-level issue)

Follow-up to the Phase B subint spawn-backend PR (see
`tractor.spawn._subint`, issue #379). Distinct from the
`subint_sigint_starvation_issue.md` (SIGINT-unresponsive
starvation hang): this one is **Ctrl-C-able**, which means
it's *not* the shared-GIL-hostage class and is ours to fix
from inside tractor rather than waiting on upstream CPython
/ msgspec progress.

## TL;DR

After a stuck-subint subactor is torn down via the
hard-kill path, a parent-side trio task parks on an
*orphaned resource* (most likely a `chan.recv()` /
`process_messages` loop on the now-dead subint's IPC
channel) and waits forever for bytes that can't arrive,
because the channel was torn down without emitting a clean
EOF/`BrokenResourceError` to the waiting receiver.

Unlike `subint_sigint_starvation_issue.md`, the main trio
loop **is** iterating normally: SIGINT delivers cleanly
and the test unhangs. But absent Ctrl-C, the test suite
wedges indefinitely.

## Symptom

Running `test_subint_non_checkpointing_child` under
`--spawn-backend=subint` (in
`tests/test_subint_cancellation.py`):

1. Test spawns a subactor whose main task runs
   `threading.Event.wait(1.0)` in a loop, which releases the
   GIL but never inserts a trio checkpoint.
2. Parent does `an.cancel_scope.cancel()`. Our
   `subint_proc` cancel path fires: soft-kill sends
   `Portal.cancel_actor()` over the live IPC channel →
   subint's trio loop *should* process the cancel msg on
   its IPC dispatcher task (since the GIL releases are
   happening).
3. Expected: subint's `trio.run()` unwinds, driver thread
   exits naturally, parent returns.
4. Actual: parent `trio.run()` never completes. Test
   hangs past its `trio.fail_after()` deadline.

## Evidence

### `strace` on the hung pytest process during SIGINT

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(17, "\2", 1) = 1
```

Contrast with the SIGINT-starvation hang (see
`subint_sigint_starvation_issue.md`) where that same
`write()` returned `EAGAIN`. Here the SIGINT byte is
written successfully → Python's signal handler pipe is
being drained → main trio loop **is** iterating → SIGINT
gets turned into `trio.Cancelled` → the test unhangs (if
the operator happens to be there to hit Ctrl-C).

### Stack dump (via `tractor.devx.dump_on_hang`)

Single main thread visible, parked in
`trio._core._io_epoll.get_events` inside `trio.run` at the
test's `trio.run(...)` call site. No subint driver thread
(subint was destroyed successfully; this is *after* the
hard-kill path, not during it).

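The `write(17, "\2", 1) = 1` line in the `strace` excerpt is CPython's C-level signal handler writing the signal number to the wakeup fd. The mechanism can be sketched in isolation with the stdlib alone, assuming a POSIX platform and a call from the main thread. The helper name `wakeup_byte_for_sigint` is made up for illustration; nothing here is tractor or trio code.

```python
import signal
import socket


def wakeup_byte_for_sigint() -> bytes:
    """Raise SIGINT in-process and return the byte written to the wakeup fd."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    writer.setblocking(False)  # set_wakeup_fd() requires a non-blocking fd

    # Swallow SIGINT at the Python level so no KeyboardInterrupt is raised.
    old_handler = signal.signal(signal.SIGINT, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(writer.fileno())
    try:
        # The C handler runs during raise_signal() and writes one byte.
        signal.raise_signal(signal.SIGINT)
        return reader.recv(1)
    finally:
        # Restore whatever the process had installed before.
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGINT, old_handler)
        reader.close()
        writer.close()
```

The returned byte equals `bytes([signal.SIGINT])`, the same `"\2"` that `strace` shows. An event loop that keeps reading its end of this pair is what keeps the write succeeding.
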
## Root cause hypothesis

Most consistent with the evidence: a parent-side trio
task is awaiting a `chan.recv()` / `process_messages` loop
on the dead subint's IPC channel. The sequence:

1. Soft-kill in `subint_proc` sends `Portal.cancel_actor()`
   over the channel. The subint's trio dispatcher *may* or
   may not have processed the cancel msg before the subint
   was destroyed (timing-dependent).
2. Hard-kill timeout fires (because the subint's main
   task was in `threading.Event.wait()` with no trio
   checkpoint, so cancel-msg processing couldn't race the
   timeout).
3. Driver thread abandoned, `_interpreters.destroy()`
   runs. Subint is gone.
4. But the parent-side trio task holding a
   `chan.recv()` / `process_messages` loop against that
   channel was **not** explicitly cancelled. The channel's
   underlying socket got torn down, but without a clean
   EOF delivered to the waiting recv, the task parks
   forever on `trio.lowlevel.wait_readable` (or similar).

This matches the "main loop fine, task parked on
orphaned I/O" signature.

## Why this is ours to fix (not CPython's)

- Main trio loop iterates normally → GIL isn't starved.
- SIGINT is deliverable → not a signal-pipe-full /
  wakeup-fd contention scenario.
- The hang is in *our* supervision code, specifically in
  how `subint_proc` tears down its side of the IPC when
  the subint is abandoned/destroyed.

## Possible fix directions

1. **Explicit parent-side channel abort on subint
   abandon.** In `subint_proc`'s teardown block, after the
   hard-kill timeout fires, explicitly close the parent's
   end of the IPC channel to the subint. Any waiting
   `chan.recv()` / `process_messages` task sees
   `BrokenResourceError` (or `ClosedResourceError`) and
   unwinds.
2. **Cancel parent-side RPC tasks tied to the dead
   subint's channel.** The `Actor._rpc_tasks` / nursery
   machinery should have a handle on any
   `process_messages` loops bound to a specific peer
   channel. Iterate those and cancel explicitly.
3. **Bound the top-level
   `await actor_nursery._join_procs.wait()` shield in
   `subint_proc`** (same pattern as the other bounded
   shields the hard-kill patch added). If the nursery
   never sets `_join_procs` because a child task is
   parked, the bound would at least let the teardown
   proceed.

Of these, (1) is the most surgical and directly addresses
the root cause. (2) is a defense-in-depth companion. (3)
is a band-aid but cheap to add.

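The idea behind fix direction (1) can be illustrated with a stdlib-only analogy, assuming a POSIX platform. It uses a blocking socket pair and a thread instead of tractor's trio-based channel, so it is only a sketch of the principle; the names `abort_parked_receiver` and `orphaned_peer` are hypothetical. A receiver parks on a peer that neither sends nor closes, and the owner then aborts its own end explicitly so the receiver wakes up.

```python
import socket
import threading
from typing import Optional, Tuple


def abort_parked_receiver(join_timeout: float = 2.0) -> Tuple[bool, Optional[bytes]]:
    """Park a receiver on a silent peer, then unblock it via local shutdown."""
    parent_end, orphaned_peer = socket.socketpair()
    received = []

    def receiver() -> None:
        # Parks here: the peer never sends and never closes, so no EOF arrives.
        received.append(parent_end.recv(1024))

    task = threading.Thread(target=receiver, daemon=True)
    task.start()
    task.join(0.2)
    was_parked = task.is_alive()  # still waiting for bytes that can't arrive

    # Explicit abort of *our* end: the parked recv() returns EOF (b"").
    parent_end.shutdown(socket.SHUT_RDWR)
    task.join(join_timeout)

    parent_end.close()
    orphaned_peer.close()
    return was_parked, (received[0] if received else None)
```

In trio terms the waiting task would see an exception rather than an empty read, but the design point is the same: teardown code that owns the channel should not rely on the dead peer to deliver the wakeup.
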
## Current workaround

None in-tree. The test's `trio.fail_after()` bound
currently fires and raises `TooSlowError`, so the test
visibly **fails** rather than hangs, which is
intentional (an unbounded cancellation-audit test would
defeat itself). But in interactive test runs the operator
has to hit Ctrl-C to move past the parked state before
pytest reports the failure.

## Reproducer

```
./py314/bin/python -m pytest \
  tests/test_subint_cancellation.py::test_subint_non_checkpointing_child \
  --spawn-backend=subint --tb=short --no-header -v
```

Expected: hangs until `trio.fail_after(15)` fires, or
Ctrl-C unwedges it manually.

## References

- `tractor.spawn._subint.subint_proc`: current subint
  teardown code; see the `_HARD_KILL_TIMEOUT` bounded
  shields + `daemon=True` driver-thread abandonment
  (commit `b025c982`).
- `ai/conc-anal/subint_sigint_starvation_issue.md`: the
  sibling CPython-level hang (GIL-starvation,
  SIGINT-unresponsive) which is **not** this issue.
- Phase B tracking: issue #379.

@ -0,0 +1,205 @@
# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)

Follow-up to the Phase B subint spawn-backend PR (see
`tractor.spawn._subint`, issue #379). The hard-kill escape
hatch we landed (`_HARD_KILL_TIMEOUT`, bounded shields,
`daemon=True` driver-thread abandonment) handles *most*
stuck-subint scenarios cleanly, but there's one class of
hang that can't be fully escaped from within tractor: a
still-running abandoned sub-interpreter can starve the
**parent's** trio event loop to the point where **SIGINT is
effectively dropped by the kernel ↔ Python boundary**,
making the pytest process un-Ctrl-C-able.

## Symptom

Running `test_stale_entry_is_deleted[subint]` under
`--spawn-backend=subint`:

1. Test spawns a subactor (`transport_fails_actor`) which
   kills its own IPC server and then calls
   `trio.sleep_forever()`.
2. Parent tries `Portal.cancel_actor()` → channel
   disconnected → fast return.
3. Nursery teardown triggers our `subint_proc` cancel path.
   Portal-cancel fails (dead channel),
   `_HARD_KILL_TIMEOUT` fires, driver thread is abandoned
   (`daemon=True`), `_interpreters.destroy(interp_id)`
   raises `InterpreterError` (because the subint is still
   running).
4. Test appears to hang indefinitely at the *outer*
   `async with tractor.open_nursery() as an:` exit.
5. `Ctrl-C` at the terminal does nothing. The pytest
   process is uninterruptible.

## Evidence

### `strace` on the hung pytest process

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140585542325792
```

Translated:

- Kernel delivers `SIGINT` to pytest.
- CPython's C-level signal handler fires and tries to
  write the signal number byte (`0x02` = SIGINT) to fd 37,
  the **Python signal-wakeup fd** (set via
  `signal.set_wakeup_fd()`, which trio uses to wake its
  event loop on signals).
- Write returns `EAGAIN`: **the pipe is full**. Nothing
  is draining it.
- `rt_sigreturn` with the signal masked off: the signal is
  "handled" from the kernel's perspective but the actual
  Python-level handler (and therefore trio's
  `KeyboardInterrupt` delivery) never runs.

### Stack dump (via `tractor.devx.dump_on_hang`)

At 20s into the hang, only the **main thread** is visible:

```
Thread 0x...7fdca0191780 [python] (most recent call first):
  File ".../trio/_core/_io_epoll.py", line 245 in get_events
  File ".../trio/_core/_run.py", line 2415 in run
  File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
  ...
```

No driver thread shows up. The abandoned-legacy-subint
thread still exists from the OS's POV (it's still running
inside `_interpreters.exec()` driving the subint's
`trio.run()` on `trio.sleep_forever()`) but the **main
interp's faulthandler can't see threads currently executing
inside a sub-interpreter's tstate**. Concretely: the thread
is alive, holding state we can't introspect from here.

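For readers unfamiliar with how such a dump gets captured, here is a minimal stdlib sketch of a hang watchdog built on `faulthandler`. It is an assumption about the general shape of a helper like `tractor.devx.dump_on_hang`, not its actual implementation; `dump_on_hang_sketch` and `demo_dump` are hypothetical names, and a short `time.sleep()` stands in for the hung `trio.run(main)`.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    """Write all visible thread stacks to `path` if the body outlives `seconds`."""
    with open(path, "w") as dump_file:
        # Arms a watchdog that writes tracebacks straight to the file's fd.
        faulthandler.dump_traceback_later(seconds, file=dump_file, exit=False)
        try:
            yield
        finally:
            # Disarm before the file closes; it must stay open while armed.
            faulthandler.cancel_dump_traceback_later()


def demo_dump(hang_for: float = 0.6) -> str:
    """Simulate a short hang and return the captured dump text."""
    fd, path = tempfile.mkstemp(suffix=".dump")
    os.close(fd)
    try:
        with dump_on_hang_sketch(seconds=0.2, path=path):
            time.sleep(hang_for)  # stand-in for a hung `trio.run(main)`
        with open(path) as f:
            return f.read()
    finally:
        os.unlink(path)
```

Writing to a file rather than stderr matters under pytest, whose output capture would otherwise swallow the dump. As noted above, a dump taken this way only lists threads the main interpreter can see.
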
## Root cause analysis

The most consistent explanation for both observations:

1. **Legacy-config subinterpreters share the main GIL.**
   PEP 734's public `concurrent.interpreters.create()`
   defaults to `'isolated'` (per-interp GIL), but tractor
   uses `_interpreters.create('legacy')` as a workaround
   for C extensions that don't yet support PEP 684
   (notably `msgspec`, see
   [jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Legacy-mode subints share process-global state
   including the GIL.

2. **Our abandoned subint thread never exits.** After our
   hard-kill timeout, `driver_thread.join()` is abandoned
   via `abandon_on_cancel=True` and the thread is
   `daemon=True` so proc-exit won't block on it, but the
   thread *itself* is still alive inside
   `_interpreters.exec()`, driving a `trio.run()` that
   will never return (the subint actor is in
   `trio.sleep_forever()`).

3. **`_interpreters.destroy()` cannot force-stop a running
   subint.** It raises `InterpreterError` on any
   still-running subinterpreter; there is no public
   CPython API to force-destroy one.

4. **Shared-GIL + non-terminating subint thread → main
   trio loop starvation.** Under enough load (the subint's
   trio event loop iterating in the background, IPC-layer
   tasks still in the subint, etc.) the main trio event
   loop can fail to iterate frequently enough to drain its
   wakeup pipe. Once that pipe fills, `SIGINT` writes from
   the C signal handler return `EAGAIN` and signals are
   silently dropped, exactly what `strace` shows.

The shielded
`await actor_nursery._join_procs.wait()` at the top of
`subint_proc` (inherited unchanged from the `trio_proc`
pattern) is structurally involved too: if main trio *does*
get a schedule slice, it'd find the `subint_proc` task
parked on `_join_procs` under shield, which traps whatever
`Cancelled` arrives. But that's a second-order effect; the
signal-pipe-full condition is the primary "Ctrl-C doesn't
work" cause.

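The "pipe fills, write returns `EAGAIN`, byte is lost" step can be shown in isolation with the stdlib, assuming a POSIX platform and a call from the main thread. This sketch fills the wakeup pipe by hand instead of starving a real event loop, so it demonstrates only the dropped wakeup byte, not the GIL contention that the analysis above proposes as the cause. The function name is hypothetical.

```python
import os
import signal


def sigint_byte_dropped_when_pipe_full() -> bool:
    """Fill the wakeup pipe, raise SIGINT, report whether its byte was dropped."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)  # set_wakeup_fd() requires non-blocking

    # Swallow SIGINT at the Python level so no KeyboardInterrupt is raised.
    old_handler = signal.signal(signal.SIGINT, lambda signum, frame: None)
    # warn_on_full_buffer=False silences CPython's warning on the failed write.
    old_fd = signal.set_wakeup_fd(write_fd, warn_on_full_buffer=False)
    try:
        # Nobody drains the pipe: add filler until the kernel reports EAGAIN.
        try:
            while True:
                os.write(write_fd, b"x")
        except BlockingIOError:
            pass

        signal.raise_signal(signal.SIGINT)  # the C handler's write() now fails

        # Drain everything and look for the SIGINT byte (0x02).
        drained = bytearray()
        try:
            while chunk := os.read(read_fd, 65536):
                drained += chunk
        except BlockingIOError:
            pass
        return bytes([signal.SIGINT]) not in drained
    finally:
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGINT, old_handler)
        os.close(read_fd)
        os.close(write_fd)
```

In this toy the main thread is busy running bytecode, so the Python-level handler still fires; the point is only that a loop sleeping on the wakeup fd would never be told about this particular signal.
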
## Why we can't fix this from inside tractor

- **No force-destroy API.** CPython provides neither a
  `_interpreters.force_destroy()` nor a
  thread-cancellation primitive (`pthread_cancel` is
  actively discouraged and unavailable on Windows). A
  subint stuck in pure-Python loops (or worse, C code that
  doesn't poll for signals) is structurally unreachable
  from outside.
- **Shared GIL is the root scheduling issue.** As long as
  we're forced into legacy-mode subints for `msgspec`
  compatibility, the abandoned-thread scenario is
  fundamentally a process-global GIL-starvation window.
- **`signal.set_wakeup_fd()` is process-global.** Even if
  we wanted to put our own drainer on the wakeup pipe,
  only one party owns it at a time.

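The "bounded wait, then abandon" shape that this limitation forces on a hard-kill path can be sketched with plain threads. This is an illustration only: `abandon_stuck_worker` is a hypothetical name, the worker is an ordinary Python thread rather than a subint driver, and the timeout is arbitrary.

```python
import threading


def abandon_stuck_worker(join_timeout: float = 0.2) -> bool:
    """Return True if the worker outlived the bounded join and was abandoned."""
    release = threading.Event()

    def stuck_worker() -> None:
        # Stand-in for a driver thread inside a run that never returns.
        release.wait()

    # daemon=True so process exit does not block on the stuck thread.
    worker = threading.Thread(target=stuck_worker, daemon=True)
    worker.start()

    # Bounded join: the only lever available, since Python's threading
    # module offers no way to kill or cancel another thread.
    worker.join(join_timeout)
    abandoned = worker.is_alive()

    # Demo-only cleanup so the sketch leaks nothing; a genuinely stuck
    # subint offers no equivalent of this release switch.
    release.set()
    worker.join()
    return abandoned
```

The caller regains control after the timeout, but the thread itself keeps running until it chooses to stop, which is exactly the window in which a shared GIL can be contended.
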
## Current workaround

- **Fixture-side SIGINT loop on the `daemon` subproc** (in
  this test's `daemon: subprocess.Popen` fixture in
  `tests/conftest.py`). The daemon dying closes its end of
  the registry IPC, which unblocks a pending recv in main
  trio's IPC-server task, which lets the event loop
  iterate, which drains the wakeup pipe, which finally
  delivers the test-harness SIGINT.
- **Module-level skip on py3.13**
  (`pytest.importorskip('concurrent.interpreters')`): the
  private `_interpreters` C module exists on 3.13 but the
  multi-trio-task interaction hangs silently there
  independently of this issue.

## Path forward

1. **Primary**: upstream `msgspec` PEP 684 adoption
   ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
   Unlocks `concurrent.interpreters.create()` isolated
   mode → per-interp GIL → abandoned subint threads no
   longer starve the parent's main trio loop. At that
   point we can flip `_subint.py` back to the public API
   (`create()` / `Interpreter.exec()` / `Interpreter.close()`)
   and drop the private `_interpreters` path.

2. **Secondary**: watch CPython for a public
   force-destroy primitive. If something like
   `Interpreter.close(force=True)` lands, we can use it as
   a hard-kill final stage and actually tear down
   abandoned subints.

3. **Harness-level**: document the fixture-side SIGINT
   loop pattern as the "known workaround" for
   subint-backend tests that can leave background state
   holding the main event loop hostage.

## References

- PEP 734 (`concurrent.interpreters`):
  <https://peps.python.org/pep-0734/>
- PEP 684 (per-interpreter GIL):
  <https://peps.python.org/pep-0684/>
- `msgspec` PEP 684 tracker:
  <https://github.com/jcrist/msgspec/issues/563>
- CPython `_interpretersmodule.c` source:
  <https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c>
- `tractor.spawn._subint` module docstring (in-tree
  explanation of the legacy-mode choice and its
  tradeoffs).

## Reproducer

```
./py314/bin/python -m pytest \
  tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
  --spawn-backend=subint \
  --tb=short --no-header -v
```

Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the
abandoned-thread warning in logs).

@ -14,6 +14,7 @@ import psutil
import pytest
import subprocess
import tractor
from tractor.devx import dump_on_hang
from tractor.trionics import collapse_eg
from tractor._testing import tractor_test
from tractor.discovery._addr import wrap_address

@ -562,4 +563,53 @@ def test_stale_entry_is_deleted(
await ptl.cancel_actor()
await an.cancel()

# TODO, remove once the `[subint]` variant no longer hangs.
#
# Status (as of Phase B hard-kill landing):
#
# - `[trio]`/`[mp_*]` variants: completes normally; `dump_on_hang`
#   is a no-op safety net here.
#
# - `[subint]` variant: hangs indefinitely AND is un-Ctrl-C-able.
#   `strace -p <pytest_pid>` while in the hang reveals a silently-
#   dropped SIGINT: the C signal handler tries to write the
#   signum byte to Python's signal-wakeup fd and gets `EAGAIN`,
#   meaning the pipe is full (nobody's draining it).
#
# Root-cause chain: our hard-kill in `spawn._subint` abandoned
# the driver OS-thread (which is `daemon=True`) after the soft-
# kill timeout, but the *sub-interpreter* inside that thread is
# still running `trio.run()`; `_interpreters.destroy()` can't
# force-stop a running subint (raises `InterpreterError`), and
# legacy-config subints share the main GIL. The abandoned subint
# starves the parent's trio event loop from iterating often
# enough to drain its wakeup pipe → SIGINT silently drops.
#
# This is structurally a CPython-level limitation: there's no
# public force-destroy primitive for a running subint. We
# escape on the harness side via a SIGINT-loop in the `daemon`
# fixture teardown (killing the bg registrar subproc closes its
# end of the IPC, which eventually unblocks a recv in main trio,
# which lets the loop drain the wakeup pipe). Long-term fix path:
# msgspec PEP 684 support (jcrist/msgspec#563) → isolated-mode
# subints with per-interp GIL.
#
# Full analysis:
# `ai/conc-anal/subint_sigint_starvation_issue.md`
#
# See also the *sibling* hang class documented in
# `ai/conc-anal/subint_cancel_delivery_hang_issue.md`: same
# subint backend, different root cause (Ctrl-C-able hang, main
# trio loop iterating fine; ours to fix, not CPython's).
# Reproduced by `tests/test_subint_cancellation.py
# ::test_subint_non_checkpointing_child`.
#
# Kept here (and not behind a `pytestmark.skip`) so we can still
# inspect the dump file if the hang ever returns after a refactor.
# `pytest`'s stderr capture eats `faulthandler` output otherwise,
# so we route `dump_on_hang` to a file.
with dump_on_hang(
    seconds=20,
    path=f'/tmp/test_stale_entry_is_deleted_{start_method}.dump',
):
    trio.run(main)

@ -182,6 +182,32 @@ def test_subint_non_checkpointing_child(
- ~3s: `_HARD_KILL_TIMEOUT` (thread-join wait)
- margin

KNOWN ISSUE (Ctrl-C-able hang):
-------------------------------
This test currently hangs past the hard-kill timeout for
reasons unrelated to the subint teardown itself: after
the subint is destroyed, a parent-side trio task appears
to park on an orphaned IPC channel (no clean EOF
delivered to a waiting receive). Unlike the
SIGINT-starvation sibling case in
`test_stale_entry_is_deleted`, this hang IS Ctrl-C-able
(`strace` shows SIGINT wakeup-fd `write() = 1`, not
`EAGAIN`), i.e. the main trio loop is still iterating
normally. That makes this *our* bug to fix, not a
CPython-level limitation.

See `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
for the full analysis + candidate fix directions
(explicit parent-side channel abort in `subint_proc`
teardown being the most likely surgical fix).

The sibling `ai/conc-anal/subint_sigint_starvation_issue.md`
documents the *other* hang class (abandoned-legacy-subint
thread + shared-GIL starvation → signal-wakeup-fd pipe
fills → SIGINT silently dropped); that one is
structurally blocked on msgspec PEP 684 adoption and is
NOT what this test is hitting.

'''
deadline: float = 15.0
with dump_on_hang(