subint backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)
Follow-up to the Phase B subint spawn-backend PR (see tractor.spawn._subint, issue #379). The hard-kill escape hatch we landed (_HARD_KILL_TIMEOUT, bounded shields, daemon=True driver-thread abandonment) handles most stuck-subint scenarios cleanly, but there’s one class of hang that can’t be fully escaped from within tractor: a still-running abandoned sub-interpreter can starve the parent’s trio event loop to the point where SIGINT is effectively dropped by the kernel ↔︎ Python boundary — making the pytest process un-Ctrl-C-able.
Symptom
Running test_stale_entry_is_deleted[subint] under --spawn-backend=subint:
- Test spawns a subactor (transport_fails_actor) which kills its own IPC server and then trio.sleep_forever().
- Parent tries Portal.cancel_actor() → channel disconnected → fast return.
- Nursery teardown triggers our subint_proc cancel path. Portal-cancel fails (dead channel), _HARD_KILL_TIMEOUT fires, driver thread is abandoned (daemon=True), _interpreters.destroy(interp_id) raises InterpreterError (because the subint is still running).
- Test appears to hang indefinitely at the outer async with tractor.open_nursery() as an: exit. Ctrl-C at the terminal does nothing. The pytest process is uninterruptible.
Evidence
strace on the hung pytest process
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(37, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140585542325792
Translated:
- Kernel delivers SIGINT to pytest.
- CPython’s C-level signal handler fires and tries to write the signal number byte (0x02 = SIGINT) to fd 37, the Python signal-wakeup fd (set via signal.set_wakeup_fd(), which trio uses to wake its event loop on signals).
- Write returns EAGAIN: the pipe is full. Nothing is draining it.
- rt_sigreturn with the signal masked off: the signal is “handled” from the kernel’s perspective but the actual Python-level handler (and therefore trio’s KeyboardInterrupt delivery) never runs.
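The EAGAIN step can be reproduced in isolation. The sketch below is a stand-in, not CPython's signal handler or trio's wakeup machinery: it uses an ordinary non-blocking pipe and a hypothetical helper, fill_until_eagain(), to show that single-byte writes are refused once nothing drains the read end, and that draining makes the next write succeed again.

```python
import errno
import os


def fill_until_eagain(write_fd: int) -> tuple[int, int]:
    """Write SIGINT-number bytes until the non-blocking fd refuses more.

    Returns (bytes_written, errno_of_the_refused_write).
    """
    written = 0
    while True:
        try:
            written += os.write(write_fd, b"\x02")  # 0x02 == SIGINT
        except BlockingIOError as exc:
            return written, exc.errno


read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # a wakeup fd must be non-blocking
os.set_blocking(read_fd, False)

written, err = fill_until_eagain(write_fd)
print(f"pipe refused further writes after {written} bytes: {errno.errorcode[err]}")

# Draining the read end is what makes the next write succeed again.
drained = len(os.read(read_fd, written))
retry_ok = os.write(write_fd, b"\x02") == 1
print(f"drained {drained} bytes, retry succeeded: {retry_ok}")

os.close(read_fd)
os.close(write_fd)
```

The byte count at which the pipe fills depends on the platform's pipe capacity, so the sketch reports it rather than assuming a value.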
Stack dump (via tractor.devx.dump_on_hang)
At 20s into the hang, only the main thread is visible:
Thread 0x...7fdca0191780 [python] (most recent call first):
File ".../trio/_core/_io_epoll.py", line 245 in get_events
File ".../trio/_core/_run.py", line 2415 in run
File ".../tests/discovery/test_registrar.py", line 575 in test_stale_entry_is_deleted
...
No driver thread shows up. The abandoned-legacy-subint thread still exists from the OS’s POV (it’s still running inside _interpreters.exec() driving the subint’s trio.run() on trio.sleep_forever()) but the main interp’s faulthandler can’t see threads currently executing inside a sub-interpreter’s tstate. Concretely: the thread is alive, holding state we can’t introspect from here.
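For comparison, a dump of this kind can be produced with the stdlib faulthandler module alone. This is a minimal sketch, not a description of how tractor.devx.dump_on_hang is implemented; dump_all_threads() and the parked worker thread are hypothetical names used only for illustration.

```python
import faulthandler
import tempfile
import threading


def dump_all_threads() -> str:
    """Return faulthandler's traceback dump for every thread it can see."""
    with tempfile.TemporaryFile(mode="w+") as out:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=out, all_threads=True)
        out.seek(0)
        return out.read()


# A parked background thread so the dump has more than one entry.
release = threading.Event()
worker = threading.Thread(target=release.wait, name="parked-worker", daemon=True)
worker.start()

dump = dump_all_threads()
print(dump)

release.set()
worker.join()
```

A thread parked in the same interpreter shows up here with its own "most recent call first" stanza, which is the entry that is missing for the abandoned driver thread in the dump above. For a hang that has not happened yet, faulthandler.dump_traceback_later() can schedule the same dump after a timeout.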
Root cause analysis
The most consistent explanation for both observations:
- Legacy-config subinterpreters share the main GIL. PEP 734’s public concurrent.interpreters.create() defaults to 'isolated' (per-interp GIL), but tractor uses _interpreters.create('legacy') as a workaround for C extensions that don’t yet support PEP 684 (notably msgspec, see jcrist/msgspec#563). Legacy-mode subints share process-global state including the GIL.
- Our abandoned subint thread never exits. After our hard-kill timeout, driver_thread.join() is abandoned via abandon_on_cancel=True and the thread is daemon=True so proc-exit won’t block on it, but the thread itself is still alive inside _interpreters.exec(), driving a trio.run() that will never return (the subint actor is in trio.sleep_forever()).
- _interpreters.destroy() cannot force-stop a running subint. It raises InterpreterError on any still-running subinterpreter; there is no public CPython API to force-destroy one.
- Shared-GIL + non-terminating subint thread → main trio loop starvation. Under enough load (the subint’s trio event loop iterating in the background, IPC-layer tasks still in the subint, etc.) the main trio event loop can fail to iterate frequently enough to drain its wakeup pipe. Once that pipe fills, SIGINT writes from the C signal handler return EAGAIN and signals are silently dropped, exactly what strace shows.
The shielded await actor_nursery._join_procs.wait() at the top of subint_proc (inherited unchanged from the trio_proc pattern) is structurally involved too: if main trio does get a schedule slice, it’d find the subint_proc task parked on _join_procs under shield — which traps whatever Cancelled arrives. But that’s a second-order effect; the signal-pipe-full condition is the primary “Ctrl-C doesn’t work” cause.
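The "abandoned but still alive" state of the driver thread can be illustrated with plain threading. This is a sketch under stated assumptions: tractor's real path runs through trio with abandon_on_cancel=True, whereas join_or_abandon() and the HARD_KILL_TIMEOUT value below are hypothetical stand-ins.

```python
import threading

HARD_KILL_TIMEOUT = 0.2  # hypothetical value, not tractor's _HARD_KILL_TIMEOUT


def join_or_abandon(thread: threading.Thread, timeout: float) -> bool:
    """Wait up to `timeout` seconds; return True if the thread exited."""
    thread.join(timeout)
    return not thread.is_alive()


# Stand-in for a driver thread whose workload never returns by itself.
never_set = threading.Event()
driver = threading.Thread(target=never_set.wait, daemon=True)
driver.start()

exited = join_or_abandon(driver, HARD_KILL_TIMEOUT)
print(f"exited={exited}, still alive after abandonment={driver.is_alive()}")

# The demo can release its stand-in; a real stuck subint offers no such switch.
never_set.set()
driver.join()
```

Giving up on join() only stops the waiting side. The thread keeps running, and in the legacy-mode case it keeps competing for the shared GIL.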
Why we can’t fix this from inside tractor
- No force-destroy API. CPython provides neither a _interpreters.force_destroy() nor a thread-cancellation primitive (pthread_cancel is actively discouraged and unavailable on Windows). A subint stuck in pure-Python loops (or worse, C code that doesn’t poll for signals) is structurally unreachable from outside.
- Shared GIL is the root scheduling issue. As long as we’re forced into legacy-mode subints for msgspec compatibility, the abandoned-thread scenario is fundamentally a process-global GIL-starvation window.
- signal.set_wakeup_fd() is process-global. Even if we wanted to put our own drainer on the wakeup pipe, only one party owns it at a time.
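The single-owner behaviour of signal.set_wakeup_fd() is easy to observe: each call replaces the registered fd and returns the previous one. The sketch below assumes it runs in the main thread of the main interpreter on a POSIX system; the two socket pairs are hypothetical stand-ins for an event loop's wakeup channel and a would-be second drainer.

```python
import signal
import socket

# Two would-be owners, each with its own non-blocking wakeup socket pair.
loop_r, loop_w = socket.socketpair()
ours_r, ours_w = socket.socketpair()
for sock in (loop_r, loop_w, ours_r, ours_w):
    sock.setblocking(False)
loop_fd = loop_w.fileno()

# First owner registers (only allowed in the main thread).
original = signal.set_wakeup_fd(loop_fd)

# A second registration silently replaces the first and returns its fd.
displaced = signal.set_wakeup_fd(ours_w.fileno())
print(f"displaced fd {displaced}, event loop's fd was {loop_fd}")

# Put back whatever was installed before this demo ran.
signal.set_wakeup_fd(original)
for sock in (loop_r, loop_w, ours_r, ours_w):
    sock.close()
```

Installing a second drainer would therefore take the wakeup fd away from the event loop rather than add to it.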
Current workaround
- Fixture-side SIGINT loop on the daemon subproc (in this test’s daemon: subprocess.Popen fixture in tests/conftest.py). The daemon dying closes its end of the registry IPC, which unblocks a pending recv in main trio’s IPC-server task, which lets the event loop iterate, which drains the wakeup pipe, which finally delivers the test-harness SIGINT.
- Module-level skip on py3.13 (pytest.importorskip('concurrent.interpreters')): the private _interpreters C module exists on 3.13 but the multi-trio-task interaction hangs silently there independently of this issue.
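The shape of a fixture-side SIGINT loop might look like the following. This is a hedged sketch, not the code in tests/conftest.py: sigint_until_exit(), its parameters, and the sleeping child process are hypothetical, and it assumes a POSIX system where Popen.send_signal() can deliver SIGINT.

```python
import signal
import subprocess
import sys
import time


def sigint_until_exit(
    proc: subprocess.Popen,
    attempts: int = 10,
    interval: float = 0.1,
) -> int:
    """Send SIGINT repeatedly until `proc` exits; SIGKILL as a last resort."""
    for _ in range(attempts):
        if proc.poll() is not None:
            break
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            continue
    if proc.poll() is None:
        proc.kill()  # SIGINT was ignored or never handled
        proc.wait()
    return proc.returncode


# Stand-in for the daemon fixture's process: a child that sleeps for a minute.
daemon = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stderr=subprocess.DEVNULL,  # hide the child's KeyboardInterrupt traceback
)
time.sleep(0.2)  # give the child a moment to start
returncode = sigint_until_exit(daemon)
print(f"daemon exited with return code {returncode}")
```

Bounding the loop and escalating to kill() keeps teardown from hanging in the case where the child never reacts to SIGINT.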
Path forward
- Primary: upstream msgspec PEP 684 adoption (jcrist/msgspec#563). Unlocks concurrent.interpreters.create() isolated mode → per-interp GIL → abandoned subint threads no longer starve the parent’s main trio loop. At that point we can flip _subint.py back to the public API (create() / Interpreter.exec() / Interpreter.close()) and drop the private _interpreters path.
- Secondary: watch CPython for a public force-destroy primitive. If something like Interpreter.close(force=True) lands, we can use it as a hard-kill final stage and actually tear down abandoned subints.
- Harness-level: document the fixture-side SIGINT loop pattern as the “known workaround” for subint-backend tests that can leave background state holding the main event loop hostage.
References
- PEP 734 (concurrent.interpreters): https://peps.python.org/pep-0734/
- PEP 684 (per-interpreter GIL): https://peps.python.org/pep-0684/
- msgspec PEP 684 tracker: https://github.com/jcrist/msgspec/issues/563
- CPython _interpretersmodule.c source: https://github.com/python/cpython/blob/main/Modules/_interpretersmodule.c
- tractor.spawn._subint module docstring (in-tree explanation of the legacy-mode choice and its tradeoffs).
Reproducer
./py314/bin/python -m pytest \
tests/discovery/test_registrar.py::test_stale_entry_is_deleted \
--spawn-backend=subint \
--tb=short --no-header -v
Hangs indefinitely without the fixture-side SIGINT loop; with the loop, the test completes (albeit with the abandoned-thread warning in logs).