From f3cea714bcca0fc49b8170dab8014ca11008a968 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Tue, 21 Apr 2026 17:42:37 -0400
Subject: [PATCH] Expand `subint` sigint-starvation hang catalog
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint
driver threads → shared-GIL starvation → `write() = EAGAIN` on the
wakeup pipe → silent SIGINT drop), different load patterns.

Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`:
  nested actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*`
  the "zombie reaper" is a subproc `SIGKILL`; no equivalent exists
  under subint, so the grandchild persists in its abandoned driver
  thread. Often only manifests under full-suite runs (earlier tests
  seed the abandoned-thread pool).

- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent
  subactors all go through teardown on the multierror. Bounded
  hard-kills run in parallel — so the total budget is ~3s, not
  3s × 25. Leaves 25 abandoned driver threads simultaneously alive,
  an extreme pressure multiplier. `strace` shows several successful
  `write(16, "\2", 1) = 1` (GIL round-robin IS giving main brief
  slices) before finally saturating with `EAGAIN`.

Also include a `pstree -snapt ` capture showing 16+ live
`{subint-driver[}` threads at the moment of hang — the direct
GIL-contender population.

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 .../subint_sigint_starvation_issue.md | 145 ++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/ai/conc-anal/subint_sigint_starvation_issue.md b/ai/conc-anal/subint_sigint_starvation_issue.md
index 438b7b8a..60b266c6 100644
--- a/ai/conc-anal/subint_sigint_starvation_issue.md
+++ b/ai/conc-anal/subint_sigint_starvation_issue.md
@@ -203,3 +203,148 @@ work" cause.
 Hangs indefinitely without the fixture-side SIGINT loop;
 with the loop, the test completes (albeit with the
 abandoned-thread warning in logs).
+
+## Additional known-hanging tests (same class)
+
+All three tests below exhibit the same
+signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
+on the wakeup pipe after enough SIGINT attempts) and
+share the same structural cause — abandoned legacy-subint
+driver threads contending with the main interpreter for
+the shared GIL until the main trio loop can no longer
+drain its wakeup pipe fast enough to deliver signals.
+
+They're listed separately because each exposes the class
+under a different load pattern worth documenting.
+
+### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`
+
+Original exemplar — see the **Symptom** and **Evidence**
+sections above. One abandoned subint
+(`transport_fails_actor`, stuck in `trio.sleep_forever()`
+after self-cancelling its IPC server) is sufficient to
+tip main into starvation once the harness's `daemon`
+fixture subproc keeps its half of the registry IPC alive.
+
+### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`
+
+Cancel a grandchild that's in sync Python sleep from 2
+nurseries up. The test's own docstring declares the
+dependency: "its parent should issue a 'zombie reaper' to
+hard kill it after sufficient timeout" — which for
+`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
+subproc. **Under `subint` there's no equivalent** (no
+public CPython API to force-destroy a running
+sub-interpreter), so the grandchild's sync-sleeping
+`trio.run()` persists inside its abandoned driver thread
+indefinitely. The nested actor-tree (parent → child →
+grandchild, all subints) means a single cancel triggers
+multiple concurrent hard-kill abandonments, each leaving
+a live driver thread.
+
+This test often only manifests the starvation under
+**full-suite runs** rather than solo execution —
+earlier-in-session subint tests also leave abandoned
+driver threads behind, and the combined population is
+what actually tips main trio into starvation. Solo runs
+may stay Ctrl-C-able with fewer abandoned threads in the
+mix.
+
+### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`
+
+Nursery-error-path throughput stress-test parametrized
+for **25 concurrent subactors**. When the multierror
+fires and the nursery cancels, every subactor goes
+through our `subint_proc` teardown. The bounded
+hard-kills run in parallel (all `subint_proc` tasks are
+sibling trio tasks), so the timeout budget is ~3s total
+rather than 3s × 25. After that, **25 abandoned
+`daemon=True` driver threads are simultaneously alive** —
+an extreme pressure multiplier on the same mechanism.
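
The "~3s total rather than 3s × 25" budget can be illustrated with a small sketch. It uses stdlib `asyncio` in place of trio purely to stay dependency-free, and `bounded_hard_kill`, `TIMEOUT` and `N_ACTORS` are hypothetical names, not the project's `subint_proc` code. Each sibling task waits a bounded time on a "driver" that never exits, then gives up on it:

```python
import asyncio
import time

TIMEOUT = 0.05   # shortened stand-in for the ~3s hard-kill bound
N_ACTORS = 25    # matches the test's parametrization


async def bounded_hard_kill(idx: int) -> bool:
    """Wait up to TIMEOUT for a driver that never exits.

    Returns True when the wait timed out, i.e. the driver is abandoned.
    """
    never_done = asyncio.Event()  # never set: models a stuck driver thread
    try:
        await asyncio.wait_for(never_done.wait(), timeout=TIMEOUT)
    except asyncio.TimeoutError:
        return True
    return False


async def teardown_all() -> tuple[int, float]:
    """Run every bounded wait as a sibling task and time the whole batch."""
    start = time.monotonic()
    results = await asyncio.gather(
        *(bounded_hard_kill(i) for i in range(N_ACTORS))
    )
    return sum(results), time.monotonic() - start


abandoned, elapsed = asyncio.run(teardown_all())
print(
    f"abandoned={abandoned} elapsed={elapsed:.3f}s "
    f"serial_would_be={TIMEOUT * N_ACTORS:.2f}s"
)
```

Because the waits overlap, wall-clock time stays close to one timeout while all 25 drivers end up abandoned at roughly the same moment, which is the simultaneous-pressure condition described above.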
+
+The `strace` fingerprint is striking under this load: six
+or more **successful** `write(16, "\2", 1) = 1` calls
+(main trio getting brief GIL slices, each long enough to
+drain exactly one wakeup-pipe byte) before finally
+saturating with `EAGAIN`:
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+```
+
+Those successful writes indicate CPython's
+`sys.getswitchinterval()`-based GIL round-robin *is*
+giving main brief slices — just never long enough to run
+the Python-level signal handler through to the point
+where trio converts the delivered SIGINT into a
+`Cancelled` on the appropriate scope. Once the
+accumulated write rate outpaces main's drain rate, the
+pipe saturates and subsequent signals are silently
+dropped.
+
+The `pstree` below (pid `530060` = hung `pytest`) shows
+the subint-driver thread population at the moment of
+capture. Even with fewer than the full 25 shown (thread
+names are capped at 15 characters, so only the leading
+digit of each interpreter id survives: `3` and `4` across
+16 thread entries), the GIL-contender count is more than
+enough to explain the starvation:
+
+```
+>>> pstree -snapt 530060
+systemd,1 --switched-root --system --deserialize=40
+  └─login,1545 --
+      └─bash,1872
+          └─sway,2012
+              └─alacritty,70471 -e xonsh
+                  └─xonsh,70487 .../bin/xonsh
+                      └─uv,70955 run xonsh
+                          └─xonsh,70959 .../py314/bin/xonsh
+                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
+                                  ├─{subint-driver[3},531857
+                                  ├─{subint-driver[3},531860
+                                  ├─{subint-driver[3},531862
+                                  ├─{subint-driver[3},531866
+                                  ├─{subint-driver[3},531877
+                                  ├─{subint-driver[3},531882
+                                  ├─{subint-driver[3},531884
+                                  ├─{subint-driver[3},531945
+                                  ├─{subint-driver[3},531950
+                                  ├─{subint-driver[3},531952
+                                  ├─{subint-driver[4},531956
+                                  ├─{subint-driver[4},531959
+                                  ├─{subint-driver[4},531961
+                                  ├─{subint-driver[4},531965
+                                  ├─{subint-driver[4},531968
+                                  └─{subint-driver[4},531979
+```
+
+(`pstree` uses `{...}` to denote threads rather than
+processes — these are all the **driver OS-threads** our
+`subint_proc` creates with name
+`f'subint-driver[{interp_id}]'`. Every one of them is
+still alive, executing `_interpreters.exec()` inside a
+sub-interpreter our hard-kill has abandoned. At 16+
+abandoned driver threads competing for the main GIL, the
+main-interpreter trio loop gets starved and signal
+delivery stalls.)
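
The saturation step itself (`write() = EAGAIN` once nothing drains the wakeup fd) can be reproduced in isolation. This is a minimal sketch under stated assumptions: a non-blocking `socket.socketpair()` stands in for the wakeup fd, and `notify` / `drain` are hypothetical helpers that only mimic the C-level handler and the event loop's reader. It does not install a real signal handler or involve any sub-interpreter.

```python
import socket

# Assumption: a non-blocking socketpair plays the role of the wakeup fd.
reader, writer = socket.socketpair()
reader.setblocking(False)
writer.setblocking(False)


def notify() -> bool:
    """Mimic the C-level handler: write one byte, report whether it landed."""
    try:
        writer.send(b"\x02")  # same byte as in the strace capture
        return True
    except BlockingIOError:   # errno EAGAIN / EWOULDBLOCK
        return False


def drain() -> int:
    """Read everything queued; this is the step a starved loop never reaches."""
    total = 0
    while True:
        try:
            chunk = reader.recv(4096)
        except BlockingIOError:
            return total
        total += len(chunk)


# Nobody drains, so notifications succeed only until the buffer is full.
delivered = 0
while notify():
    delivered += 1

dropped = not notify()   # further notifications are lost without any error
drained = drain()        # the reader finally gets a turn
recovered = notify()     # writes succeed again once space is freed

print(f"delivered={delivered} dropped={dropped} "
      f"drained={drained} recovered={recovered}")

reader.close()
writer.close()
```

The number of writes that fit before `EAGAIN` depends on the platform's socket buffer size, so the sketch reports it rather than assuming a value. The point is only the shape: successful writes, then silent drops, then recovery as soon as the reading side runs.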