diff --git a/ai/conc-anal/subint_sigint_starvation_issue.md b/ai/conc-anal/subint_sigint_starvation_issue.md
index 438b7b8a..60b266c6 100644
--- a/ai/conc-anal/subint_sigint_starvation_issue.md
+++ b/ai/conc-anal/subint_sigint_starvation_issue.md
@@ -203,3 +203,148 @@ work" cause.
 Hangs indefinitely without the fixture-side SIGINT loop;
 with the loop, the test completes (albeit with the
 abandoned-thread warning in logs).
+
+## Additional known-hanging tests (same class)
+
+All three tests below exhibit the same
+signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
+on the wakeup pipe after enough SIGINT attempts) and
+share the same structural cause — abandoned legacy-subint
+driver threads contending with the main interpreter for
+the shared GIL until the main trio loop can no longer
+drain its wakeup pipe fast enough to deliver signals.
+
+They're listed separately because each exposes the class
+under a different load pattern worth documenting.
+
+### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`
+
+Original exemplar — see the **Symptom** and **Evidence**
+sections above. One abandoned subint
+(`transport_fails_actor`, stuck in `trio.sleep_forever()`
+after self-cancelling its IPC server) is sufficient to
+tip main into starvation once the harness's `daemon`
+fixture subproc keeps its half of the registry IPC alive.
+
+### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`
+
+Cancel a grandchild that's in sync Python sleep from 2
+nurseries up. The test's own docstring declares the
+dependency: "its parent should issue a 'zombie reaper' to
+hard kill it after sufficient timeout" — which for
+`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
+subproc. **Under `subint` there's no equivalent** (no
+public CPython API to force-destroy a running
+sub-interpreter), so the grandchild's sync-sleeping
+`trio.run()` persists inside its abandoned driver thread
+indefinitely.
+The nested actor-tree (parent → child →
+grandchild, all subints) means a single cancel triggers
+multiple concurrent hard-kill abandonments, each leaving
+a live driver thread.
+
+This test often only manifests the starvation under
+**full-suite runs** rather than solo execution —
+earlier-in-session subint tests also leave abandoned
+driver threads behind, and the combined population is
+what actually tips main trio into starvation. Solo runs
+may stay Ctrl-C-able with fewer abandoned threads in the
+mix.
+
+### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`
+
+Nursery-error-path throughput stress-test parametrized
+for **25 concurrent subactors**. When the multierror
+fires and the nursery cancels, every subactor goes
+through our `subint_proc` teardown. The bounded
+hard-kills run in parallel (all `subint_proc` tasks are
+sibling trio tasks), so the timeout budget is ~3s total
+rather than 3s × 25. After that, **25 abandoned
+`daemon=True` driver threads are simultaneously alive** —
+an extreme pressure multiplier on the same mechanism.
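
The arithmetic behind the ~3s figure can be illustrated with a minimal sketch. It uses stdlib `asyncio` as a stand-in for trio so that it runs without third-party packages; `bounded_hard_kill`, `teardown_all`, and the scaled-down `HARD_KILL_TIMEOUT` are hypothetical names, not the project's `subint_proc` code. Each task waits on a teardown that never completes, and because the waits run concurrently the total wall time stays close to one timeout rather than 25 of them.

```python
import asyncio
import time

N_SUBACTORS = 25          # matches the test's parametrization
HARD_KILL_TIMEOUT = 0.1   # assumption: scaled-down stand-in for the ~3s budget


async def bounded_hard_kill(idx: int, timeout: float) -> bool:
    """Wait up to `timeout` for a teardown that never finishes.

    Returns True when the wait timed out, which models a driver
    thread being abandoned.
    """
    driver_exited = asyncio.Event()  # hypothetical signal, never set here
    try:
        await asyncio.wait_for(driver_exited.wait(), timeout)
    except asyncio.TimeoutError:
        return True
    return False


async def teardown_all(n: int, timeout: float) -> tuple[int, float]:
    """Run all bounded waits as sibling tasks and time the whole batch."""
    start = time.monotonic()
    results = await asyncio.gather(
        *(bounded_hard_kill(i, timeout) for i in range(n))
    )
    return sum(results), time.monotonic() - start


abandoned, elapsed = asyncio.run(teardown_all(N_SUBACTORS, HARD_KILL_TIMEOUT))
print(f"abandoned={abandoned} elapsed={elapsed:.2f}s")
```

With these values every wait times out, so `abandoned` equals `N_SUBACTORS`, while `elapsed` should land near a single `HARD_KILL_TIMEOUT` and far below `HARD_KILL_TIMEOUT * N_SUBACTORS`.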
+
+The `strace` fingerprint is striking under this load: six
+or more **successful** `write(16, "\2", 1) = 1` calls
+(main trio getting brief GIL slices, each long enough to
+drain exactly one wakeup-pipe byte) before finally
+saturating with `EAGAIN`:
+
+```
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = 1
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
+write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
+rt_sigreturn({mask=[WINCH]}) = 140141623162400
+```
+
+Those successful writes indicate CPython's
+`sys.getswitchinterval()`-based GIL round-robin *is*
+giving main brief slices — just never long enough to run
+the Python-level signal handler through to the point
+where trio converts the delivered SIGINT into a
+`Cancelled` on the appropriate scope. Once the
+accumulated write rate outpaces main's drain rate, the
+pipe saturates and subsequent signals are silently
+dropped.
+
+The `pstree` below (pid `530060` = hung `pytest`) shows
+the subint-driver thread population at the moment of
+capture.
+Fewer than the full 25 are shown, and Linux caps thread
+names at 15 characters, so each name is cut to
+`subint-driver[3` or `subint-driver[4` and at most the
+first digit of each interpreter id is visible across the
+16 thread entries. Even so, the GIL-contender count is
+more than enough to explain the starvation:
+
+```
+>>> pstree -snapt 530060
+systemd,1 --switched-root --system --deserialize=40
+  └─login,1545 --
+      └─bash,1872
+          └─sway,2012
+              └─alacritty,70471 -e xonsh
+                  └─xonsh,70487 .../bin/xonsh
+                      └─uv,70955 run xonsh
+                          └─xonsh,70959 .../py314/bin/xonsh
+                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
+                                  ├─{subint-driver[3},531857
+                                  ├─{subint-driver[3},531860
+                                  ├─{subint-driver[3},531862
+                                  ├─{subint-driver[3},531866
+                                  ├─{subint-driver[3},531877
+                                  ├─{subint-driver[3},531882
+                                  ├─{subint-driver[3},531884
+                                  ├─{subint-driver[3},531945
+                                  ├─{subint-driver[3},531950
+                                  ├─{subint-driver[3},531952
+                                  ├─{subint-driver[4},531956
+                                  ├─{subint-driver[4},531959
+                                  ├─{subint-driver[4},531961
+                                  ├─{subint-driver[4},531965
+                                  ├─{subint-driver[4},531968
+                                  └─{subint-driver[4},531979
+```
+
+(`pstree` uses `{...}` to denote threads rather than
+processes — these are all the **driver OS-threads** our
+`subint_proc` creates with name
+`f'subint-driver[{interp_id}]'`. Every one of them is
+still alive, executing `_interpreters.exec()` inside a
+sub-interpreter our hard-kill has abandoned. At 16+
+abandoned driver threads competing for the shared GIL,
+the main-interpreter trio loop gets starved and signal
+delivery stalls.)
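
The thread listing above can be tallied mechanically, and the cut-off names reproduced, with a small sketch. It assumes the Linux 15-character limit on thread names; `truncated_name` and `count_driver_threads` are hypothetical helpers, and `SAMPLE` is a three-line excerpt of the `pstree` capture rather than the full output.

```python
import re
from collections import Counter

# Assumption: Linux reports at most 15 visible characters of a thread name.
COMM_MAX = 15

# Matches pstree thread entries such as `{subint-driver[3},531857`.
THREAD_RE = re.compile(r"\{(subint-driver\[[^}]*)\},(\d+)")


def truncated_name(interp_id: int) -> str:
    """Thread name as it would be reported after the 15-character cut."""
    return f"subint-driver[{interp_id}]"[:COMM_MAX]


def count_driver_threads(pstree_output: str) -> Counter:
    """Tally driver threads by their (possibly truncated) thread name."""
    return Counter(name for name, _tid in THREAD_RE.findall(pstree_output))


# Three entries copied from the capture above, for illustration only.
SAMPLE = """\
 ├─{subint-driver[3},531857
 ├─{subint-driver[3},531860
 └─{subint-driver[4},531979
"""

print(truncated_name(3), truncated_name(31))
print(count_driver_threads(SAMPLE))
```

Because `subint-driver[` is already 14 characters long, ids `3` and `31` both collapse to `subint-driver[3`, which is why the listing cannot distinguish individual interpreters. Feeding the full capture to `count_driver_threads` would give the per-prefix thread counts directly.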