Expand `subint` sigint-starvation hang catalog

Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.

Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
  actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
  reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
  the grandchild persists in its abandoned driver thread. Often only
  manifests under full-suite runs (earlier tests seed the
  abandoned-thread pool).

- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
  all go through teardown on the multierror. Bounded hard-kills run in
  parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
  abandoned driver threads simultaneously alive, an extreme pressure
  multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
  (GIL round-robin IS giving main brief slices) before finally
  saturating with `EAGAIN`.

Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-21 17:42:37 -04:00
parent 985ea76de5
commit f3cea714bc
1 changed files with 145 additions and 0 deletions

View File

@ -203,3 +203,148 @@ work" cause.
Hangs indefinitely without the fixture-side SIGINT loop; Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the with the loop, the test completes (albeit with the
abandoned-thread warning in logs). abandoned-thread warning in logs).
## Additional known-hanging tests (same class)
All three tests below exhibit the same
signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
on the wakeup pipe after enough SIGINT attempts) and
share the same structural cause — abandoned legacy-subint
driver threads contending with the main interpreter for
the shared GIL until the main trio loop can no longer
drain its wakeup pipe fast enough to deliver signals.
They're listed separately because each exposes the class
under a different load pattern worth documenting.
### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`
Original exemplar — see the **Symptom** and **Evidence**
sections above. One abandoned subint
(`transport_fails_actor`, stuck in `trio.sleep_forever()`
after self-cancelling its IPC server) is sufficient to
tip main into starvation once the harness's `daemon`
fixture subproc keeps its half of the registry IPC alive.
### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`
Cancel a grandchild that's in sync Python sleep from 2
nurseries up. The test's own docstring declares the
dependency: "its parent should issue a 'zombie reaper' to
hard kill it after sufficient timeout" — which for
`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
subproc. **Under `subint` there's no equivalent** (no
public CPython API to force-destroy a running
sub-interpreter), so the grandchild's sync-sleeping
`trio.run()` persists inside its abandoned driver thread
indefinitely. The nested actor-tree (parent → child →
grandchild, all subints) means a single cancel triggers
multiple concurrent hard-kill abandonments, each leaving
a live driver thread.
This test often only manifests the starvation under
**full-suite runs** rather than solo execution —
earlier-in-session subint tests also leave abandoned
driver threads behind, and the combined population is
what actually tips main trio into starvation. Solo runs
may stay Ctrl-C-able with fewer abandoned threads in the
mix.
### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`
Nursery-error-path throughput stress-test parametrized
for **25 concurrent subactors**. When the multierror
fires and the nursery cancels, every subactor goes
through our `subint_proc` teardown. The bounded
hard-kills run in parallel (all `subint_proc` tasks are
sibling trio tasks), so the timeout budget is ~3s total
rather than 3s × 25. After that, **25 abandoned
`daemon=True` driver threads are simultaneously alive** —
an extreme pressure multiplier on the same mechanism.
The `strace` fingerprint is striking under this load: six
or more **successful** `write(16, "\2", 1) = 1` calls
(main trio getting brief GIL slices, each long enough to
drain exactly one wakeup-pipe byte) before finally
saturating with `EAGAIN`:
```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140141623162400
```
Those successful writes indicate CPython's
`sys.getswitchinterval()`-based GIL round-robin *is*
giving main brief slices — just never long enough to run
the Python-level signal handler through to the point
where trio converts the delivered SIGINT into a
`Cancelled` on the appropriate scope. Once the
accumulated write rate outpaces main's drain rate, the
pipe saturates and subsequent signals are silently
dropped.
The `pstree` below (pid `530060` = hung `pytest`) shows
the subint-driver thread population at the moment of
capture. Even with fewer than the full 25 shown (pstree
truncates thread names to `subint-driver[<interp_id>`
interpreters `3` and `4` visible across 16 thread
entries), the GIL-contender count is more than enough to
explain the starvation:
```
>>> pstree -snapt 530060
systemd,1 --switched-root --system --deserialize=40
└─login,1545 --
└─bash,1872
└─sway,2012
└─alacritty,70471 -e xonsh
└─xonsh,70487 .../bin/xonsh
└─uv,70955 run xonsh
└─xonsh,70959 .../py314/bin/xonsh
└─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
├─{subint-driver[3},531857
├─{subint-driver[3},531860
├─{subint-driver[3},531862
├─{subint-driver[3},531866
├─{subint-driver[3},531877
├─{subint-driver[3},531882
├─{subint-driver[3},531884
├─{subint-driver[3},531945
├─{subint-driver[3},531950
├─{subint-driver[3},531952
├─{subint-driver[4},531956
├─{subint-driver[4},531959
├─{subint-driver[4},531961
├─{subint-driver[4},531965
├─{subint-driver[4},531968
└─{subint-driver[4},531979
```
(`pstree` uses `{...}` to denote threads rather than
processes — these are all the **driver OS-threads** our
`subint_proc` creates with name
`f'subint-driver[{interp_id}]'`. Every one of them is
still alive, executing `_interpreters.exec()` inside a
sub-interpreter our hard-kill has abandoned. At 16+
abandoned driver threads competing for the main GIL, the
main-interpreter trio loop gets starved and signal
delivery stalls.)