Expand `subint` sigint-starvation hang catalog
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.
Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
the grandchild persists in its abandoned driver thread. Often only
manifests under full-suite runs (earlier tests seed the
abandoned-thread pool).
- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
all go through teardown on the multierror. Bounded hard-kills run in
parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
abandoned driver threads simultaneously alive, an extreme pressure
multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
(GIL round-robin IS giving main brief slices) before finally
saturating with `EAGAIN`.
Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent 985ea76de5
commit f3cea714bc
@@ -203,3 +203,148 @@ work" cause.
Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the
abandoned-thread warning in logs).

## Additional known-hanging tests (same class)

All three tests below exhibit the same
signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
on the wakeup pipe after enough SIGINT attempts) and
share the same structural cause — abandoned legacy-subint
driver threads contending with the main interpreter for
the shared GIL until the main trio loop can no longer
drain its wakeup pipe fast enough to deliver signals.

They're listed separately because each exposes the class
under a different load pattern worth documenting.

### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`

Original exemplar — see the **Symptom** and **Evidence**
sections above. One abandoned subint
(`transport_fails_actor`, stuck in `trio.sleep_forever()`
after self-cancelling its IPC server) is sufficient to
tip main into starvation once the harness's `daemon`
fixture subproc keeps its half of the registry IPC alive.

### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`

Cancel a grandchild that's in sync Python sleep from 2
nurseries up. The test's own docstring declares the
dependency: "its parent should issue a 'zombie reaper' to
hard kill it after sufficient timeout" — which for
`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
subproc. **Under `subint` there's no equivalent** (no
public CPython API to force-destroy a running
sub-interpreter), so the grandchild's sync-sleeping
`trio.run()` persists inside its abandoned driver thread
indefinitely. The nested actor-tree (parent → child →
grandchild, all subints) means a single cancel triggers
multiple concurrent hard-kill abandonments, each leaving
a live driver thread.

This test often only manifests the starvation under
**full-suite runs** rather than solo execution —
earlier-in-session subint tests also leave abandoned
driver threads behind, and the combined population is
what actually tips main trio into starvation. Solo runs
may stay Ctrl-C-able with fewer abandoned threads in the
mix.
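
The abandonment described above can be illustrated with plain threads. This is a minimal sketch, not the project's `subint_proc` code: `sync_sleeper` and `bounded_join` are hypothetical names, a `time.sleep()` stands in for the grandchild's sync sleep, and the timeouts are scaled down. It only shows that a bounded wait returns on schedule while the blocked thread stays alive, since Python has no way to kill a running thread from outside.

```python
import threading
import time


def sync_sleeper(duration: float) -> None:
    # Stand-in for a grandchild stuck in a sync sleep; it never
    # checks for cancellation.
    time.sleep(duration)


def bounded_join(thread: threading.Thread, timeout: float) -> bool:
    # Wait at most `timeout` seconds; return True if the thread is
    # still alive afterwards, i.e. it has to be abandoned.
    thread.join(timeout)
    return thread.is_alive()


driver = threading.Thread(
    target=sync_sleeper,
    args=(1.0,),
    name='subint-driver[demo]',  # illustrative name only
    daemon=True,  # an abandoned thread must not block process exit
)
driver.start()
abandoned = bounded_join(driver, timeout=0.1)
print(f'abandoned={abandoned} live_threads={threading.active_count()}')
```

Each thread abandoned this way keeps competing for the GIL whenever it runs Python code, which is why the population left behind by earlier tests matters.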

### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`

Nursery-error-path throughput stress-test parametrized
for **25 concurrent subactors**. When the multierror
fires and the nursery cancels, every subactor goes
through our `subint_proc` teardown. The bounded
hard-kills run in parallel (all `subint_proc` tasks are
sibling trio tasks), so the timeout budget is ~3s total
rather than 3s × 25. After that, **25 abandoned
`daemon=True` driver threads are simultaneously alive** —
an extreme pressure multiplier on the same mechanism.
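
The budget arithmetic can be checked with a small sketch. It makes several assumptions: `asyncio` stands in for trio so the snippet runs on the stdlib alone, `bounded_teardown` is a hypothetical name, and the timeout is scaled down from ~3s to 50ms. With sibling tasks the per-task timeouts overlap, so total wall time stays near one timeout rather than 25 of them.

```python
import asyncio
import time

HARD_KILL_TIMEOUT = 0.05  # scaled-down stand-in for the ~3s budget
N_SUBACTORS = 25


async def bounded_teardown(idx: int) -> str:
    # The "subactor" never exits on its own, so every teardown hits
    # its timeout and gives up.
    never_exits = asyncio.Event()
    try:
        await asyncio.wait_for(never_exits.wait(), HARD_KILL_TIMEOUT)
        return 'exited'
    except asyncio.TimeoutError:
        return 'abandoned'


async def main() -> tuple[list[str], float]:
    start = time.monotonic()
    # Sibling tasks: all 25 timeouts tick down concurrently.
    results = await asyncio.gather(
        *(bounded_teardown(i) for i in range(N_SUBACTORS))
    )
    return results, time.monotonic() - start


results, elapsed = asyncio.run(main())
print(f'{results.count("abandoned")} abandoned in {elapsed:.3f}s')
```

A sequential loop over the same 25 teardowns would instead take roughly `N_SUBACTORS * HARD_KILL_TIMEOUT`.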

The `strace` fingerprint is striking under this load: six
or more **successful** `write(16, "\2", 1) = 1` calls
(main trio getting brief GIL slices, each long enough to
drain exactly one wakeup-pipe byte) before finally
saturating with `EAGAIN`:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140141623162400
```

Those successful writes indicate CPython's
`sys.getswitchinterval()`-based GIL round-robin *is*
giving main brief slices — just never long enough to run
the Python-level signal handler through to the point
where trio converts the delivered SIGINT into a
`Cancelled` on the appropriate scope. Once the
accumulated write rate outpaces main's drain rate, the
pipe saturates and subsequent signals are silently
dropped.
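
The saturation step itself is easy to reproduce in isolation. The sketch below is an illustration under assumptions, not trio's or CPython's code: a `socket.socketpair()` stands in for the wakeup channel, and the loop plays the role of the C-level handler writing one byte per signal (the `"\2"` in the trace is the signal number for SIGINT). With nobody reading, the non-blocking write end eventually fails with `BlockingIOError`, which is Python's surface for `EAGAIN`.

```python
import socket

# A socketpair stands in for the signal wakeup channel. The write
# end is non-blocking, as signal.set_wakeup_fd() requires of its fd.
reader, writer = socket.socketpair()
writer.setblocking(False)

written = 0
saturated = False
try:
    while True:
        # One byte per "signal", like the handler's write() call.
        written += writer.send(b'\x02')
except BlockingIOError:
    # EAGAIN/EWOULDBLOCK: the buffer is full because nothing drained it.
    saturated = True

print(f'saturated after {written} undrained bytes')

# Draining the read end makes room for further writes.
drained = len(reader.recv(65536))
writer.send(b'\x02')
reader.close()
writer.close()
```

The number of bytes accepted before saturation depends on the platform's socket buffer accounting, so the printed count will vary between systems.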

The `pstree` below (pid `530060` = hung `pytest`) shows
the subint-driver thread population at the moment of
capture. Even with fewer than the full 25 shown (pstree
truncates thread names to `subint-driver[<interp_id>` —
interpreters `3` and `4` visible across 16 thread
entries), the GIL-contender count is more than enough to
explain the starvation:

```
>>> pstree -snapt 530060
systemd,1 --switched-root --system --deserialize=40
  └─login,1545 --
      └─bash,1872
          └─sway,2012
              └─alacritty,70471 -e xonsh
                  └─xonsh,70487 .../bin/xonsh
                      └─uv,70955 run xonsh
                          └─xonsh,70959 .../py314/bin/xonsh
                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
                                  ├─{subint-driver[3},531857
                                  ├─{subint-driver[3},531860
                                  ├─{subint-driver[3},531862
                                  ├─{subint-driver[3},531866
                                  ├─{subint-driver[3},531877
                                  ├─{subint-driver[3},531882
                                  ├─{subint-driver[3},531884
                                  ├─{subint-driver[3},531945
                                  ├─{subint-driver[3},531950
                                  ├─{subint-driver[3},531952
                                  ├─{subint-driver[4},531956
                                  ├─{subint-driver[4},531959
                                  ├─{subint-driver[4},531961
                                  ├─{subint-driver[4},531965
                                  ├─{subint-driver[4},531968
                                  └─{subint-driver[4},531979
```

(`pstree` uses `{...}` to denote threads rather than
processes — these are all the **driver OS-threads** our
`subint_proc` creates with name
`f'subint-driver[{interp_id}]'`. Every one of them is
still alive, executing `_interpreters.exec()` inside a
sub-interpreter our hard-kill has abandoned. At 16+
abandoned driver threads competing for the main GIL, the
main-interpreter trio loop gets starved and signal
delivery stalls.)
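
The same population can be counted from inside the process. The sketch below assumes the driver threads are started through the main interpreter's `threading` module with the name shown above; `live_driver_threads` is a hypothetical helper, and the three stand-in threads simply block on an event. It also shows that cutting the name at 15 characters, the Linux limit for a thread's comm name, reproduces the truncated form seen in the capture.

```python
import threading


def live_driver_threads(prefix: str = 'subint-driver[') -> list[threading.Thread]:
    # Threads known to this interpreter's `threading` module whose
    # names match the driver naming scheme.
    return [
        t for t in threading.enumerate()
        if t.name.startswith(prefix)
    ]


release = threading.Event()
for interp_id in (3, 3, 4):
    # Stand-ins for abandoned drivers: blocked until released.
    threading.Thread(
        target=release.wait,
        name=f'subint-driver[{interp_id}]',
        daemon=True,
    ).start()

drivers = live_driver_threads()
count = len(drivers)
# 15 characters is where the names in the capture are cut off.
truncated = sorted({t.name[:15] for t in drivers})
print(count, truncated)

release.set()  # let the stand-in threads finish
```

A check like this in a test fixture could log the abandoned-driver count before each test, making it visible when a full-suite run is approaching the population at which hangs were observed.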