Expand `subint` sigint-starvation hang catalog
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.
Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
the grandchild persists in its abandoned driver thread. Often only
manifests under full-suite runs (earlier tests seed the
abandoned-thread pool).
- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
all go through teardown on the multierror. Bounded hard-kills run in
parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
abandoned driver threads simultaneously alive, an extreme pressure
multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
(GIL round-robin IS giving main brief slices) before finally
saturating with `EAGAIN`.
Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent 985ea76de5
commit f3cea714bc
@@ -203,3 +203,148 @@ work" cause.
Hangs indefinitely without the fixture-side SIGINT loop;
with the loop, the test completes (albeit with the
abandoned-thread warning in logs).

## Additional known-hanging tests (same class)

All three tests below exhibit the same
signal-wakeup-fd-starvation fingerprint (`write() → EAGAIN`
on the wakeup pipe after enough SIGINT attempts) and
share the same structural cause: abandoned legacy-subint
driver threads contending with the main interpreter for
the shared GIL until the main trio loop can no longer
drain its wakeup pipe fast enough to deliver signals.

They're listed separately because each exposes the class
under a different load pattern worth documenting.

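The saturation step itself can be reproduced in isolation. The following is a minimal sketch, assuming a plain `os.pipe()` as a stand-in for the real wakeup fd and a hypothetical helper name `fill_until_eagain`; it is not how `trio` or the test harness sets anything up. It writes the same one-byte payload seen in the traces into a non-blocking pipe that nobody drains, until the kernel refuses with `EAGAIN`.

```python
import errno
import os


def fill_until_eagain() -> tuple[int, int]:
    """Write 1-byte payloads into an undrained, non-blocking pipe.

    Returns (bytes_written, errno_of_the_failing_write).
    NOTE: hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # a wakeup fd must be non-blocking
    written = 0
    try:
        while True:
            try:
                # b"\x02" mirrors the `write(16, "\2", 1)` payload (SIGINT)
                written += os.write(write_fd, b"\x02")
            except BlockingIOError as exc:
                # Nobody is reading: the kernel buffer is full -> EAGAIN
                return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


written, err = fill_until_eagain()
print(f"accepted {written} bytes, then failed with {errno.errorcode[err]}")
```

The number of accepted bytes depends on the kernel's buffer size, so it is deliberately not asserted here; the point is only that an undrained non-blocking fd eventually turns every further write into an error.
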
### `tests/discovery/test_registrar.py::test_stale_entry_is_deleted[subint]`

Original exemplar; see the **Symptom** and **Evidence**
sections above. One abandoned subint
(`transport_fails_actor`, stuck in `trio.sleep_forever()`
after self-cancelling its IPC server) is sufficient to
tip main into starvation once the harness's `daemon`
fixture subproc keeps its half of the registry IPC alive.

### `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep[subint-False]`

Cancels a grandchild that's in a sync Python sleep from 2
nurseries up. The test's own docstring declares the
dependency: "its parent should issue a 'zombie reaper' to
hard kill it after sufficient timeout", which for
`trio`/`mp_*` is an OS-level `SIGKILL` of the grandchild
subproc. **Under `subint` there's no equivalent** (no
public CPython API to force-destroy a running
sub-interpreter), so the grandchild's sync-sleeping
`trio.run()` persists inside its abandoned driver thread
indefinitely. The nested actor-tree (parent → child →
grandchild, all subints) means a single cancel triggers
multiple concurrent hard-kill abandonments, each leaving
a live driver thread.

This test often only manifests the starvation under
**full-suite runs** rather than solo execution:
earlier-in-session subint tests also leave abandoned
driver threads behind, and the combined population is
what actually tips main trio into starvation. Solo runs
may stay Ctrl-C-able with fewer abandoned threads in the
mix.

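The "bounded wait, then abandon" shape described above can be sketched with plain threads. This is an illustration under stated assumptions, not the project's `subint_proc` code: `bounded_hard_kill` is a hypothetical name, and a never-set `threading.Event` stands in for the grandchild's sync sleep.

```python
import threading


def bounded_hard_kill(thread: threading.Thread, timeout: float) -> bool:
    """Wait up to `timeout` seconds for `thread` to exit.

    Returns True when the thread is still running afterwards, i.e. it
    has to be abandoned because a thread cannot be force-killed.
    NOTE: hypothetical helper for illustration only.
    """
    thread.join(timeout)
    return thread.is_alive()


# Stand-in for a sync sleep that never yields to a cancel request.
stuck = threading.Event()
driver = threading.Thread(
    target=stuck.wait,
    name="subint-driver[demo]",
    daemon=True,  # lets the process exit even if the thread never does
)
driver.start()

abandoned = bounded_hard_kill(driver, timeout=0.1)
print(abandoned)  # True: the wait expired and the thread is still alive

# Demo-only cleanup; the real abandoned threads have no such release.
stuck.set()
driver.join()
```

With a subprocess the same timeout would end in a `SIGKILL`; here the only option left after the timeout is to walk away from a thread that keeps competing for the GIL.
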
### `tests/test_cancellation.py::test_multierror_fast_nursery[subint-25-0.5]`

Nursery-error-path throughput stress-test parametrized
for **25 concurrent subactors**. When the multierror
fires and the nursery cancels, every subactor goes
through our `subint_proc` teardown. The bounded
hard-kills run in parallel (all `subint_proc` tasks are
sibling trio tasks), so the timeout budget is ~3s total
rather than 3s × 25. After that, **25 abandoned
`daemon=True` driver threads are simultaneously alive**,
an extreme pressure multiplier on the same mechanism.

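The budget arithmetic can be checked with a small stand-in. The sketch below swaps the sibling trio tasks for a `ThreadPoolExecutor` (an assumption made so the example needs only the standard library) and scales the numbers down to 5 workers and a 0.2s timeout; `abandon_all` is a hypothetical name.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def abandon_all(threads, timeout):
    """Run one bounded wait per thread, all concurrently.

    Returns (number_still_alive, elapsed_seconds).
    NOTE: hypothetical helper; a thread pool stands in for trio tasks.
    """
    def wait_one(thread):
        thread.join(timeout)
        return thread.is_alive()

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(threads)) as pool:
        still_alive = list(pool.map(wait_one, threads))
    return sum(still_alive), time.monotonic() - start


release = threading.Event()  # never set while the waits are running
drivers = [
    threading.Thread(target=release.wait, name=f"subint-driver[{i}]", daemon=True)
    for i in range(5)
]
for t in drivers:
    t.start()

n_abandoned, elapsed = abandon_all(drivers, timeout=0.2)
print(f"{n_abandoned} abandoned after {elapsed:.2f}s")

# Demo-only cleanup.
release.set()
for t in drivers:
    t.join()
```

Because the waits overlap, the elapsed time tracks a single timeout rather than the sum, while every one of the workers is still alive when the waits expire.
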
The `strace` fingerprint is striking under this load: six
or more **successful** `write(16, "\2", 1) = 1` calls
(main trio getting brief GIL slices, each long enough to
drain exactly one wakeup-pipe byte) before finally
saturating with `EAGAIN`:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = 1
rt_sigreturn({mask=[WINCH]}) = 140141623162400
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 140141623162400
```

Those successful writes indicate CPython's
`sys.getswitchinterval()`-based GIL round-robin *is*
giving main brief slices, just never long enough to run
the Python-level signal handler through to the point
where trio converts the delivered SIGINT into a
`Cancelled` on the appropriate scope. Once the
accumulated write rate outpaces main's drain rate, the
pipe saturates and subsequent signals are silently
dropped.

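The `"\2"` payload in the trace is the signal number that CPython's C-level handler writes to whatever fd was registered with `signal.set_wakeup_fd()`. A minimal sketch of that mechanism follows, assuming it runs on the main thread of the main interpreter (a requirement of `set_wakeup_fd()`); `capture_wakeup_byte` is a hypothetical helper, and the socketpair is only a convenient non-blocking fd pair, not a claim about what fd 16 is in the trace.

```python
import signal
import socket


def capture_wakeup_byte() -> bytes:
    """Register a wakeup fd, raise SIGINT, return what was written to it.

    NOTE: hypothetical helper for illustration; must run on the main thread.
    """
    rsock, wsock = socket.socketpair()
    rsock.setblocking(False)
    wsock.setblocking(False)  # set_wakeup_fd() rejects blocking fds
    # No-op Python-level handler so the demo doesn't raise KeyboardInterrupt.
    old_handler = signal.signal(signal.SIGINT, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(wsock.fileno())
    try:
        # The C-level handler runs inside this call and writes the byte.
        signal.raise_signal(signal.SIGINT)
        return rsock.recv(16)
    finally:
        signal.set_wakeup_fd(old_fd)
        if old_handler is not None:
            signal.signal(signal.SIGINT, old_handler)
        rsock.close()
        wsock.close()


payload = capture_wakeup_byte()
print(payload)  # b'\x02' on Linux, where SIGINT is signal 2
```

The byte is written by the C handler at signal delivery, independently of when the Python-level handler gets to run, which is consistent with writes continuing to land while the main thread makes no progress on draining them.
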
The `pstree` below (pid `530060` = hung `pytest`) shows
the subint-driver thread population at the moment of
capture. Even with fewer than the full 25 shown (pstree
truncates thread names to `subint-driver[<interp_id>`;
interpreters `3` and `4` visible across 16 thread
entries), the GIL-contender count is more than enough to
explain the starvation:

```
>>> pstree -snapt 530060
systemd,1 --switched-root --system --deserialize=40
  └─login,1545 --
      └─bash,1872
          └─sway,2012
              └─alacritty,70471 -e xonsh
                  └─xonsh,70487 .../bin/xonsh
                      └─uv,70955 run xonsh
                          └─xonsh,70959 .../py314/bin/xonsh
                              └─python,530060 .../py314/bin/pytest -v tests/test_cancellation.py --spawn-backend=subint
                                  ├─{subint-driver[3},531857
                                  ├─{subint-driver[3},531860
                                  ├─{subint-driver[3},531862
                                  ├─{subint-driver[3},531866
                                  ├─{subint-driver[3},531877
                                  ├─{subint-driver[3},531882
                                  ├─{subint-driver[3},531884
                                  ├─{subint-driver[3},531945
                                  ├─{subint-driver[3},531950
                                  ├─{subint-driver[3},531952
                                  ├─{subint-driver[4},531956
                                  ├─{subint-driver[4},531959
                                  ├─{subint-driver[4},531961
                                  ├─{subint-driver[4},531965
                                  ├─{subint-driver[4},531968
                                  └─{subint-driver[4},531979
```

(`pstree` uses `{...}` to denote threads rather than
processes; these are all the **driver OS-threads** our
`subint_proc` creates with name
`f'subint-driver[{interp_id}]'`. Every one of them is
still alive, executing `_interpreters.exec()` inside a
sub-interpreter our hard-kill has abandoned. At 16+
abandoned driver threads competing for the main GIL, the
main-interpreter trio loop gets starved and signal
delivery stalls.)

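The truncated names in the capture follow from Linux limiting a thread's name to 15 visible characters. The sketch below reproduces the display form and shows an in-process way to count such threads; the helper names are hypothetical, and it assumes the Python-level thread name is what ends up as the OS thread name, as the capture suggests.

```python
import threading

# Linux TASK_COMM_LEN is 16 bytes including the NUL terminator.
COMM_VISIBLE_CHARS = 15


def driver_thread_name(interp_id: int) -> str:
    """Name format quoted in the text above."""
    return f"subint-driver[{interp_id}]"


def as_shown_by_pstree(name: str) -> str:
    """Approximate how `pstree` renders a thread with this name."""
    return "{" + name[:COMM_VISIBLE_CHARS] + "}"


def count_driver_threads() -> int:
    """Count live threads in *this* process carrying the driver prefix."""
    return sum(
        t.name.startswith("subint-driver[") for t in threading.enumerate()
    )


print(as_shown_by_pstree(driver_thread_name(3)))   # {subint-driver[3}
print(as_shown_by_pstree(driver_thread_name(12)))  # {subint-driver[1}
print(count_driver_threads())
```

One consequence worth keeping in mind when reading such captures: only the first digit of the interpreter id survives the truncation, so ids of two or more digits cannot be told apart from the thread name alone.
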