Add `subint_forkserver` PEP 684 audit-plan doc

Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.

Deats,
- three reasons enumerated for the current constraints:
  - class-A GIL-starvation — **fixed** by isolated mode:
    subints don't share main's GIL so abandoned-thread
    contention disappears
  - destroy race / tstate-recycling from `subint_proc` —
    **unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
    cross-mode, so isolated doesn't obviously fix it;
    needs empirical retest on py3.14 + isolated API
  - fork-from-main-interp-tstate (the CPython-level
    `_PyInterpreterState_DeleteExceptMain` gate) — the
    narrow reason for using a dedicated thread; **probably
    fixed** IF the destroy-race also resolves (bc trio's
    cache threads never drove subints → clean main-interp
    tstate)
- TL;DR table of which constraints unwind under each
  resolution branch
- four-step audit plan for when `msgspec#563` lands:
  - flip `_subint` to isolated mode
  - empirical destroy-race retest
  - audit `_subint_forkserver.py` — drop `non_trio`
    qualifier / maybe inline primitives
  - doc fallout — close the three `subint_*_issue.md`
    siblings w/ post-mortem notes

Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-22 18:18:30 -04:00
parent 25e400d526
commit cf2e71d87f
1 changed files with 184 additions and 0 deletions

@@ -0,0 +1,184 @@
# Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands
Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).

Context: why this exists
------------------------
The `tractor.spawn._subint_forkserver` submodule currently
carries two "non-trio" thread-hygiene constraints whose
necessity is tangled with issues that *should* dissolve
under PEP 684 isolated-mode subinterpreters:

1. `fork_from_worker_thread()` / `run_subint_in_worker_thread()`
   internally allocate a **dedicated `threading.Thread`**
   rather than using `trio.to_thread.run_sync()`.
2. The test helper is named
   `run_fork_in_non_trio_thread()`; the
   `non_trio` qualifier is load-bearing today.
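
To make the "dedicated thread" idea concrete, here is a minimal
stdlib-only sketch of the pattern, not the actual tractor
implementation. The helper name `run_in_dedicated_thread` is
hypothetical. The point is only that the thread is created for a
single call and never returned to a cache, so it cannot carry state
over from an earlier job. Note that this sketch blocks the caller in
`join()`, which a real trio-side wrapper would have to avoid.

```python
import threading
from typing import Any, Callable


def run_in_dedicated_thread(fn: Callable[..., Any], *args: Any) -> Any:
    """Run `fn(*args)` on a brand-new thread and return its result.

    Hypothetical helper: unlike a pooled/cached worker, the thread is
    created for this one call and is gone once `join()` returns.
    """
    outcome: dict[str, Any] = {}

    def _target() -> None:
        try:
            outcome["value"] = fn(*args)
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=_target, name="dedicated-worker")
    thread.start()
    thread.join()  # blocks the caller until the one-shot thread exits

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# The job really does run on a different thread than the caller.
main_ident = threading.get_ident()
worker_ident = run_in_dedicated_thread(threading.get_ident)
```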

This doc catalogs *why* those constraints exist, which of
them isolated-mode would fix, and what the
audit-and-cleanup path looks like once msgspec #563 is
resolved.

The three reasons the constraints exist
---------------------------------------
### 1. GIL-starvation class → fixed by PEP 684 isolated mode
The class-A hang documented in
`subint_sigint_starvation_issue.md` is entirely about
legacy-config subints **sharing the main GIL**. Once
msgspec #563 lands and tractor flips
`tractor.spawn._subint` to
`concurrent.interpreters.create()` (isolated config), each
subint gets its own GIL. Abandoned subint threads can't
contend for main's GIL → can't starve the main trio loop
→ signal-wakeup-pipe drains normally → no SIGINT-drop.
This class of hazard **dissolves entirely**. The
non-trio-thread requirement for *this reason* disappears.
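
For orientation, a minimal sketch of what "isolated config" looks like
through the public PEP 734 API, assuming Python 3.14+ where
`concurrent.interpreters` exists. The wrapper name
`exec_in_isolated_subint` is hypothetical and the sketch simply
reports `False` on older Pythons; it says nothing about whether
msgspec can be imported inside such an interpreter.

```python
try:
    # Public PEP 734 API; only present on Python 3.14+.
    from concurrent import interpreters
except ImportError:
    interpreters = None


def exec_in_isolated_subint(source: str) -> bool:
    """Run `source` in a fresh subint; return False if unsupported."""
    if interpreters is None:
        return False
    # `create()` builds an isolated interpreter with its own GIL.
    interp = interpreters.create()
    try:
        interp.exec(source)
    finally:
        # Always tear the interpreter down, even if `exec()` raised.
        interp.close()
    return True


ran = exec_in_isolated_subint("total = sum(range(10))")
```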
### 2. Destroy race / tstate-recycling → orthogonal; unclear
The `subint_proc` dedicated-thread fix (commit `26fb8206`)
addressed a different issue: `_interpreters.destroy(interp_id)`
was blocking on a trio-cache worker that had run an
earlier `interp.exec()` for that subint. Working
hypothesis at the time was "the cached thread retains the
subint's tstate".

But tstate-handling is **not specific to GIL mode**:
`_PyXI_Enter` / `_PyXI_Exit` (the C-level machinery both
configs use to enter/leave a subint from a thread) should
restore the caller's tstate regardless of GIL config. So
isolated mode **doesn't obviously fix this**. It might be:

- A py3.13 bug fixed in later versions: we saw the race
  first on 3.13 and never re-tested on 3.14 after moving
  to dedicated threads.
- A genuine CPython quirk around cached threads that
  exec'd into a subint, persisting across GIL modes.
- Something else we misdiagnosed: the empirical fix
  (dedicated thread) worked but the analysis may have
  been incomplete.

The only way to know: once we're on isolated mode, empirically
retry `trio.to_thread.run_sync(interp.exec, ...)` and see
if `destroy()` still blocks. If it does, keep the
dedicated thread; if not, one constraint is relaxed.
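
One way to turn "see if `destroy()` still blocks" into a yes/no answer
is a small watchdog, sketched below with the stdlib only. The helper
name `blocks_longer_than` and the timeout values are assumptions for
illustration. In a real retest the callable would wrap the
destroy/close call under suspicion, and moving that call onto a
watchdog thread may itself change the behavior being measured, so
treat this as a probe rather than a verdict.

```python
import threading
import time
from typing import Callable


def blocks_longer_than(fn: Callable[[], object], timeout: float) -> bool:
    """Return True if `fn()` has not finished within `timeout` seconds."""
    done = threading.Event()

    def _target() -> None:
        try:
            fn()
        finally:
            done.set()

    # daemon=True so a truly wedged call cannot keep the process alive.
    threading.Thread(target=_target, daemon=True).start()
    # `Event.wait()` returns False when the timeout expires first.
    return not done.wait(timeout)


# Stand-ins for a teardown call that returns promptly vs one that hangs.
quick = blocks_longer_than(lambda: None, timeout=1.0)
slow = blocks_longer_than(lambda: time.sleep(0.5), timeout=0.05)
```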
### 3. Fork-from-main-interp-tstate (the constraint in this module's helper names)
The fork-from-main-interp-tstate invariant (CPython's
`PyOS_AfterFork_Child` →
`_PyInterpreterState_DeleteExceptMain` gate, documented in
`subint_fork_blocked_by_cpython_post_fork_issue.md`) is
about the calling thread's **current** tstate at the
moment `os.fork()` runs. If trio's cache threads never
enter subints at all, their tstate is plain main-interp,
and forking from them would be fine.
The reason the smoke test +
`run_fork_in_non_trio_thread` test helper
currently use a dedicated `threading.Thread` is narrow:
**we don't want to risk a trio cache thread that has
previously been used as a subint driver being the one that
picks up the fork job**. If cached tstate doesn't get
cleared (back to reason #2), the fork's child-side
post-init would see the wrong interp and abort.
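
A minimal sketch of the "fork from a fresh, never-reused thread"
shape, assuming a POSIX platform with `os.fork()`. The function name
`fork_in_dedicated_thread` is hypothetical and this is not the code in
`_subint_forkserver.py`. The child exits immediately via `os._exit()`,
so the sketch only demonstrates that a thread which has never entered
a subint can fork and reap a child cleanly.

```python
import os
import threading
import warnings


def fork_in_dedicated_thread(child_exit_code: int) -> int:
    """Fork from a fresh thread; return the child's exit code (POSIX only)."""
    result: dict[str, int] = {}

    def _target() -> None:
        with warnings.catch_warnings():
            # Python 3.12+ warns when forking a multi-threaded process.
            warnings.simplefilter("ignore", DeprecationWarning)
            pid = os.fork()
        if pid == 0:
            # Child: leave immediately, skipping atexit handlers and
            # any cleanup of state inherited from the parent.
            os._exit(child_exit_code)
        # Parent side: reap the child from the same one-shot thread.
        _, status = os.waitpid(pid, 0)
        result["code"] = os.waitstatus_to_exitcode(status)

    thread = threading.Thread(target=_target, name="fork-worker")
    thread.start()
    thread.join()
    return result["code"]


exit_code = fork_in_dedicated_thread(7)
```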
In an isolated-mode world where msgspec works:
- `subint_proc` would use the public
`concurrent.interpreters.create()` + `Interpreter.exec()`
/ `Interpreter.close()` — which *should* handle tstate
cleanly (they're the "blessed" API).
- If so, trio's cache threads are safe to fork from
regardless of whether they've previously driven subints.
- → the `non_trio` qualifier in
  `run_fork_in_non_trio_thread` becomes
  *overcautious* rather than load-bearing, and the
  dedicated-thread primitives in `_subint_forkserver.py`
  can likely be replaced with straight
  `trio.to_thread.run_sync()` wrappers.

TL;DR
-----

| constraint | fixed by isolated mode? |
|---|---|
| GIL-starvation (class A) | **yes** |
| destroy race on cached worker | unclear — empirical test on py3.14 + isolated API required |
| fork-from-main-tstate requirement on worker | **probably yes, conditional on the destroy-race question above** |

If the destroy race (reason #2) also resolves on py3.14+
with isolated mode, tractor could drop the `non_trio`
qualifier from the fork helper's name and just use
`trio.to_thread.run_sync(...)` for everything. But
**we shouldn't do that preemptively**: the current
cautious design is cheap (one dedicated thread per fork /
per subint-exec) and correct.

Audit plan when msgspec #563 lands
----------------------------------
Assuming msgspec grows `Py_mod_multiple_interpreters`
support:
1. **Flip `tractor.spawn._subint` to isolated mode.** Drop
the `_interpreters.create('legacy')` call in favor of
the public API (`concurrent.interpreters.create()` +
`Interpreter.exec()` / `Interpreter.close()`). Run the
three `ai/conc-anal/subint_*_issue.md` reproducers —
class-A (`test_stale_entry_is_deleted` etc.) should
pass without the `skipon_spawn_backend('subint')` marks
(revisit the marker inventory).
2. **Empirical destroy-race retest.** In `subint_proc`,
swap the dedicated `threading.Thread` back to
`trio.to_thread.run_sync(Interpreter.exec, ...,
abandon_on_cancel=False)` and run the full subint test
suite. If `Interpreter.close()` (or the backing
destroy) blocks the same way as the legacy version
did, revert and keep the dedicated thread.
3. **If step 2 comes back clean**, audit `_subint_forkserver.py`:
   - Rename `run_fork_in_non_trio_thread` to drop the
     `_non_trio_` qualifier (e.g. `run_fork_in_thread`), or
     inline the two-line `trio.to_thread.run_sync` call at
     the call sites and drop the helper entirely.
   - Consider whether `fork_from_worker_thread` +
     `run_subint_in_worker_thread` still warrant being
     separate module-level primitives or whether they
     collapse into a compound
     `trio.to_thread.run_sync`-driven pattern inside the
     (future) `subint_forkserver_proc` backend.
4. **Doc fallout.** `subint_sigint_starvation_issue.md`
   and `subint_cancel_delivery_hang_issue.md` both cite
   the legacy-GIL-sharing architecture as the root cause.
   Close them with commit-refs to the isolated-mode
   migration. This doc itself should get a closing
   post-mortem section noting which of reasons #1/#2/#3
   actually resolved vs persisted.

References
----------
- `tractor.spawn._subint_forkserver`: the in-tree module
  whose constraints this doc catalogs.
- `ai/conc-anal/subint_sigint_starvation_issue.md`: the
  GIL-starvation class.
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`:
  sibling Ctrl-C-able hang class.
- `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`:
  why fork-from-subint is blocked (this drives the
  forkserver-via-non-subint-thread workaround).
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`:
  empirical validation for the workaround.
- [PEP 684: per-interpreter GIL](https://peps.python.org/pep-0684/)
- [PEP 734: `concurrent.interpreters` public API](https://peps.python.org/pep-0734/)
- [jcrist/msgspec#563: PEP 684 support tracker](https://github.com/jcrist/msgspec/issues/563)
- tractor issue #379: subint backend tracking.