Compare commits

10 Commits

Author SHA1 Message Date
Gud Boi 66dda9e449 Reset post-fork `_state` in forkserver child
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode
"this process is the root actor" — `_runtime_vars`'s
`_is_root=True`, pre-populated `_root_mailbox` +
`_registry_addrs`, and the parent's `_current_actor`
singleton.

A fresh `exec`-based child starts with those globals at
their module-level defaults (all falsey/empty). The
forkserver child needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes
the `is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.

Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with
a cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.

Found-via: `test_subint_forkserver_spawn_basic` hitting
the `enable_modules` assert on child-side runtime boot.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 19:01:27 -04:00
Gud Boi 43bd6a6410 Wire `subint_forkserver` as first-class backend
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.

Deats,
- new `subint_forkserver_proc()` spawn target in
  `_subint_forkserver`:
  - mirrors `trio_proc()`'s supervision model — real OS
    subprocess so `Portal.cancel_actor()` + `soft_kill()`
    on graceful teardown, `os.kill(SIGKILL)` on hard-reap
    (no `_interpreters.destroy()` race to fuss over bc the
    child lives in its own process)
  - only real diff from `trio_proc` is the spawn mechanism:
    fork from a main-interp worker thread via
    `fork_from_worker_thread()` (off-loaded to trio's
    thread pool) instead of `trio.lowlevel.open_process()`
  - child-side `_child_target` closure runs
    `tractor._child._actor_child_main()` with
    `spawn_method='trio'` — the child is a regular trio
    actor, "subint_forkserver" names how the parent
    spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
  shim around a raw OS pid: `.poll()` via
  `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
  cache thread, `.kill()` via `SIGKILL`, `.returncode`
  cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
  are `None` (fork-w/o-exec inherits parent FDs; we don't
  marshal them) which matches `soft_kill()`'s `is not None`
  guards

Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.

Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:49:23 -04:00
Gud Boi 5bd5f957d3 Add `subint_forkserver` PEP 684 audit-plan doc
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.

Deats,
- three reasons enumerated for the current constraints:
  - class-A GIL-starvation — **fixed** by isolated mode:
    subints don't share main's GIL so abandoned-thread
    contention disappears
  - destroy race / tstate-recycling from `subint_proc` —
    **unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
    cross-mode, so isolated doesn't obviously fix it;
    needs empirical retest on py3.14 + isolated API
  - fork-from-main-interp-tstate (the CPython-level
    `_PyInterpreterState_DeleteExceptMain` gate) — the
    narrow reason for using a dedicated thread; **probably
    fixed** IF the destroy-race also resolves (bc trio's
    cache threads never drove subints → clean main-interp
    tstate)
- TL;DR table of which constraints unwind under each
  resolution branch
- four-step audit plan for when `msgspec#563` lands:
  - flip `_subint` to isolated mode
  - empirical destroy-race retest
  - audit `_subint_forkserver.py` — drop `non_trio`
    qualifier / maybe inline primitives
  - doc fallout — close the three `subint_*_issue.md`
    siblings w/ post-mortem notes

Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:18:30 -04:00
Gud Boi 1d4867e51c Add trio-parent tests for `_subint_forkserver`
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.

Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
  for the GIL-hostage safety reason doc'd in
  `ai/conc-anal/subint_sigint_starvation_issue.md`:
  - `test_fork_from_worker_thread_via_trio` — parent-side
    plumbing baseline. `trio.run()` off-loads forkserver
    prims via `trio.to_thread.run_sync()` + asserts the
    child reaps cleanly
  - `test_fork_and_run_trio_in_child` — end-to-end: forked
    child calls `run_subint_in_worker_thread()` with a
    bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
  `dump_on_hang()` for post-mortem if the outer
  `pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
  drive the primitives directly rather than going through
  tractor's spawn-method registry (which the forkserver
  isn't plugged into yet)

Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.

Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:00:06 -04:00
Gud Boi 372a0f3247 Lift fork prims into `_subint_forkserver` mod
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.

Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
  - `fork_from_worker_thread(child_target, thread_name)` —
    spawn a main-interp `threading.Thread`, call `os.fork()`
    from it, shuttle the child pid back to main via a pipe
  - `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
    create a fresh subint + drive `_interpreters.exec()` on
    a dedicated worker thread running the `bootstrap` str
    (typically imports `trio`, defines an async entry, calls
    `trio.run()`)
  - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
    pass/fail classification reusable from harness AND the
    eventual real spawn path
- feature-gated py3.14+ via the public
  `concurrent.interpreters` presence check; matches the gate
  in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
  (cross-refs `_subint_fork` stub + the two `conc-anal/`
  docs) and status: EXPERIMENTAL, not yet registered in
  `_spawn._methods`

Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).

Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 17:03:15 -04:00
Gud Boi 37cfaa202b Add CPython-level `subint_fork` workaround smoketest
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.

Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.

Deats,
- four scenarios via `--scenario`:
  - `control_subint_thread_fork` — the KNOWN-BROKEN case as a
    harness sanity; if the child DOESN'T abort, our analysis
    is wrong
  - `main_thread_fork` — baseline sanity, must always succeed
  - `worker_thread_fork` — architectural assertion: regular
    `threading.Thread` attached to main interp calls
    `os.fork()`; child should survive post-fork cleanup
  - `full_architecture` — end-to-end: fork from a main-interp
    worker thread, then in child create a subint driving a
    worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
  "child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
  `os.waitpid()` of the parent + per-scenario status prints to
  observe the child's fate

Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).

Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 16:40:52 -04:00
Gud Boi 797f57ce7b Doc `subint_fork` as blocked by CPython post-fork
Empirical finding: the WIP `subint_fork_proc` scaffold
landed in `cf0e3e6f` does *not* work on current CPython.
The `fork()` syscall succeeds in the parent, but the
CHILD aborts immediately during
`PyOS_AfterFork_Child()` →
`_PyInterpreterState_DeleteExceptMain()`, which gates
on the current tstate belonging to the main interp —
the child dies with `Fatal Python error: not main
interpreter`.

CPython devs acknowledge the fragility with an in-source
comment (`// Ideally we could guarantee tstate is running
main.`) but expose no user-facing hook to satisfy the
precondition — so the strategy is structurally dead until
upstream changes.

Rather than delete the scaffold, reshape it into a
documented dead-end so the next person with this idea
lands on the reason rather than rediscovering the same
CPython-level refusal.

Deats,
- Move `subint_fork_proc` out of `tractor.spawn._subint`
  into a new `tractor.spawn._subint_fork` dedicated
  module (153 LOC). Module + fn docstrings now describe
  the blockage directly; the fn body is trimmed to a
  `NotImplementedError` pointing at the analysis doc —
  no more dead-code `bootstrap` sketch bloating
  `_subint.py`.
- `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey`
  + the `_methods` dispatch so
  `--spawn-backend=subint_fork` routes to a clean
  `NotImplementedError` rather than "invalid backend";
  comment calls out the blockage. Collapse the duplicate
  py3.14 feature-gate in `try_set_start_method()` into a
  combined `case 'subint' | 'subint_fork':` arm.
- New 337-line analysis:
  `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
  Annotated walkthrough from the user-visible fatal
  error down to the specific `Modules/posixmodule.c` +
  `Python/pystate.c` source lines enforcing the refusal,
  plus an upstream-report draft.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 16:02:01 -04:00
Gud Boi cf0e3e6f8b Add WIP `subint_fork_proc` backend scaffold
Experimental third spawn backend: use a fresh
sub-interpreter purely as a trio-free launchpad from
which to `os.fork()` + exec back into
`python -m tractor._child`. Per issue #379's
"fork()-workaround/hacks" thread.

Intent is to sidestep both,
- the trio+fork hazards hitting `trio_proc` (python-trio/trio#1614 et
  al.), since the forking interp is guaranteed trio-free.

- the shared-GIL abandoned-thread hazards hitting `subint_proc`
  (`ai/conc-anal/subint_sigint_starvation_issue.md`), since we don't
  *stay* in the subint — it only lives long enough to call `os.fork()`

Downstream of the fork+exec, all the existing `trio_proc` plumbing is
reused verbatim: `ipc_server.wait_for_peer()`, `SpawnSpec`, `Portal`
yield, soft-kill.

Status: NOT wired up beyond scaffolding. The fn raises
`NotImplementedError` immediately; the `bootstrap` fork/exec string
builder and the `# TODO: orchestrate driver thread` block are kept
in-tree as deliberate dead code so the next iteration starts from
a concrete shape rather than a blank page.

Docstring calls out three open questions that need
empirical validation before wiring this up:
1. Does CPython permit `os.fork()` from a non-main
   legacy subint?
2. Can the child stay fork-without-exec and
   `trio.run()` directly from within the launchpad
   subint?
3. How do `signal.set_wakeup_fd()` handlers and other
   process-global state interact when the forking
   thread is inside a subint?

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 13:32:39 -04:00
Gud Boi 99d70337b7 Mark `subint`-hanging tests with `skipon_spawn_backend`
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-21 21:33:15 -04:00
Gud Boi a617b52140 Add `skipon_spawn_backend` pytest marker
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.

Deats,
- `pytest_configure()` registers the marker via
  `addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
  each collected item with `item.iter_markers(
  name='skipon_spawn_backend')`, checks whether the
  active `--spawn-backend` appears in `mark.args`, and
  if so injects a concrete `pytest.mark.skip(
  reason=...)`. `iter_markers()` makes the decorator
  work at function, class, or module (`pytestmark =
  [...]`) scope transparently.
- First matching mark wins; default reason is
  `f'Borked on --spawn-backend={backend!r}'` if the
  caller doesn't supply one.

Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-21 21:24:51 -04:00
20 changed files with 2789 additions and 49 deletions

@ -0,0 +1,337 @@
# `os.fork()` from a non-main sub-interpreter aborts the child (CPython refuses post-fork cleanup)
Third `subint`-class analysis in this project. Unlike its
two siblings (`subint_sigint_starvation_issue.md`,
`subint_cancel_delivery_hang_issue.md`), this one is not a
hang — it's a **hard CPython-level refusal** of an
experimental spawn strategy we wanted to try.
## TL;DR
An in-process sub-interpreter cannot be used as a
"launchpad" for `os.fork()` on current CPython. The fork
syscall succeeds in the parent, but the forked CHILD
process is aborted immediately by CPython's post-fork
cleanup with:
```
Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
```
This is enforced by a hard `_PyStatus_ERR` gate in
`Python/pystate.c`. The CPython devs acknowledge the
fragility with an in-source comment (`// Ideally we could
guarantee tstate is running main.`) but provide no
mechanism to satisfy the precondition from user code.
**Implication for tractor**: the `subint_fork` backend
sketched in `tractor.spawn._subint_fork` is structurally
dead on current CPython. The submodule is kept as
documentation of the attempt; `--spawn-backend=subint_fork`
raises `NotImplementedError` pointing here.
## Context — why we tried this
The motivation is issue #379's "Our own thoughts, ideas
for `fork()`-workaround/hacks..." section. The existing
trio-backend (`tractor.spawn._trio.trio_proc`) spawns
subactors via `trio.lowlevel.open_process()` → ultimately
`posix_spawn()` or `fork+exec`, from the parent's main
interpreter that is currently running `trio.run()`. This
brushes against a known-fragile interaction between
`trio` and `fork()` tracked in
[python-trio/trio#1614](https://github.com/python-trio/trio/issues/1614)
and siblings — mostly mitigated in `tractor`'s case only
incidentally (we `exec()` immediately post-fork).
The idea was:
1. Create a subint that has *never* imported `trio`.
2. From a worker thread in that subint, call `os.fork()`.
3. In the child, `execv()` back into
`python -m tractor._child` — same as `trio_proc` does.
4. The fork is from a trio-free context → trio+fork
hazards avoided regardless of downstream behavior.
The parent-side orchestration (`ipc_server.wait_for_peer`,
`SpawnSpec`, `Portal` yield) would reuse
`trio_proc`'s flow verbatim, with only the subproc-spawn
mechanics swapped.
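
For reference, the fork-then-`execv()` half of the idea (step 3) can
be sketched as below. This runs from the main thread of the main
interpreter, so it does not exercise the subint launchpad at all, and
a `-c` one-liner stands in for `python -m tractor._child`. The name
`fork_exec_python()` is hypothetical.

```python
import os
import sys


def fork_exec_python(argv: list[str]) -> int:
    '''
    Fork, then immediately `execv()` a fresh interpreter in the
    child. Returns the child pid to the parent.

    '''
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, *argv])
        finally:
            # only reached if `execv()` itself failed
            os._exit(127)
    return pid


pid = fork_exec_python(['-c', 'import sys; sys.exit(7)'])
_, status = os.waitpid(pid, 0)
exit_code = os.waitstatus_to_exitcode(status)
```

Because `execv()` replaces the whole process image, nothing the
forking context had imported survives into the new interpreter.
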
## Symptom
Running the prototype (`tractor.spawn._subint_fork.subint_fork_proc`,
see git history prior to the stub revert) on py3.14:
```
Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
Python runtime state: initialized
Current thread 0x00007f6b71a456c0 [subint-fork-lau] (most recent call first):
File "<script>", line 2 in <module>
<script>:2: DeprecationWarning: This process (pid=802985) is multi-threaded, use of fork() may lead to deadlocks in the child.
```
Key clues:
- The **`DeprecationWarning`** fires in the parent (before
fork completes) — fork *is* executing, we get that far.
- The **`Fatal Python error`** comes from the child — it
aborts during CPython's post-fork C initialization
before any user Python runs in the child.
- The thread name `subint-fork-lau[nchpad]` is ours —
confirms the fork is being called from the launchpad
subint's driver thread.
## CPython source walkthrough
### Call site — `Modules/posixmodule.c:728-793`
The post-fork-child hook CPython runs in the child process:
```c
void
PyOS_AfterFork_Child(void)
{
    PyStatus status;
    _PyRuntimeState *runtime = &_PyRuntime;

    // re-creates runtime->interpreters.mutex (HEAD_UNLOCK)
    status = _PyRuntimeState_ReInitThreads(runtime);
    ...
    PyThreadState *tstate = _PyThreadState_GET();
    _Py_EnsureTstateNotNULL(tstate);
    ...
    // Ideally we could guarantee tstate is running main.  ← !!!
    _PyInterpreterState_ReinitRunningMain(tstate);

    status = _PyEval_ReInitThreads(tstate);
    ...
    status = _PyInterpreterState_DeleteExceptMain(runtime);
    if (_PyStatus_EXCEPTION(status)) {
        goto fatal_error;
    }
    ...

fatal_error:
    Py_ExitStatusException(status);
}
```
The `// Ideally we could guarantee tstate is running
main.` comment is a flashing warning sign — the CPython
devs *know* this path is fragile when fork is called from
a non-main subint, but they've chosen to abort rather than
silently corrupt state. Arguably the right call.
### The refusal — `Python/pystate.c:1035-1075`
```c
/*
 * Delete all interpreter states except the main interpreter.  If there
 * is a current interpreter state, it *must* be the main interpreter.
 */
PyStatus
_PyInterpreterState_DeleteExceptMain(_PyRuntimeState *runtime)
{
    struct pyinterpreters *interpreters = &runtime->interpreters;

    PyThreadState *tstate = _PyThreadState_Swap(runtime, NULL);
    if (tstate != NULL && tstate->interp != interpreters->main) {
        return _PyStatus_ERR("not main interpreter");  // <- our error
    }

    HEAD_LOCK(runtime);
    PyInterpreterState *interp = interpreters->head;
    interpreters->head = NULL;
    while (interp != NULL) {
        if (interp == interpreters->main) {
            interpreters->main->next = NULL;
            interpreters->head = interp;
            interp = interp->next;
            continue;
        }

        // XXX Won't this fail since PyInterpreterState_Clear() requires
        // the "current" tstate to be set?
        PyInterpreterState_Clear(interp);  // XXX must activate?
        zapthreads(interp);
        ...
    }
    ...
}
```
The block comment above the function (`If there is a
current interpreter state, it *must* be the main
interpreter.`) is the formal API contract. The `XXX`
comments further in suggest the CPython team is already
aware this function has latent issues even in the happy path.
## Chain summary
1. Our launchpad subint's driver OS-thread calls
`os.fork()`.
2. `fork()` succeeds. Child wakes up with:
- The parent's full memory image (including all
subints).
- Only the *calling* thread alive (the driver thread).
- `_PyThreadState_GET()` on that thread returns the
**launchpad subint's tstate**, *not* main's.
3. CPython runs `PyOS_AfterFork_Child()`.
4. It reaches `_PyInterpreterState_DeleteExceptMain()`.
5. Gate check fails: `tstate->interp != interpreters->main`.
6. `_PyStatus_ERR("not main interpreter")` → `fatal_error`
goto → `Py_ExitStatusException()` → child aborts.
Parent-side consequence: `os.fork()` in the subint
bootstrap returned successfully with the child's PID, but
the child died before connecting back. Our parent's
`ipc_server.wait_for_peer(uid)` would hang forever — the
child never gets to `_actor_child_main`.
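
Since the parent only learns the child's fate by reaping it, a small
classification helper along the following lines is one way to avoid
waiting on a peer that will never connect. This is a sketch:
`classify_child()` and `spawn_and_classify()` are hypothetical names,
and the child's death is simulated with `os._exit()` rather than the
real post-fork failure.

```python
import os


def classify_child(pid: int) -> tuple[bool, str]:
    '''
    Block until `pid` terminates and report `(exited_ok, detail)`.

    '''
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return False, f'killed by signal {os.WTERMSIG(status)}'

    rc: int = os.WEXITSTATUS(status)
    return rc == 0, f'exited rc={rc}'


def spawn_and_classify(child_rc: int) -> tuple[bool, str]:
    pid = os.fork()
    if pid == 0:
        # stand-in for whatever happens to the real child
        os._exit(child_rc)
    return classify_child(pid)
```

Both a signal death and a nonzero exit are treated as abnormal, so
the helper does not depend on exactly how the interpreter terminates
the child.
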
## Definitive answer to "Open Question 1"
From the (now-stub) `subint_fork_proc` docstring:
> Does CPython allow `os.fork()` from a non-main
> sub-interpreter under the legacy config?
**No.** Not in a usable-by-user-code sense. The fork
syscall is not blocked, but the child cannot survive
CPython's post-fork initialization. This is enforced, not
accidental, and the CPython devs have acknowledged the
fragility in-source.
## What we'd need from CPython to unblock
Any one of these, from least-to-most invasive:
1. **A pre-fork hook mechanism** that lets user code (or
tractor itself via `os.register_at_fork(before=...)`)
swap the current tstate to main before fork runs. The
swap would need to work across the subint→main
boundary, which is the actual hard part —
`_PyThreadState_Swap()` exists but is internal.
2. **A `_PyInterpreterState_DeleteExceptFor(tstate->interp)`
variant** that cleans up all *other* subints while
preserving the calling subint's state. Lets the child
continue executing in the subint after fork; a
subsequent `execv()` clears everything at the OS
level anyway.
3. **A cleaner error** than `Fatal Python error` aborting
the child. Even without fixing the underlying
capability, a raised Python-level exception in the
parent's `fork()` call (rather than a silent child
abort) would at least make the failure mode
debuggable.
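
To make option 1 concrete, the sketch below shows the hook surface
that exists today. `os.register_at_fork()` callbacks run ordinary
Python code around the fork; as far as we can tell nothing available
to them can switch the calling thread's tstate to the main
interpreter, which is the missing piece. The function name
`demo_at_fork_hooks()` is hypothetical.

```python
import os


def demo_at_fork_hooks() -> list[str]:
    '''
    Register at-fork callbacks, fork once, and return the events
    observed in the parent.

    NOTE: at-fork callbacks cannot be unregistered; they stay
    installed for the life of the process.

    '''
    events: list[str] = []
    os.register_at_fork(
        before=lambda: events.append('before'),
        after_in_parent=lambda: events.append('after_in_parent'),
        # runs in the child only; the parent's list never sees it
        after_in_child=lambda: events.append('after_in_child'),
    )
    pid = os.fork()
    if pid == 0:
        os._exit(0)

    os.waitpid(pid, 0)
    return events


events = demo_at_fork_hooks()
```
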
## Upstream-report draft (for CPython issue tracker)
### Title
> `os.fork()` from a non-main sub-interpreter aborts the
> child with a fatal error in `PyOS_AfterFork_Child`; can
> we at least make it a clean `RuntimeError` in the
> parent?
### Body
> **Version**: Python 3.14.x
>
> **Summary**: Calling `os.fork()` from a thread currently
> executing inside a sub-interpreter causes the forked
> child process to abort during CPython's post-fork
> cleanup, with the following output in the child:
>
> ```
> Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
> ```
>
> From the **parent's** point of view the fork succeeded
> (returned a valid child PID). The failure is completely
> opaque to parent-side Python code — unless the parent
> does `os.waitpid()` it won't even notice the child
> died.
>
> **Root cause** (as I understand it from reading sources):
> `Modules/posixmodule.c::PyOS_AfterFork_Child()` calls
> `_PyInterpreterState_DeleteExceptMain()` with a
> precondition that `_PyThreadState_GET()->interp` be the
> main interpreter. When `fork()` is called from a thread
> executing inside a subinterpreter, the child wakes up
> with its tstate still pointing at the subint, and the
> gate in `Python/pystate.c:1044-1047` fails.
>
> A comment in the source
> (`Modules/posixmodule.c:753` — `// Ideally we could
> guarantee tstate is running main.`) suggests this is a
> known-fragile path rather than an intentional
> invariant.
>
> **Use case**: I was experimenting with using a
> sub-interpreter as a "fork launchpad" — have a subint
> that has never imported `trio`, call `os.fork()` from
> that subint's thread, and in the child `execv()` back
> into a fresh Python interpreter process. The goal was
> to sidestep known issues with `trio` + `fork()`
> interaction (see
> [python-trio/trio#1614](https://github.com/python-trio/trio/issues/1614))
> by guaranteeing the forking context had never been
> "contaminated" by trio's imports or globals. This
> approach would allow `trio`-using applications to
> combine `fork`-based subprocess spawning with
> per-worker `trio.run()` runtimes — a fairly common
> pattern that currently requires workarounds.
>
> **Request**:
>
> Ideally: make fork-from-subint work (e.g., by swapping
> the caller's tstate to main in the pre-fork hook), or
> provide a `_PyInterpreterState_DeleteExceptFor(interp)`
> variant that permits the caller's subint to survive
> post-fork so user code can subsequently `execv()`.
>
> Minimally: convert the fatal child-side abort into a
> clean `RuntimeError` (or similar) raised in the
> parent's `fork()` call. Even if the capability isn't
> expanded, the failure mode should be debuggable by
> user-code in the parent — right now it's a silent
> child death with an error message buried in the
> child's stderr that parent code can't programmatically
> see.
>
> **Related**: PEP 684 (per-interpreter GIL), PEP 734
> (`concurrent.interpreters` public API). I created the
> launchpad with the private `_interpreters` module, via
> `_interpreters.create('legacy')`. I have not tested
> `concurrent.interpreters.create()`, though the post-fork
> gate it would hit is the same code path.
>
> Happy to contribute a minimal reproducer + test case if
> this is something the team wants to pursue.
## References
- `Modules/posixmodule.c:728`
[`PyOS_AfterFork_Child`](https://github.com/python/cpython/blob/main/Modules/posixmodule.c#L728)
- `Python/pystate.c:1040`
[`_PyInterpreterState_DeleteExceptMain`](https://github.com/python/cpython/blob/main/Python/pystate.c#L1040)
- PEP 684 (per-interpreter GIL):
<https://peps.python.org/pep-0684/>
- PEP 734 (`concurrent.interpreters` public API):
<https://peps.python.org/pep-0734/>
- [python-trio/trio#1614](https://github.com/python-trio/trio/issues/1614)
— the original motivation for the launchpad idea.
- tractor issue #379 — "Our own thoughts, ideas for
`fork()`-workaround/hacks..." section where this was
first sketched.
- `tractor.spawn._subint_fork` — in-tree stub preserving
the attempted impl's shape in git history.

@ -0,0 +1,373 @@
#!/usr/bin/env python3
'''
Standalone CPython-level feasibility check for the "main-interp
worker-thread forkserver + subint-hosted trio" architecture
proposed as a workaround to the CPython-level refusal
documented in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Purpose
-------
Deliberately NOT a `tractor` test. Zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
the signal is unambiguous: pass/fail here is a property of
CPython alone, independent of our runtime.
Run each scenario in isolation; the child's fate is observable
only via `os.waitpid()` of the parent and the scenario's own
status prints.
Scenarios (pick one with `--scenario <name>`)
---------------------------------------------
- `control_subint_thread_fork`: the KNOWN-BROKEN case we
  documented in `subint_fork_blocked_by_cpython_post_fork_issue.md`;
  drive a subint from a thread, call `os.fork()` inside its
  `_interpreters.exec()`, watch the child abort. **Included as
  a control**: if this scenario DOESN'T abort the child, our
  analysis is wrong and we should re-check everything.
- `main_thread_fork`: baseline sanity. Call `os.fork()` from
  the process's main thread. Must always succeed; if this
  fails something much bigger is broken.
- `worker_thread_fork`: the architectural assertion. Spawn a
  regular `threading.Thread` (attached to main interp, NOT a
  subint), have IT call `os.fork()`. Child should survive
  post-fork cleanup.
- `full_architecture`: end-to-end; main-interp worker thread
  forks. In the child, fork-thread (still main-interp) creates
  a subint, drives a second worker thread inside it that runs
  a trivial `trio.run()`. Validates the "root runtime lives in
  a subint in the child" piece of the proposed arch.
All scenarios print a self-contained pass/fail banner. Exit
code 0 on expected outcome (which for `control_*` means "child
aborted", not "child succeeded"!).
Requires Python 3.14+.
Usage
-----
::
python subint_fork_from_main_thread_smoketest.py \\
--scenario main_thread_fork
python subint_fork_from_main_thread_smoketest.py \\
--scenario full_architecture
'''
from __future__ import annotations
import argparse
import os
import sys
import threading
import time
from typing import Callable


# Hard-require py3.14 for the public `concurrent.interpreters`
# API (we still drop to `_interpreters` internally, same as
# `tractor.spawn._subint`).
try:
    from concurrent import interpreters as _public_interpreters  # noqa: F401
    import _interpreters  # type: ignore
except ImportError:
    print(
        'FAIL (setup): requires Python 3.14+ '
        '(missing `concurrent.interpreters`)',
        file=sys.stderr,
    )
    sys.exit(2)


# The actual primitives this script exercises live in
# `tractor.spawn._subint_forkserver`; we re-import them here
# rather than inlining so the module and the validation stay
# in sync. (Early versions of this file had them inline for
# the "zero tractor imports" isolation guarantee; now that
# CPython-level feasibility is confirmed, the validated
# primitives have moved into tractor proper.)
from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread,
    run_subint_in_worker_thread,
    wait_child,
)


# ----------------------------------------------------------------
# small observability helpers (test-harness only)
# ----------------------------------------------------------------


def _banner(title: str) -> None:
    line = '=' * 60
    print(f'\n{line}\n{title}\n{line}', flush=True)


def _report(
    label: str,
    *,
    ok: bool,
    status_str: str,
    expect_exit_ok: bool,
) -> None:
    verdict: str = 'PASS' if ok else 'FAIL'
    expected_str: str = (
        'normal exit (rc=0)'
        if expect_exit_ok
        else 'abnormal death (signal or nonzero exit)'
    )
    print(
        f'[{verdict}] {label}: '
        f'expected {expected_str}; observed {status_str}',
        flush=True,
    )


# ----------------------------------------------------------------
# scenario: `control_subint_thread_fork` (known-broken)
# ----------------------------------------------------------------


def scenario_control_subint_thread_fork() -> int:
    _banner(
        '[control] fork from INSIDE a subint (expected: child aborts)'
    )
    interp_id = _interpreters.create('legacy')
    print(f' created subint {interp_id}', flush=True)

    # Shared flag: child writes a sentinel file we can detect from
    # the parent. If the child manages to write this, CPython's
    # post-fork refusal is NOT happening → analysis is wrong.
    sentinel = '/tmp/subint_fork_smoketest_control_child_ran'
    try:
        os.unlink(sentinel)
    except FileNotFoundError:
        pass

    bootstrap = (
        'import os\n'
        'pid = os.fork()\n'
        'if pid == 0:\n'
        # child: if CPython's refusal fires this code never runs
        f'    with open({sentinel!r}, "w") as f:\n'
        '        f.write("ran")\n'
        '    os._exit(0)\n'
        'else:\n'
        # parent side (inside the launchpad subint): stash the
        # forked PID on a shareable dict so we can waitpid()
        # from the outer main interp. We can't just return it;
        # _interpreters.exec() returns nothing useful.
        '    import builtins\n'
        '    builtins._forked_child_pid = pid\n'
    )

    # NOTE, we can't easily pull state back from the subint.
    # For the CONTROL scenario we just time-bound the fork +
    # check the sentinel. If sentinel exists → child ran →
    # analysis wrong. If not → child aborted → analysis
    # confirmed.
    done = threading.Event()

    def _drive() -> None:
        try:
            _interpreters.exec(interp_id, bootstrap)
        except Exception as err:
            print(
                f' subint bootstrap raised (expected on some '
                f'CPython versions): {type(err).__name__}: {err}',
                flush=True,
            )
        finally:
            done.set()

    t = threading.Thread(
        target=_drive,
        name='control-subint-fork-launchpad',
        daemon=True,
    )
    t.start()
    done.wait(timeout=5.0)
    t.join(timeout=2.0)

    # Give the (possibly-aborted) child a moment to die.
    time.sleep(0.5)

    sentinel_present = os.path.exists(sentinel)
    verdict = (
        # "PASS" for our analysis means sentinel NOT present.
        'PASS' if not sentinel_present else 'FAIL (UNEXPECTED)'
    )
    print(
        f'[{verdict}] control: sentinel present={sentinel_present} '
        f'(analysis predicts False; child should abort before '
        f'writing)',
        flush=True,
    )
    if sentinel_present:
        os.unlink(sentinel)

    try:
        _interpreters.destroy(interp_id)
    except _interpreters.InterpreterError:
        pass

    return 0 if not sentinel_present else 1


# ----------------------------------------------------------------
# scenario: `main_thread_fork` (baseline sanity)
# ----------------------------------------------------------------


def scenario_main_thread_fork() -> int:
    _banner(
        '[baseline] fork from MAIN thread (expected: child exits normally)'
    )
    pid = os.fork()
    if pid == 0:
        os._exit(0)

    # Same wait + report path as the worker-thread scenarios:
    # `wait_child()` returns `(ok, status_str)`.
    ok, status_str = wait_child(pid, expect_exit_ok=True)
    _report(
        'main_thread_fork',
        ok=ok,
        status_str=status_str,
        expect_exit_ok=True,
    )
    return 0 if ok else 1


# ----------------------------------------------------------------
# scenario: `worker_thread_fork` (architectural assertion)
# ----------------------------------------------------------------


def _run_worker_thread_fork_scenario(
    label: str,
    *,
    child_target=None,
) -> int:
    '''
    Thin wrapper: delegate the actual fork to the
    `tractor.spawn._subint_forkserver` primitive, then wait
    on the child and render a pass/fail banner.

    '''
    try:
        pid: int = fork_from_worker_thread(
            child_target=child_target,
            thread_name=f'worker-fork-thread[{label}]',
        )
    except RuntimeError as err:
        print(f'[FAIL] {label}: {err}', flush=True)
        return 1

    print(f' forked child pid={pid}', flush=True)
    ok, status_str = wait_child(pid, expect_exit_ok=True)
    _report(
        label,
        ok=ok,
        status_str=status_str,
        expect_exit_ok=True,
    )
    return 0 if ok else 1


def scenario_worker_thread_fork() -> int:
    _banner(
        '[arch] fork from MAIN-INTERP WORKER thread '
        '(expected: child exits normally; this is the one '
        'that matters)'
    )
    return _run_worker_thread_fork_scenario(
        'worker_thread_fork',
    )


# ----------------------------------------------------------------
# scenario: `full_architecture`
# ----------------------------------------------------------------


_CHILD_TRIO_BOOTSTRAP: str = (
    'import trio\n'
    'async def _main():\n'
    '    await trio.sleep(0.05)\n'
    '    return 42\n'
    'result = trio.run(_main)\n'
    'assert result == 42, f"trio.run returned {result}"\n'
    'print(" CHILD subint: trio.run OK, result=42", '
    'flush=True)\n'
)


def _child_trio_in_subint() -> int:
    '''
    CHILD-side `child_target`: drive a trivial `trio.run()`
    inside a fresh legacy-config subint on a worker thread,
    using the `tractor.spawn._subint_forkserver.run_subint_in_worker_thread`
    primitive. Returns 0 on success.

    '''
    try:
        run_subint_in_worker_thread(
            _CHILD_TRIO_BOOTSTRAP,
            thread_name='child-subint-trio-thread',
        )
    except RuntimeError as err:
        print(
            f' CHILD: run_subint_in_worker_thread timed out / thread '
            f'never returned: {err}',
            flush=True,
        )
        return 3
    except BaseException as err:
        print(
            f' CHILD: subint bootstrap raised: '
            f'{type(err).__name__}: {err}',
            flush=True,
        )
        return 4
    return 0


def scenario_full_architecture() -> int:
    _banner(
        '[arch-full] worker-thread fork + child runs trio in a '
        'subint (end-to-end proposed arch)'
    )
    return _run_worker_thread_fork_scenario(
        'full_architecture',
        child_target=_child_trio_in_subint,
    )


# ----------------------------------------------------------------
# main
# ----------------------------------------------------------------


SCENARIOS: dict[str, Callable[[], int]] = {
    'control_subint_thread_fork': scenario_control_subint_thread_fork,
    'main_thread_fork': scenario_main_thread_fork,
    'worker_thread_fork': scenario_worker_thread_fork,
    'full_architecture': scenario_full_architecture,
}


def main() -> int:
    ap = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    ap.add_argument(
        '--scenario',
        choices=sorted(SCENARIOS.keys()),
        required=True,
    )
    args = ap.parse_args()
    return SCENARIOS[args.scenario]()


if __name__ == '__main__':
    sys.exit(main())


@ -0,0 +1,184 @@
# Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands
Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
Context — why this exists
-------------------------
The `tractor.spawn._subint_forkserver` submodule currently
carries two "non-trio" thread-hygiene constraints whose
necessity is tangled with issues that *should* dissolve
under PEP 684 isolated-mode subinterpreters:
1. `fork_from_worker_thread()` / `run_subint_in_worker_thread()`
internally allocate a **dedicated `threading.Thread`**
rather than using `trio.to_thread.run_sync()`.
2. The test helper is named
`run_fork_in_non_trio_thread()` — the
`non_trio` qualifier is load-bearing today.
This doc catalogs *why* those constraints exist, which of
them isolated-mode would fix, and what the
audit-and-cleanup path looks like once msgspec #563 is
resolved.
The three reasons the constraints exist
---------------------------------------
### 1. GIL-starvation class → fixed by PEP 684 isolated mode
The class-A hang documented in
`subint_sigint_starvation_issue.md` is entirely about
legacy-config subints **sharing the main GIL**. Once
msgspec #563 lands and tractor flips
`tractor.spawn._subint` to
`concurrent.interpreters.create()` (isolated config), each
subint gets its own GIL. Abandoned subint threads can't
contend for main's GIL → can't starve the main trio loop
→ signal-wakeup-pipe drains normally → no SIGINT-drop.
This class of hazard **dissolves entirely**. The
non-trio-thread requirement for *this reason* disappears.
### 2. Destroy race / tstate-recycling → orthogonal; unclear
The `subint_proc` dedicated-thread fix (commit `26fb8206`)
addressed a different issue: `_interpreters.destroy(interp_id)`
was blocking on a trio-cache worker that had run an
earlier `interp.exec()` for that subint. Working
hypothesis at the time was "the cached thread retains the
subint's tstate".
But tstate-handling is **not specific to GIL mode**:
`_PyXI_Enter` / `_PyXI_Exit` (the C-level machinery both
configs use to enter/leave a subint from a thread) should
restore the caller's tstate regardless of GIL config. So
isolated mode **doesn't obviously fix this**. It might be:
- A py3.13 bug fixed in later versions — we saw the race
first on 3.13 and never re-tested on 3.14 after moving
to dedicated threads.
- A genuine CPython quirk around cached threads that
exec'd into a subint, persisting across GIL modes.
- Something else we misdiagnosed — the empirical fix
(dedicated thread) worked but the analysis may have
been incomplete.
Only way to know: once we're on isolated mode, empirically
retry `trio.to_thread.run_sync(interp.exec, ...)` and see
if `destroy()` still blocks. If it does, keep the
dedicated thread; if not, one constraint relaxed.
### 3. Fork-from-main-interp-tstate (the constraint in this module's helper names)
The fork-from-main-interp-tstate invariant (CPython's
`PyOS_AfterFork_Child` →
`_PyInterpreterState_DeleteExceptMain` gate documented in
`subint_fork_blocked_by_cpython_post_fork_issue.md`) is
about the calling thread's **current** tstate at the
moment `os.fork()` runs. If trio's cache threads never
enter subints at all, their tstate is plain main-interp,
and fork from them would be fine.
The reason the smoke test +
`run_fork_in_non_trio_thread` test helper
currently use a dedicated `threading.Thread` is narrow:
**we don't want to risk a trio cache thread that has
previously been used as a subint driver being the one that
picks up the fork job**. If cached tstate doesn't get
cleared (back to reason #2), the fork's child-side
post-init would see the wrong interp and abort.
In an isolated-mode world where msgspec works:
- `subint_proc` would use the public
`concurrent.interpreters.create()` + `Interpreter.exec()`
/ `Interpreter.close()` — which *should* handle tstate
cleanly (they're the "blessed" API).
- If so, trio's cache threads are safe to fork from
regardless of whether they've previously driven subints.
- → the `non_trio` qualifier in
`run_fork_in_non_trio_thread` becomes
*overcautious* rather than load-bearing, and the
dedicated-thread primitives in `_subint_forkserver.py`
can likely be replaced with straight
`trio.to_thread.run_sync()` wrappers.
TL;DR
-----
| constraint | fixed by isolated mode? |
|---|---|
| GIL-starvation (class A) | **yes** |
| destroy race on cached worker | unclear — empirical test on py3.14 + isolated API required |
| fork-from-main-tstate requirement on worker | **probably yes, conditional on the destroy-race question above** |
If #2 also resolves on py3.14+ with isolated mode,
tractor could drop the `non_trio` qualifier from the fork
helper's name and just use `trio.to_thread.run_sync(...)`
for everything. But **we shouldn't do that preemptively**
— the current cautious design is cheap (one dedicated
thread per fork / per subint-exec) and correct.
Audit plan when msgspec #563 lands
----------------------------------
Assuming msgspec grows `Py_mod_multiple_interpreters`
support:
1. **Flip `tractor.spawn._subint` to isolated mode.** Drop
the `_interpreters.create('legacy')` call in favor of
the public API (`concurrent.interpreters.create()` +
`Interpreter.exec()` / `Interpreter.close()`). Run the
three `ai/conc-anal/subint_*_issue.md` reproducers —
class-A (`test_stale_entry_is_deleted` etc.) should
pass without the `skipon_spawn_backend('subint')` marks
(revisit the marker inventory).
2. **Empirical destroy-race retest.** In `subint_proc`,
swap the dedicated `threading.Thread` back to
`trio.to_thread.run_sync(Interpreter.exec, ...,
abandon_on_cancel=False)` and run the full subint test
suite. If `Interpreter.close()` (or the backing
destroy) blocks the same way as the legacy version
did, revert and keep the dedicated thread.
3. **If #2 clean**, audit `_subint_forkserver.py`:
- Rename `run_fork_in_non_trio_thread` → drop the
`_non_trio_` qualifier (e.g. `run_fork_in_thread`) or
inline the two-line `trio.to_thread.run_sync` call at
the call sites and drop the helper entirely.
- Consider whether `fork_from_worker_thread` +
`run_subint_in_worker_thread` still warrant being
separate module-level primitives or whether they
collapse into a compound
`trio.to_thread.run_sync`-driven pattern inside the
(future) `subint_forkserver_proc` backend.
4. **Doc fallout.** `subint_sigint_starvation_issue.md`
and `subint_cancel_delivery_hang_issue.md` both cite
the legacy-GIL-sharing architecture as the root cause.
Close them with commit-refs to the isolated-mode
migration. This doc itself should get a closing
post-mortem section noting which of #1/#2/#3 actually
resolved vs persisted.
References
----------
- `tractor.spawn._subint_forkserver` — the in-tree module
whose constraints this doc catalogs.
- `ai/conc-anal/subint_sigint_starvation_issue.md` — the
GIL-starvation class.
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`:
sibling Ctrl-C-able hang class.
- `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
— why fork-from-subint is blocked (this drives the
forkserver-via-non-subint-thread workaround).
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`
— empirical validation for the workaround.
- [PEP 684 — per-interpreter GIL](https://peps.python.org/pep-0684/)
- [PEP 734 — `concurrent.interpreters` public API](https://peps.python.org/pep-0734/)
- [jcrist/msgspec#563 — PEP 684 support tracker](https://github.com/jcrist/msgspec/issues/563)
- tractor issue #379 — subint backend tracking.


@ -0,0 +1,155 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subints-phase-b-hardening-and-fork-block
timestamp: 2026-04-22T20:07:23Z
git_ref: 797f57c
scope: code
substantive: true
raw_file: 20260422T200723Z_797f57c_prompt_io.raw.md
---
## Prompt
Session-spanning work on the Phase B `subint` spawn-backend.
Three distinct sub-phases in one log:
1. **Py3.13 gate tightening** — diagnose a reproducible hang
of subint spawn flow under py3.13 (works on py3.14), trace
to a private `_interpreters` module vintage issue, tighten
our feature gate from "`_interpreters` present" to "public
`concurrent.interpreters` present" (i.e. py3.14+).
2. **Test-harness hardening** — add `pytest-timeout` dep, put
`@pytest.mark.timeout(30, method='thread')` on the
three known-hanging subint tests cataloged in
`ai/conc-anal/subint_sigint_starvation_issue.md`. Separately,
code-review the user's in-flight `skipon_spawn_backend`
marker implementation; find four bugs; refactor to use
`item.iter_markers()`.
3. **`subint_fork` prototype → CPython-block finding** — draft
a WIP `subint_fork_proc` backend using a sub-interpreter as
a launchpad for `os.fork()` (to sidestep trio#1614). User
tests on py3.14, hits
`Fatal Python error: _PyInterpreterState_DeleteExceptMain:
not main interpreter`. Walk CPython sources (local clone at
`~/repos/cpython/`) to pinpoint the refusal
(`Modules/posixmodule.c:728` → `Python/pystate.c:1040`).
Revert implementation to a `NotImplementedError` stub in a
new `_subint_fork.py` submodule, document the finding in a
third `conc-anal/` doc with an upstream-report draft for
the CPython issue tracker. Finally, discuss user's proposed
workaround architecture (main-interp worker-thread
forkserver) and draft a standalone smoke-test script for
feasibility validation.
## Response summary
All three sub-phases landed concrete artifacts:
**Sub-phase 1** — `_subint.py` + `_spawn.py` gates + error
messages updated to require py3.14+ via the public
`concurrent.interpreters` module presence check. Module
docstring revised to explain the empirical reason
(py3.13's private `_interpreters` vintage wedges under
multi-trio-task usage even though minimal standalone
reproducers work fine there). Test-module
`pytest.importorskip` likewise switched.
**Sub-phase 2** — `pytest-timeout>=2.3` added to `testing`
dep group. `@pytest.mark.timeout(30, method='thread')`
applied on:
- `tests/discovery/test_registrar.py::test_stale_entry_is_deleted`
- `tests/test_cancellation.py::test_cancel_while_childs_child_in_sync_sleep`
- `tests/test_cancellation.py::test_multierror_fast_nursery`
- `tests/test_subint_cancellation.py::test_subint_non_checkpointing_child`
`method='thread'` documented inline as load-bearing — the
GIL-starvation path that drops `SIGINT` would equally drop
`SIGALRM`, so only a watchdog-thread timeout can reliably
escape.
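
To illustrate the watchdog-thread idea, here is a minimal stdlib-only sketch. It is not `pytest-timeout`'s actual implementation; `run_with_watchdog` and `on_timeout` are hypothetical names. The point is only that the timeout fires from its own timer thread, so it does not depend on a signal handler getting to run in the main thread.

```python
import threading
import time


def run_with_watchdog(fn, timeout, on_timeout):
    '''
    Run `fn()` while a timer thread waits `timeout` seconds.

    If `fn()` has not returned by then, `on_timeout()` is called
    from the timer thread (hypothetical helper, for illustration).

    '''
    timer = threading.Timer(timeout, on_timeout)
    timer.daemon = True
    timer.start()
    try:
        return fn()
    finally:
        # Harmless if the timer has already fired.
        timer.cancel()


fired = []
# Slow call: the watchdog fires while `fn` is still sleeping.
run_with_watchdog(lambda: time.sleep(0.3), 0.05, lambda: fired.append('slow'))
# Fast call: the timer is cancelled before it can fire.
run_with_watchdog(lambda: None, 5.0, lambda: fired.append('fast'))
print(fired)
```

A real watchdog would more likely dump tracebacks and hard-exit the process in `on_timeout` rather than append to a list.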
`skipon_spawn_backend` plugin refactored into a single
`iter_markers`-driven loop in `pytest_collection_modifyitems`
(~30 LOC replacing ~30 LOC of nested conditionals). Four
bugs dissolved: wrong `.get()` key, module-level `pytestmark`
suppressing per-test marks, unhandled `pytestmark = [list]`
form, `pytest.Makr` typo. Marker help text updated to
document the variadic backend-list + `reason=` kwarg
surface.
**Sub-phase 3** — Prototype drafted (then reverted):
- `tractor/spawn/_subint_fork.py` — new dedicated submodule
housing the `subint_fork_proc` stub. Module docstring +
fn docstring explain the attempt, the CPython-level
block, and the reason for keeping the stub in-tree
(documentation of the attempt + starting point if CPython
ever lifts the restriction).
- `tractor/spawn/_spawn.py`: `'subint_fork'` registered as a
`SpawnMethodKey` literal + in `_methods`, so
`--spawn-backend=subint_fork` routes to a clean
`NotImplementedError` pointing at the analysis doc rather
than an "invalid backend" error.
- `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`:
third sibling conc-anal doc. Full annotated CPython
source walkthrough from user-visible
`Fatal Python error` → `Modules/posixmodule.c:728
PyOS_AfterFork_Child()` → `Python/pystate.c:1040
_PyInterpreterState_DeleteExceptMain()` gate. Includes a
copy-paste-ready upstream-report draft for the CPython
issue tracker with a two-tier ask (ideally "make it work",
minimally "cleaner error than `Fatal Python error`
aborting the child").
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`:
standalone zero-tractor-import CPython-level smoke test
for the user's proposed workaround architecture
(forkserver on a main-interp worker thread). Four
argparse-driven scenarios: `control_subint_thread_fork`
(reproduces the known-broken case as a test-harness
sanity), `main_thread_fork` (baseline), `worker_thread_fork`
(architectural assertion), `full_architecture`
(end-to-end trio-in-subint in forked child). User will
run on py3.14 next.
## Files changed
See `git log 26fb820..HEAD --stat` for the canonical list.
New files this session:
- `tractor/spawn/_subint_fork.py`
- `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`
Modified (diff pointers in raw log):
- `tractor/spawn/_subint.py` (py3.14 gate)
- `tractor/spawn/_spawn.py` (`subint_fork` registration)
- `tractor/_testing/pytest.py` (`skipon_spawn_backend` refactor)
- `pyproject.toml` (`pytest-timeout` dep)
- `tests/discovery/test_registrar.py`,
`tests/test_cancellation.py`,
`tests/test_subint_cancellation.py` (timeout marks,
cross-refs to conc-anal docs)
## Human edits
Several back-and-forth iterations with user-driven
adjustments during the session:
- User corrected my initial mis-classification of
`test_cancel_while_childs_child_in_sync_sleep[subint-False]`
as Ctrl-C-able — second strace showed `EAGAIN`, putting
it squarely in class A (GIL-starvation). Re-analysis
preserved in the raw log.
- User independently fixed the `.get(reason)` → `.get('reason', reason)`
bug in the marker plugin before my review; preserved their
fix.
- User suggested moving the `subint_fork_proc` stub from
the bottom of `_subint.py` into its own
`_subint_fork.py` submodule — applied.
- User asked to keep the forkserver-architecture
discussion as background for the smoke-test rather than
committing to a tractor-side refactor until the smoke
test validates the CPython-level assumptions.
Commit messages in this range (b025c982 … 797f57c) were
drafted via `/commit-msg` + `rewrap.py --width 67`; user
landed them with the usual review.


@ -0,0 +1,343 @@
---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-22T20:07:23Z
git_ref: 797f57c
diff_cmd: git log 26fb820..HEAD # all session commits since the destroy-race fix log
---
Session-spanning conversation covering the Phase B hardening
of the `subint` spawn-backend and an investigation into a
proposed `subint_fork` follow-up which turned out to be
blocked at the CPython level. This log is a narrative capture
of the substantive turns (not every message) and references
the concrete code + docs the session produced. Per diff-ref
mode the actual code diffs are pointed at via `git log` on
each ref rather than duplicated inline.
## Narrative of the substantive turns
### Py3.13 hang / gate tightening
Diagnosed a reproducible hang of the `subint` backend under
py3.13 (test_spawning tests wedge after root-actor bringup).
Root cause: py3.13's vintage of the private `_interpreters` C
module has a latent thread/subint-interaction issue where
`_interpreters.exec()` silently fails to progress under
tractor's multi-trio usage pattern, even though a minimal
standalone `threading.Thread` + `_interpreters.exec()`
reproducer works fine on the same Python. Empirically
py3.14 fixes it.
Fix (from this session): tighten the `_has_subints` gate in
`tractor.spawn._subint` from "private module importable" to
"public `concurrent.interpreters` present" — which is 3.14+
only. This leaves `subint_proc()` unchanged in behavior (we
still call the *private* `_interpreters.create('legacy')`
etc. under the hood) but refuses to engage on 3.13.
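
The tightened gate can be sketched roughly as below. This is an assumption-laden illustration, not the in-tree code: `has_subints()` is a hypothetical stand-in for the `_has_subints` flag, and it simply requires both the public and the private module to import.

```python
import importlib
import sys


def has_subints() -> bool:
    '''
    Hypothetical stand-in for the `_has_subints` gate: require the
    public `concurrent.interpreters` module (py3.14+) AND the
    private `_interpreters` module used for legacy-config subints.

    '''
    for modname in ('concurrent.interpreters', '_interpreters'):
        try:
            importlib.import_module(modname)
        except ImportError:
            return False
    return True


print(
    f'python {sys.version_info.major}.{sys.version_info.minor}: '
    f'subint backend available = {has_subints()}'
)
```

On py3.13 the private module imports but the public one does not, so this check refuses to engage there, which matches the intent described above.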
Also tightened the matching gate in
`tractor.spawn._spawn.try_set_start_method('subint')` and
rev'd the corresponding error messages from "3.13+" to
"3.14+" with a sentence explaining why. Test-module
`pytest.importorskip` switched from `_interpreters` to
`concurrent.interpreters` to match.
### `pytest-timeout` dep + `skipon_spawn_backend` marker plumbing
Added `pytest-timeout>=2.3` to the `testing` dep group with
an inline comment pointing at the `ai/conc-anal/*.md` docs.
Applied `@pytest.mark.timeout(30, method='thread')` (the
`method='thread'` is load-bearing — `signal`-method
`SIGALRM` suffers the same GIL-starvation path that drops
`SIGINT` in the class-A hang pattern) to the three known-
hanging subint tests cataloged in
`subint_sigint_starvation_issue.md`.
Separately code-reviewed the user's newly-staged
`skipon_spawn_backend` pytest marker implementation in
`tractor/_testing/pytest.py`. Found four bugs:
1. `modmark.kwargs.get(reason)` called `.get()` with the
*variable* `reason` as the dict key instead of the string
`'reason'` — user-supplied `reason=` was never picked up.
(User had already fixed this locally via `.get('reason',
reason)` by the time my review happened — preserved that
fix.)
2. The module-level `pytestmark` branch suppressed per-test
marker handling (the per-test path sat in an `else:` branch
instead of running as an independent iteration).
3. `mod_pytestmark.mark` assumed a single
`MarkDecorator` — broke on the valid-pytest `pytestmark =
[mark, mark]` list form.
4. Typo: `pytest.Makr` → `pytest.Mark`.
Refactored the hook to use `item.iter_markers(name=...)`
which walks function + class + module scopes uniformly and
handles both `pytestmark` forms natively. ~30 LOC replaced
the original ~30 LOC of nested conditionals, all four bugs
dissolved. Also updated the marker help string to reflect
the variadic `*start_methods` + `reason=` surface.
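
A rough sketch of the single-loop shape follows. It is not the code in `tractor/_testing/pytest.py`: `skip_reasons_for` and `FakeItem` are hypothetical, and the fake item only duck-types the `iter_markers(name=...)`, `.args` and `.kwargs` surface that pytest items and marks expose, so the example runs without pytest installed.

```python
from types import SimpleNamespace


def skip_reasons_for(item, backend, marker_name='skipon_spawn_backend'):
    '''
    Collect skip reasons for `item` under the active spawn `backend`.

    `item.iter_markers(name=...)` yields matching marks from the
    function, class and module scopes, so no per-scope branching
    is needed. (Hypothetical helper, for illustration only.)

    '''
    reasons = []
    for mark in item.iter_markers(name=marker_name):
        if backend in mark.args:
            default = f'unsupported on spawn backend {backend!r}'
            # String key 'reason', with a fallback default.
            reasons.append(mark.kwargs.get('reason', default))
    return reasons


class FakeItem:
    '''Duck-typed stand-in for a pytest item (demo only).'''

    def __init__(self, marks):
        self._marks = marks

    def iter_markers(self, name=None):
        return (m for m in self._marks if name is None or m.name == name)


item = FakeItem([
    SimpleNamespace(
        name='skipon_spawn_backend',
        args=('subint',),
        kwargs={'reason': 'hangs under subint'},
    ),
    SimpleNamespace(name='timeout', args=(30,), kwargs={}),
])
print(skip_reasons_for(item, 'subint'))
print(skip_reasons_for(item, 'trio'))
```

Inside a real `pytest_collection_modifyitems()` hook, each collected reason would presumably be applied with `item.add_marker(pytest.mark.skip(reason=...))`.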
### `subint_fork_proc` prototype attempt
User's hypothesis: the known trio+`fork()` issues
(python-trio/trio#1614) could be sidestepped by using a
sub-interpreter purely as a launchpad — `os.fork()` from a
subint that has never imported trio → child is in a
trio-free context. In the child `execv()` back into
`python -m tractor._child` and the downstream handshake
matches `trio_proc()` identically.
Drafted the prototype at `tractor/spawn/_subint.py`'s bottom
(originally — later moved to its own submod, see below):
launchpad-subint creation, bootstrap code-string with
`os.fork()` + `execv()`, driver-thread orchestration,
parent-side `ipc_server.wait_for_peer()` dance. Registered
`'subint_fork'` as a new `SpawnMethodKey` literal, added
`case 'subint' | 'subint_fork':` feature-gate arm in
`try_set_start_method()`, added entry in `_methods` dict.
### CPython-level block discovered
User tested on py3.14 and saw:
```
Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
Python runtime state: initialized
Current thread 0x00007f6b71a456c0 [subint-fork-lau] (most recent call first):
File "<script>", line 2 in <module>
<script>:2: DeprecationWarning: This process (pid=802985) is multi-threaded, use of fork() may lead to deadlocks in the child.
```
Walked CPython sources (local clone at `~/repos/cpython/`):
- **`Modules/posixmodule.c:728` `PyOS_AfterFork_Child()`** —
post-fork child-side cleanup. Calls
`_PyInterpreterState_DeleteExceptMain(runtime)` with
`goto fatal_error` on non-zero status. Has the
`// Ideally we could guarantee tstate is running main.`
self-acknowledging-fragile comment directly above.
- **`Python/pystate.c:1040`
`_PyInterpreterState_DeleteExceptMain()`** — the
refusal. Hard `PyStatus_ERR("not main interpreter")` gate
when `tstate->interp != interpreters->main`. Docstring
formally declares the precondition ("If there is a
current interpreter state, it *must* be the main
interpreter"). `XXX` comments acknowledge further latent
issues within.
Definitive answer to "Open Question 1" of the prototype
docstring: **no, CPython does not support `os.fork()` from
a non-main sub-interpreter**. Not because the fork syscall
is blocked (it isn't — the parent returns a valid pid),
but because the child cannot survive CPython's post-fork
initialization. This is an enforced invariant, not an
incidental limitation.
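
Because the failure happens only in the child, the parent can detect it solely through the child's wait status. The sketch below shows one way to decode that status; `describe_wait_status` and `fork_and_wait` are hypothetical helpers (the in-tree `wait_child()` may differ), and the demo kills the child with `SIGKILL` as a stand-in for the abort that a `Fatal Python error` would typically cause. POSIX only.

```python
import os
import signal


def describe_wait_status(status: int) -> tuple[bool, str]:
    '''
    Decode a raw `os.waitpid()` status into `(exited_ok, description)`.
    (Hypothetical helper, for illustration only.)

    '''
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        return False, f'killed by signal {signal.Signals(sig).name}'
    if os.WIFEXITED(status):
        rc = os.WEXITSTATUS(status)
        return rc == 0, f'exited with rc={rc}'
    return False, f'unrecognized wait status {status}'


def fork_and_wait(child_fn) -> tuple[bool, str]:
    pid = os.fork()
    if pid == 0:
        # Child: never return into the parent's code path.
        try:
            child_fn()
        finally:
            os._exit(0)
    _, status = os.waitpid(pid, 0)
    return describe_wait_status(status)


# Normal exit vs. abnormal death by signal.
print(fork_and_wait(lambda: None))
print(fork_and_wait(lambda: os.kill(os.getpid(), signal.SIGKILL)))
```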
### Revert: move to stub submod + doc the finding
Per user request:
1. Reverted the working `subint_fork_proc` body to a
`NotImplementedError` stub, MOVED to its own submod
`tractor/spawn/_subint_fork.py` (keeps `_subint.py`
focused on the working `subint_proc` backend).
2. Updated `_spawn.py` to import the stub from the new
submod path; kept `'subint_fork'` in `SpawnMethodKey` +
`_methods` so `--spawn-backend=subint_fork` routes to a
clean `NotImplementedError` with pointer to the analysis
doc rather than an "invalid backend" error.
3. Wrote
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
with the full annotated CPython walkthrough + an
upstream-report draft for the CPython issue tracker.
Draft has a two-tier ask: ideally "make it work"
(pre-fork tstate-swap hook or `DeleteExceptFor(interp)`
variant), minimally "give us a clean `RuntimeError` in
the parent instead of a `Fatal Python error` aborting
the child silently".
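
The routing in step 2 could look roughly like the following trimmed-down sketch. All names mirror the ones mentioned above but the bodies are hypothetical; the real `SpawnMethodKey`, `_methods` and spawn-target signatures in `tractor.spawn._spawn` are more involved.

```python
from typing import Callable, Literal, get_args

# Hypothetical, reduced mirror of the registration described above.
SpawnMethodKey = Literal['trio', 'subint_fork']


def trio_proc(name: str) -> str:
    return f'spawned {name!r} via trio'


def subint_fork_proc(name: str) -> str:
    # Stub: stays registered so the backend name remains valid.
    raise NotImplementedError(
        '`subint_fork` is blocked at the CPython level, see '
        'ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md'
    )


_methods: dict[str, Callable[[str], str]] = {
    'trio': trio_proc,
    'subint_fork': subint_fork_proc,
}


def spawn(backend: str, name: str) -> str:
    if backend not in get_args(SpawnMethodKey):
        raise ValueError(f'invalid backend: {backend!r}')
    return _methods[backend](name)


for backend in ('trio', 'subint_fork', 'bogus'):
    try:
        print(spawn(backend, 'child'))
    except (NotImplementedError, ValueError) as err:
        print(f'{backend}: {type(err).__name__}: {err}')
```

Keeping the stub in the dispatch table is what separates "known but unsupported" (`NotImplementedError` with a doc pointer) from "unknown backend" (`ValueError` in this sketch).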
### Design discussion — main-interp-thread forkserver workaround
User proposed: set up a "subint forking server" that fork()s
on behalf of subint callers. Core insight: the CPython gate
is on `tstate->interp`, not thread identity, so **any thread
whose tstate is main-interp** can fork cleanly. A worker
thread attached to main-interp (never entering a subint)
satisfies the precondition.
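
The core of that idea can be tried with the stdlib alone. The sketch below forks from a plain `threading.Thread` that never enters a subint and reports the child's exit code; `fork_in_plain_thread` is a hypothetical name and is not tractor's `fork_from_worker_thread()`. POSIX only.

```python
import os
import threading


def fork_in_plain_thread(child_fn, timeout: float = 5.0) -> int:
    '''
    Fork from a regular `threading.Thread` and return the child's
    exit code. (Hypothetical sketch, for illustration only.)

    '''
    result: dict[str, int] = {}

    def _worker() -> None:
        pid = os.fork()
        if pid == 0:
            # Child: run the target, then hard-exit so we never
            # fall back into the parent's thread stack.
            rc = 1
            try:
                rc = int(child_fn() or 0)
            finally:
                os._exit(rc)
        result['pid'] = pid

    thread = threading.Thread(target=_worker, name='fork-worker')
    thread.start()
    thread.join(timeout)
    if 'pid' not in result:
        raise RuntimeError('fork worker thread did not report a pid')

    _, status = os.waitpid(result['pid'], 0)
    return os.waitstatus_to_exitcode(status)


print(fork_in_plain_thread(lambda: 0))
print(fork_in_plain_thread(lambda: 7))
```

On recent CPython versions this is expected to emit the same multi-threaded `fork()` `DeprecationWarning` seen in the crash output above; the warning alone does not abort the child.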
Structurally this is `mp.forkserver` (which tractor already
has as `mp_forkserver`) but **in-process**: instead of a
separate Python subproc as the fork server, we'd put the
forkserver on a thread in the tractor parent process. Pros:
faster spawn (no IPC marshalling to external server + no
separate Python startup), inherits already-imported modules
for free. Cons: less crash isolation (forkserver failure
takes the whole process).
Required tractor-side refactor: move the root actor's
`trio.run()` off main-interp-main-thread (so main-thread can
run the forkserver loop). Nontrivial; approximately the same
magnitude as "Phase C".
The design would also not fully resolve the class-A
GIL-starvation issue because child actors' trio still runs
inside subints (legacy config, msgspec PEP 684 pending).
Would mitigate SIGINT-starvation specifically if signal
handling moves to the forkserver thread.
Recommended pre-commitment: a standalone CPython-only smoke
test validating the four assumptions the arch rests on,
before any tractor-side work.
### Smoke-test script drafted
Wrote `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`:
argparse-driven, four scenarios (`control_subint_thread_fork`
reproducing the known-broken case, `main_thread_fork`
baseline, `worker_thread_fork` the architectural assertion,
`full_architecture` end-to-end with trio in a subint in the
forked child). No `tractor` imports; pure CPython + `_interpreters`
+ `trio`. Bails cleanly on py<3.14. Pass/fail banners per
scenario.
User will validate on their py3.14 env next.
## Per-code-artifact provenance
### `tractor/spawn/_subint_fork.py` (new submod)
> `git show 797f57c -- tractor/spawn/_subint_fork.py`
NotImplementedError stub for the subint-fork backend. Module
docstring + fn docstring explain the attempt, the CPython
block, and why the stub is kept in-tree. No runtime behavior
beyond raising with a pointer at the conc-anal doc.
### `tractor/spawn/_spawn.py` (modified)
> `git log 26fb820..HEAD -- tractor/spawn/_spawn.py`
- Added `'subint_fork'` to `SpawnMethodKey` literal with a
block comment explaining the CPython-level block.
- Generalized the `case 'subint':` arm to `case 'subint' |
'subint_fork':` since both use the same py3.14+ gate.
- Registered `subint_fork_proc` in `_methods` with a
pointer-comment at the analysis doc.
### `tractor/spawn/_subint.py` (modified across session)
> `git log 26fb820..HEAD -- tractor/spawn/_subint.py`
- Tightened `_has_subints` gate: dual-requires public
`concurrent.interpreters` + private `_interpreters`
(tests for py3.14-or-newer on the public-API presence,
then uses the private one for legacy-config subints
because `msgspec` still blocks the public isolated mode
per jcrist/msgspec#563).
- Updated module docstring, `subint_proc()` docstring, and
gate-error messages to reflect the 3.14+ requirement and
the reason (py3.13 wedges under multi-trio usage even
though the private module exists there).
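A minimal sketch of the dual-requirement gate described above, assuming it boils down to two import probes. The function name `has_subints_sketch` is hypothetical and this is not the actual `_has_subints` implementation in `tractor.spawn._subint`.

```python
def has_subints_sketch() -> bool:
    '''
    Return `True` only when BOTH the public py3.14+
    `concurrent.interpreters` wrapper and the private
    `_interpreters` C module are importable.

    '''
    try:
        # Public stdlib wrapper, only present on py3.14+.
        import concurrent.interpreters  # noqa: F401
        # Private module used to drive legacy-config subints.
        import _interpreters  # noqa: F401
    except ImportError:
        return False
    return True
```

Probing the public module first means a py3.13 interpreter, where only the private module exists, is rejected by the gate.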
### `tractor/_testing/pytest.py` (modified)
> `git log 26fb820..HEAD -- tractor/_testing/pytest.py`
- New `skipon_spawn_backend(*start_methods, reason=...)`
pytest marker expanded into `pytest.mark.skip(reason=...)`
at collection time via
`pytest_collection_modifyitems()`.
- Implementation uses `item.iter_markers(name=...)` which
walks function + class + module scopes uniformly and
handles both `pytestmark = <single Mark>` and
`pytestmark = [mark, ...]` forms natively. ~30-LOC
single-loop refactor replacing a prior nested
conditional that had four bugs (see "Review" narrative
above).
- Added `pytest.Config` / `pytest.Function` /
`pytest.FixtureRequest` type annotations on fixture
signatures while touching the file.
### `pyproject.toml` (modified)
> `git log 26fb820..HEAD -- pyproject.toml`
Added `pytest-timeout>=2.3` to `testing` dep group with
comment pointing at the `ai/conc-anal/` docs.
### `tests/discovery/test_registrar.py`, `tests/test_subint_cancellation.py`, `tests/test_cancellation.py` (modified)
> `git log 26fb820..HEAD -- tests/`
Applied `@pytest.mark.timeout(30, method='thread')` on
known-hanging subint tests. Extended comments to cross-
reference the `ai/conc-anal/*.md` docs. `method='thread'`
is documented inline as load-bearing (`signal`-method
SIGALRM suffers the same GIL-starvation path that drops
SIGINT).
### `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md` (new)
> `git show 797f57c -- ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
Third sibling doc under `conc-anal/`. Structure: TL;DR,
context ("what we tried"), symptom (the user's exact
`Fatal Python error` output), CPython source walkthrough
with excerpted snippets from `posixmodule.c` +
`pystate.c`, chain summary, definitive answer to Open
Question 1, `## Upstream-report draft (for CPython issue
tracker)` section with a two-tier ask, references.
### `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` (new, THIS turn)
Zero-tractor-import smoke test for the proposed workaround
architecture. Four argparse-driven scenarios covering the
control case + baseline + arch-critical case + end-to-end.
Pass/fail banners per scenario; clean `--help` output;
py3.13 early-exit.
## Non-code output (verbatim)
### The `strace` signature that kicked off the CPython walkthrough
```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 139801964688928
```
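One reading of the trace above, offered as an assumption and not something the log itself states: fd 16 is the write end of a non-blocking signal wakeup socket (the kind registered via `signal.set_wakeup_fd()`), and `EAGAIN` means its buffer is already full because nothing is draining it. The sketch below reproduces only that errno behavior on a `socket.socketpair()`; the name `fill_until_eagain_sketch` is hypothetical.

```python
import errno
import socket


def fill_until_eagain_sketch() -> int:
    '''
    Fill a non-blocking socket whose peer never reads and
    return the errno raised once the buffer is full.

    '''
    writer, reader = socket.socketpair()
    try:
        writer.setblocking(False)
        try:
            while True:
                # Same byte value as the `"\2"` in the trace,
                # sent in bulk so the buffer fills quickly.
                writer.send(b'\x02' * 4096)
        except BlockingIOError as exc:
            return exc.errno
    finally:
        writer.close()
        reader.close()
```

On Linux the returned value is expected to compare equal to `errno.EAGAIN`, matching the failed `write()` in the trace.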
### Key user quotes framing the direction
> ok actually we get this [fatal error] ... see if you can
> take a look at what's going on, in particular wrt to
> cpython's sources. pretty sure there's a local copy at
> ~/repos/cpython/
(Drove the CPython walkthrough that produced the
definitive refusal chain.)
> is there any reason we can't just sidestep this "must fork
> from main thread in main subint" issue by simply ensuring
> a "subint forking server" is always setup prior to
> invoking trio in a non-main-thread subint ...
(Drove the main-interp-thread-forkserver architectural
discussion + smoke-test script design.)
### CPython source tags for quick jump-back
```
Modules/posixmodule.c:728 PyOS_AfterFork_Child()
Modules/posixmodule.c:753 // Ideally we could guarantee tstate is running main.
Modules/posixmodule.c:778 status = _PyInterpreterState_DeleteExceptMain(runtime);
Python/pystate.c:1040 _PyInterpreterState_DeleteExceptMain()
Python/pystate.c:1044-1047 tstate->interp != main → PyStatus_ERR("not main interpreter")
```

@@ -139,7 +139,9 @@ def pytest_addoption(
 @pytest.fixture(scope='session', autouse=True)
-def loglevel(request) -> str:
+def loglevel(
+    request: pytest.FixtureRequest,
+) -> str:
     import tractor
     orig = tractor.log._default_loglevel
     level = tractor.log._default_loglevel = request.config.option.loglevel
@@ -156,7 +158,7 @@ def loglevel(request) -> str:
 @pytest.fixture(scope='function')
 def test_log(
-    request,
+    request: pytest.FixtureRequest,
     loglevel: str,
 ) -> tractor.log.StackLevelAdapter:
     '''
''' '''

@@ -146,13 +146,12 @@ def spawn(
     ids='ctl-c={}'.format,
 )
 def ctlc(
-    request,
+    request: pytest.FixtureRequest,
     ci_env: bool,
 ) -> bool:
-    use_ctlc = request.param
+    use_ctlc: bool = request.param
     node = request.node
     markers = node.own_markers
     for mark in markers:

@@ -520,8 +520,6 @@ async def kill_transport(
-# @pytest.mark.parametrize('use_signal', [False, True])
-#
 # Wall-clock bound via `pytest-timeout` (`method='thread'`).
 # Under `--spawn-backend=subint` this test can wedge in an
 # un-Ctrl-C-able state (abandoned-subint + shared-GIL
@@ -537,6 +535,16 @@ async def kill_transport(
     3,  # NOTE should be a 2.1s happy path.
     method='thread',
 )
+@pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
+# @pytest.mark.parametrize('use_signal', [False, True])
+#
 def test_stale_entry_is_deleted(
     debug_mode: bool,
     daemon: subprocess.Popen,
@@ -549,12 +557,6 @@ def test_stale_entry_is_deleted(
     stale entry and not delivering a bad portal.
     '''
-    if start_method == 'subint':
-        pytest.skip(
-            'XXX SUBINT HANGING TEST XXX\n'
-            'See oustanding issue(s)\n'
-        )
     async def main():
         name: str = 'transport_fails_actor'

@ -0,0 +1,329 @@
'''
Integration exercises for the `tractor.spawn._subint_forkserver`
submodule at three tiers:
1. the low-level `fork_from_worker_thread()` primitive
   driven from inside a real `trio.run()` in the parent
   process,
2. that same fork with `run_subint_in_worker_thread()`
   hosting a `trio.run()` inside a fresh subint in the
   forked child,
3. the full `subint_forkserver_proc` spawn backend wired
   through tractor's normal actor-nursery + portal-RPC
   machinery i.e. `open_root_actor` + `open_nursery` +
   `run_in_actor` against a subactor spawned via fork from a
   main-interp worker thread.
Background
----------
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
establishes that `os.fork()` from a non-main sub-interpreter
aborts the child at the CPython level. The sibling
`subint_fork_from_main_thread_smoketest.py` proves the escape
hatch: fork from a main-interp *worker thread* (one that has
never entered a subint) works, and the forked child can then
host its own `trio.run()` inside a fresh subint.
Those smoke-test scenarios are standalone, with no trio runtime
in the *parent*. Tiers (1)+(2) here cover the primitives
driven from inside `trio.run()` in the parent, and tier (3)
(the `*_spawn_basic` test) drives the registered
`subint_forkserver` spawn backend end-to-end against the
tractor runtime.
Gating
------
- py3.14+ (via `concurrent.interpreters` presence)
- no `--spawn-backend` restriction: the backend-level test
flips `tractor.spawn._spawn._spawn_method` programmatically
(via `try_set_start_method('subint_forkserver')`) and
restores it on teardown, so these tests are independent of
the session-level CLI backend choice.
'''
from __future__ import annotations
from functools import partial
import os
import pytest
import trio
import tractor
from tractor.devx import dump_on_hang
# Gate: subint forkserver primitives require py3.14+. Check
# the public stdlib wrapper's presence (added in 3.14) rather
# than `_interpreters` directly — see
# `tractor.spawn._subint` for why.
pytest.importorskip('concurrent.interpreters')
from tractor.spawn._subint_forkserver import ( # noqa: E402
fork_from_worker_thread,
run_subint_in_worker_thread,
wait_child,
)
from tractor.spawn import _spawn as _spawn_mod # noqa: E402
from tractor.spawn._spawn import try_set_start_method # noqa: E402
# ----------------------------------------------------------------
# child-side callables (passed via `child_target=` across fork)
# ----------------------------------------------------------------
_CHILD_TRIO_BOOTSTRAP: str = (
'import trio\n'
'async def _main():\n'
' await trio.sleep(0.05)\n'
' return 42\n'
'result = trio.run(_main)\n'
'assert result == 42, f"trio.run returned {result}"\n'
)
def _child_trio_in_subint() -> int:
    '''
    `child_target` for the trio-in-child scenario: drive a
    trivial `trio.run()` inside a fresh legacy-config subint
    on a worker thread.

    Returns an exit code suitable for `os._exit()`:
      - 0: subint-hosted `trio.run()` succeeded
      - 3: driver thread hang (timeout inside `run_subint_in_worker_thread`)
      - 4: subint bootstrap raised some other exception

    '''
    try:
        run_subint_in_worker_thread(
            _CHILD_TRIO_BOOTSTRAP,
            thread_name='child-subint-trio-thread',
        )
    except RuntimeError:
        # timeout / thread-never-returned
        return 3
    except BaseException:
        return 4
    return 0
# ----------------------------------------------------------------
# parent-side harnesses (run inside `trio.run()`)
# ----------------------------------------------------------------
async def run_fork_in_non_trio_thread(
    deadline: float,
    *,
    child_target=None,
) -> int:
    '''
    From inside a parent `trio.run()`, off-load the
    forkserver primitive to a main-interp worker thread via
    `trio.to_thread.run_sync()` and return the forked child's
    pid.

    Then `wait_child()` on that pid (also off-loaded so we
    don't block trio's event loop on `waitpid()`) and assert
    the child exited cleanly.

    '''
    with trio.fail_after(deadline):
        # NOTE: `fork_from_worker_thread` internally spawns its
        # own dedicated `threading.Thread` (not from trio's
        # cache) and joins it before returning, so we can
        # safely off-load via `to_thread.run_sync` without
        # worrying about the trio-thread-cache recycling the
        # runner. Pass `abandon_on_cancel=False` for the
        # same "bounded + clean" rationale we use in
        # `_subint.subint_proc`.
        pid: int = await trio.to_thread.run_sync(
            partial(
                fork_from_worker_thread,
                child_target,
                thread_name='test-subint-forkserver',
            ),
            abandon_on_cancel=False,
        )
        assert pid > 0

        ok, status_str = await trio.to_thread.run_sync(
            partial(
                wait_child,
                pid,
                expect_exit_ok=True,
            ),
            abandon_on_cancel=False,
        )
        assert ok, (
            f'forked child did not exit cleanly: '
            f'{status_str}'
        )
    return pid
# ----------------------------------------------------------------
# tests
# ----------------------------------------------------------------
# Bounded wall-clock via `pytest-timeout` (`method='thread'`)
# for the usual GIL-hostage safety reason documented in the
# sibling `test_subint_cancellation.py` / the class-A
# `subint_sigint_starvation_issue.md`. Each test also has an
# inner `trio.fail_after()` so assertion failures fire fast
# under normal conditions.
@pytest.mark.timeout(30, method='thread')
def test_fork_from_worker_thread_via_trio() -> None:
    '''
    Baseline: inside `trio.run()`, call
    `fork_from_worker_thread()` via `trio.to_thread.run_sync()`,
    get a child pid back, reap the child cleanly.

    No trio-in-child. If this regresses we know the
    parent-side trio/worker-thread plumbing is broken
    independent of any child-side subint machinery.

    '''
    deadline: float = 10.0
    with dump_on_hang(
        seconds=deadline,
        path='/tmp/subint_forkserver_baseline.dump',
    ):
        pid: int = trio.run(
            partial(run_fork_in_non_trio_thread, deadline),
        )
    # parent-side sanity: we got a real pid back.
    assert isinstance(pid, int) and pid > 0
    # by now the child has been waited on; it shouldn't be
    # reap-able again.
    with pytest.raises((ChildProcessError, OSError)):
        os.waitpid(pid, os.WNOHANG)
@pytest.mark.timeout(30, method='thread')
def test_fork_and_run_trio_in_child() -> None:
    '''
    End-to-end: inside the parent's `trio.run()`, off-load
    `fork_from_worker_thread()` to a worker thread, have the
    forked child then create a fresh subint and run
    `trio.run()` inside it on yet another worker thread.

    This is the full "forkserver + trio-in-subint-in-child"
    pattern the proposed `subint_forkserver` spawn backend
    would rest on.

    '''
    deadline: float = 15.0
    with dump_on_hang(
        seconds=deadline,
        path='/tmp/subint_forkserver_trio_in_child.dump',
    ):
        pid: int = trio.run(
            partial(
                run_fork_in_non_trio_thread,
                deadline,
                child_target=_child_trio_in_subint,
            ),
        )
    assert isinstance(pid, int) and pid > 0
# ----------------------------------------------------------------
# tier-3 backend test: drive the registered `subint_forkserver`
# spawn backend end-to-end through tractor's actor-nursery +
# portal-RPC machinery.
# ----------------------------------------------------------------
async def _trivial_rpc() -> str:
    '''
    Minimal subactor-side RPC body: just return a sentinel
    string the parent can assert on.

    '''
    return 'hello from subint-forkserver child'


async def _happy_path_forkserver(
    reg_addr: tuple[str, int | str],
    deadline: float,
) -> None:
    '''
    Parent-side harness: stand up a root actor, open an actor
    nursery, spawn one subactor via the currently-selected
    spawn backend (which this test will have flipped to
    `subint_forkserver`), run a trivial RPC through its
    portal, assert the round-trip result.

    '''
    with trio.fail_after(deadline):
        async with (
            tractor.open_root_actor(
                registry_addrs=[reg_addr],
            ),
            tractor.open_nursery() as an,
        ):
            portal: tractor.Portal = await an.run_in_actor(
                _trivial_rpc,
                name='subint-forkserver-child',
            )
            result: str = await portal.wait_for_result()
            assert result == 'hello from subint-forkserver child'
@pytest.fixture
def forkserver_spawn_method():
    '''
    Flip `tractor.spawn._spawn._spawn_method` to
    `'subint_forkserver'` for the duration of a test, then
    restore whatever was in place before (usually the
    session-level CLI choice, typically `'trio'`).

    Without this, other tests in the same session would
    observe the global flip and start spawning via fork,
    which is almost certainly NOT what their assertions were
    written against.

    '''
    prev_method: str = _spawn_mod._spawn_method
    prev_ctx = _spawn_mod._ctx
    try_set_start_method('subint_forkserver')
    try:
        yield
    finally:
        _spawn_mod._spawn_method = prev_method
        _spawn_mod._ctx = prev_ctx
@pytest.mark.timeout(60, method='thread')
def test_subint_forkserver_spawn_basic(
    reg_addr: tuple[str, int | str],
    forkserver_spawn_method,
) -> None:
    '''
    Happy-path: spawn ONE subactor via the
    `subint_forkserver` backend (parent-side fork from a
    main-interp worker thread), do a trivial portal-RPC
    round-trip, tear the nursery down cleanly.

    If this passes, the "forkserver + tractor runtime" arch
    is proven end-to-end: the registered
    `subint_forkserver_proc` spawn target successfully
    forks a child, the child runs `_actor_child_main()` +
    completes IPC handshake + serves an RPC, and the parent
    reaps via `_ForkedProc.wait()` without regressing any of
    the normal nursery teardown invariants.

    '''
    deadline: float = 20.0
    with dump_on_hang(
        seconds=deadline,
        path='/tmp/subint_forkserver_spawn_basic.dump',
    ):
        trio.run(
            partial(
                _happy_path_forkserver,
                reg_addr,
                deadline,
            ),
        )

@@ -21,6 +21,16 @@ _non_linux: bool = platform.system() != 'Linux'
 _friggin_windows: bool = platform.system() == 'Windows'
+pytestmark = pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
 async def assert_err(delay=0):
     await trio.sleep(delay)
     assert 0
@ -110,8 +120,17 @@ def test_remote_error(reg_addr, args_err):
assert exc.boxed_type == errtype assert exc.boxed_type == errtype
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See outstanding issue(s)\n'
# # TODO, put issue link!
# )
# )
def test_multierror( def test_multierror(
reg_addr: tuple[str, int], reg_addr: tuple[str, int],
start_method: str,
): ):
''' '''
Verify we raise a ``BaseExceptionGroup`` out of a nursery where Verify we raise a ``BaseExceptionGroup`` out of a nursery where
@@ -141,15 +160,28 @@ def test_multierror(
     trio.run(main)
-@pytest.mark.parametrize('delay', (0, 0.5))
 @pytest.mark.parametrize(
-    'num_subactors', range(25, 26),
+    'delay',
+    (0, 0.5),
+    ids='delays={}'.format,
 )
-def test_multierror_fast_nursery(reg_addr, start_method, num_subactors, delay):
-    """Verify we raise a ``BaseExceptionGroup`` out of a nursery where
+@pytest.mark.parametrize(
+    'num_subactors',
+    range(25, 26),
+    ids='num_subs={}'.format,
+)
+def test_multierror_fast_nursery(
+    reg_addr: tuple,
+    start_method: str,
+    num_subactors: int,
+    delay: float,
+):
+    '''
+    Verify we raise a ``BaseExceptionGroup`` out of a nursery where
     more than one actor errors and also with a delay before failure
     to test failure during an ongoing spawning.
-    """
+    '''
     async def main():
         async with tractor.open_nursery(
             registry_addrs=[reg_addr],
@@ -189,8 +221,15 @@ async def do_nothing():
     pass
-@pytest.mark.parametrize('mechanism', ['nursery_cancel', KeyboardInterrupt])
-def test_cancel_single_subactor(reg_addr, mechanism):
+@pytest.mark.parametrize(
+    'mechanism', [
+        'nursery_cancel',
+        KeyboardInterrupt,
+])
+def test_cancel_single_subactor(
+    reg_addr: tuple,
+    mechanism: str|KeyboardInterrupt,
+):
     '''
     Ensure a ``ActorNursery.start_actor()`` spawned subactor
     cancels when the nursery is cancelled.
@@ -232,9 +271,12 @@ async def stream_forever():
         await trio.sleep(0.01)
-@tractor_test
-async def test_cancel_infinite_streamer(start_method):
+@tractor_test(
+    timeout=6,
+)
+async def test_cancel_infinite_streamer(
+    start_method: str
+):
     # stream for at most 1 seconds
     with (
         trio.fail_after(4),
@ -257,6 +299,14 @@ async def test_cancel_infinite_streamer(start_method):
assert n.cancelled assert n.cancelled
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See outstanding issue(s)\n'
# # TODO, put issue link!
# )
# )
@pytest.mark.parametrize( @pytest.mark.parametrize(
'num_actors_and_errs', 'num_actors_and_errs',
[ [
@@ -286,7 +336,9 @@ async def test_cancel_infinite_streamer(start_method):
         'no_daemon_actors_fail_all_run_in_actors_sleep_then_fail',
     ],
 )
-@tractor_test
+@tractor_test(
+    timeout=10,
+)
 async def test_some_cancels_all(
     num_actors_and_errs: tuple,
     start_method: str,
@@ -370,7 +422,10 @@ async def test_some_cancels_all(
         pytest.fail("Should have gotten a remote assertion error?")
-async def spawn_and_error(breadth, depth) -> None:
+async def spawn_and_error(
+    breadth: int,
+    depth: int,
+) -> None:
     name = tractor.current_actor().name
     async with tractor.open_nursery() as nursery:
         for i in range(breadth):
@@ -396,7 +451,10 @@ async def spawn_and_error(breadth, depth) -> None:
 @tractor_test
-async def test_nested_multierrors(loglevel, start_method):
+async def test_nested_multierrors(
+    loglevel: str,
+    start_method: str,
+):
     '''
     Test that failed actor sets are wrapped in `BaseExceptionGroup`s. This
     test goes only 2 nurseries deep but we should eventually have tests
@@ -483,20 +541,21 @@ async def test_nested_multierrors(loglevel, start_method):
 @no_windows
 def test_cancel_via_SIGINT(
-    loglevel,
-    start_method,
-    spawn_backend,
+    loglevel: str,
+    start_method: str,
 ):
-    """Ensure that a control-C (SIGINT) signal cancels both the parent and
+    '''
+    Ensure that a control-C (SIGINT) signal cancels both the parent and
     child processes in trionic fashion
-    """
+    '''
     pid: int = os.getpid()
     async def main():
         with trio.fail_after(2):
             async with tractor.open_nursery() as tn:
                 await tn.start_actor('sucka')
-                if 'mp' in spawn_backend:
+                if 'mp' in start_method:
                     time.sleep(0.1)
                 os.kill(pid, signal.SIGINT)
                 await trio.sleep_forever()
@ -580,6 +639,14 @@ async def spawn_sub_with_sync_blocking_task():
print('exiting first subactor layer..\n') print('exiting first subactor layer..\n')
# @pytest.mark.skipon_spawn_backend(
# 'subint',
# reason=(
# 'XXX SUBINT HANGING TEST XXX\n'
# 'See outstanding issue(s)\n'
# # TODO, put issue link!
# )
# )
@pytest.mark.parametrize( @pytest.mark.parametrize(
'man_cancel_outer', 'man_cancel_outer',
[ [
@@ -694,7 +761,7 @@ def test_cancel_while_childs_child_in_sync_sleep(
 def test_fast_graceful_cancel_when_spawn_task_in_soft_proc_wait_for_daemon(
-    start_method,
+    start_method: str,
 ):
     '''
     This is a very subtle test which demonstrates how cancellation

@@ -26,6 +26,15 @@ from tractor._testing import (
 from .conftest import cpu_scaling_factor
+pytestmark = pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT GIL-CONTENTION HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
 # XXX TODO cases:
 # - [x] WE cancelled the peer and thus should not see any raised
 #   `ContextCancelled` as it should be reaped silently?

@@ -7,6 +7,14 @@ import tractor
 from tractor.experimental import msgpub
 from tractor._testing import tractor_test
+pytestmark = pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
 def test_type_checks():

@@ -14,6 +14,14 @@ from tractor.ipc._shm import (
     attach_shm_list,
 )
+pytestmark = pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT GIL-CONTENTION HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
 @tractor.context
 async def child_attach_shml_alot(
@@ -161,6 +161,14 @@ def test_subint_happy_teardown(
     trio.run(partial(_happy_path, reg_addr, deadline))
+@pytest.mark.skipon_spawn_backend(
+    'subint',
+    reason=(
+        'XXX SUBINT HANGING TEST XXX\n'
+        'See outstanding issue(s)\n'
+        # TODO, put issue link!
+    )
+)
 # Wall-clock bound via `pytest-timeout` (`method='thread'`)
 # as defense-in-depth over the inner `trio.fail_after(15)`.
 # Under the orphaned-channel hang class described in

@@ -224,8 +224,10 @@ def pytest_addoption(
     )
-def pytest_configure(config):
-    backend = config.option.spawn_backend
+def pytest_configure(
+    config: pytest.Config,
+):
+    backend: str = config.option.spawn_backend
     from tractor.spawn._spawn import try_set_start_method
     try:
         try_set_start_method(backend)
@@ -241,10 +243,52 @@ def pytest_configure(config):
         'markers',
         'no_tpt(proto_key): test will (likely) not behave with tpt backend'
     )
+    config.addinivalue_line(
+        'markers',
+        'skipon_spawn_backend(*start_methods, reason=None): '
+        'skip this test under any of the given `--spawn-backend` '
+        'values; useful for backend-specific known-hang / -borked '
+        'cases (e.g. the `subint` GIL-starvation class documented '
+        'in `ai/conc-anal/subint_sigint_starvation_issue.md`).'
+    )
+def pytest_collection_modifyitems(
+    config: pytest.Config,
+    items: list[pytest.Function],
+):
+    '''
+    Expand any `@pytest.mark.skipon_spawn_backend('<backend>'[,
+    ...], reason='...')` markers into concrete
+    `pytest.mark.skip(reason=...)` calls for tests whose
+    backend-arg set contains the active `--spawn-backend`.
+
+    Uses `item.iter_markers(name=...)` which walks function +
+    class + module-level marks in the correct scope order (and
+    handles both the single-`MarkDecorator` and `list[Mark]`
+    forms of a module-level `pytestmark`) so the same marker
+    works at any level a user puts it.
+
+    '''
+    backend: str = config.option.spawn_backend
+    default_reason: str = f'Borked on --spawn-backend={backend!r}'
+    for item in items:
+        for mark in item.iter_markers(name='skipon_spawn_backend'):
+            if backend in mark.args:
+                reason: str = mark.kwargs.get(
+                    'reason',
+                    default_reason,
+                )
+                item.add_marker(pytest.mark.skip(reason=reason))
+                # first matching mark wins; no value in stacking
+                # multiple `skip`s on the same item.
+                break
 @pytest.fixture(scope='session')
-def debug_mode(request) -> bool:
+def debug_mode(
+    request: pytest.FixtureRequest,
+) -> bool:
     '''
     Flag state for whether `--tpdb` (for `tractor`-py-debugger)
     was passed to the test run.
@@ -258,12 +302,16 @@ def debug_mode(request) -> bool:
 @pytest.fixture(scope='session')
-def spawn_backend(request) -> str:
+def spawn_backend(
+    request: pytest.FixtureRequest,
+) -> str:
     return request.config.option.spawn_backend
 @pytest.fixture(scope='session')
-def tpt_protos(request) -> list[str]:
+def tpt_protos(
+    request: pytest.FixtureRequest,
+) -> list[str]:
     # allow quoting on CLI
     proto_keys: list[str] = [
@@ -291,7 +339,7 @@ def tpt_protos(request) -> list[str]:
     autouse=True,
 )
 def tpt_proto(
-    request,
+    request: pytest.FixtureRequest,
     tpt_protos: list[str],
 ) -> str:
     proto_key: str = tpt_protos[0]
@@ -343,7 +391,6 @@ def pytest_generate_tests(
     metafunc: pytest.Metafunc,
 ):
     spawn_backend: str = metafunc.config.option.spawn_backend
     if not spawn_backend:
         # XXX some weird windows bug with `pytest`?
         spawn_backend = 'trio'

@@ -63,6 +63,22 @@ SpawnMethodKey = Literal[
     'mp_spawn',
     'mp_forkserver',  # posix only
     'subint',  # py3.14+ via `concurrent.interpreters` (PEP 734)
+    # EXPERIMENTAL: blocked at the CPython level. The
+    # design goal was a `trio+fork`-safe subproc spawn via
+    # `os.fork()` from a trio-free launchpad sub-interpreter,
+    # but CPython's `PyOS_AfterFork_Child` → `_PyInterpreterState_DeleteExceptMain`
+    # requires fork come from the main interp. See
+    # `tractor.spawn._subint_fork` +
+    # `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
+    # + issue #379 for the full analysis.
+    'subint_fork',
+    # EXPERIMENTAL: the `subint_fork` workaround. `os.fork()`
+    # from a non-trio worker thread (never entered a subint)
+    # is CPython-legal and works cleanly; forked child runs
+    # `tractor._child._actor_child_main()` against a trio
+    # runtime, exactly like `trio_proc` but via fork instead
+    # of subproc-exec. See `tractor.spawn._subint_forkserver`.
+    'subint_forkserver',
 ]
 _spawn_method: SpawnMethodKey = 'trio'
@@ -115,15 +131,14 @@ def try_set_start_method(
         case 'trio':
             _ctx = None
-        case 'subint':
-            # subints need no `mp.context`; feature-gate on the
-            # py3.14 public `concurrent.interpreters` wrapper
-            # (PEP 734). We actually drive the private
-            # `_interpreters` C module in legacy mode (see
-            # `tractor.spawn._subint` for why) but py3.13's
-            # vintage of that private module hangs under our
-            # multi-trio usage, so we refuse it via the public-
-            # module presence check.
+        case 'subint' | 'subint_fork' | 'subint_forkserver':
+            # All subint-family backends need no `mp.context`;
+            # all three feature-gate on the py3.14 public
+            # `concurrent.interpreters` wrapper (PEP 734). See
+            # `tractor.spawn._subint` for the detailed
+            # reasoning. `subint_fork` is blocked at the
+            # CPython level (raises `NotImplementedError`);
+            # `subint_forkserver` is the working workaround.
             from ._subint import _has_subints
             if not _has_subints:
                 raise RuntimeError(
@@ -461,6 +476,8 @@ async def new_proc(
 from ._trio import trio_proc
 from ._mp import mp_proc
 from ._subint import subint_proc
+from ._subint_fork import subint_fork_proc
+from ._subint_forkserver import subint_forkserver_proc
 # proc spawning backend target map
@@ -469,4 +486,14 @@ _methods: dict[SpawnMethodKey, Callable] = {
     'mp_spawn': mp_proc,
     'mp_forkserver': mp_proc,
     'subint': subint_proc,
+    # blocked at CPython level: see `_subint_fork.py` +
+    # `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
+    # Kept here so `--spawn-backend=subint_fork` routes to a
+    # clean `NotImplementedError` with pointer to the analysis,
+    # rather than an "invalid backend" error.
+    'subint_fork': subint_fork_proc,
+    # WIP: fork-from-non-trio-worker-thread, works on py3.14+
+    # (validated via `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`).
+    # See `tractor.spawn._subint_forkserver`.
+    'subint_forkserver': subint_forkserver_proc,
 }

@ -431,3 +431,5 @@ async def subint_proc(
finally: finally:
if not cancelled_during_spawn: if not cancelled_during_spawn:
actor_nursery._children.pop(uid, None) actor_nursery._children.pop(uid, None)

@ -0,0 +1,153 @@
# tractor: structured concurrent "actors".
# Copyright 2018-eternity Tyler Goodlet.
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
'''
`subint_fork` spawn backend: BLOCKED at CPython level.
The idea was to use a sub-interpreter purely as a launchpad
from which to call `os.fork()`, sidestepping the well-known
trio+fork issues (python-trio/trio#1614 etc.) by guaranteeing
the forking interp had never imported `trio`.
**IT DOES NOT WORK ON CURRENT CPYTHON.** The fork syscall
itself succeeds (in the parent), but the forked CHILD
process aborts immediately during CPython's post-fork
cleanup: `PyOS_AfterFork_Child()` calls
`_PyInterpreterState_DeleteExceptMain()`, which refuses to
operate when the current tstate belongs to a non-main
sub-interpreter.
Full annotated walkthrough from the user-visible error
(`Fatal Python error: _PyInterpreterState_DeleteExceptMain:
not main interpreter`) down to the specific CPython source
lines that enforce this is in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
We keep this submodule as a dedicated documentation of the
attempt. If CPython ever lifts the restriction (e.g., via a
force-destroy primitive or a hook that swaps tstate to main
pre-fork), the structural sketch preserved in this file's
git history is a concrete starting point for a working impl.
See also: issue #379's "Our own thoughts, ideas for
`fork()`-workaround/hacks..." section.
'''
from __future__ import annotations
import sys
from typing import (
Any,
TYPE_CHECKING,
)
import trio
from trio import TaskStatus
from tractor.runtime._portal import Portal
from ._subint import _has_subints
if TYPE_CHECKING:
from tractor.discovery._addr import UnwrappedAddress
from tractor.runtime._runtime import Actor
from tractor.runtime._supervise import ActorNursery
async def subint_fork_proc(
name: str,
actor_nursery: ActorNursery,
subactor: Actor,
errors: dict[tuple[str, str], Exception],
bind_addrs: list[UnwrappedAddress],
parent_addr: UnwrappedAddress,
_runtime_vars: dict[str, Any],
*,
infect_asyncio: bool = False,
task_status: TaskStatus[Portal] = trio.TASK_STATUS_IGNORED,
proc_kwargs: dict[str, Any] = {},
) -> None:
'''
EXPERIMENTAL: currently blocked by a CPython invariant.
Attempted design
----------------
1. Parent creates a fresh legacy-config subint.
2. A worker OS-thread drives the subint through a
bootstrap that calls `os.fork()`.
3. In the forked CHILD, `os.execv()` back into
`python -m tractor._child` (fresh process).
4. In the fork-PARENT, the launchpad subint is destroyed;
parent-side trio task proceeds identically to
`trio_proc()` (wait for child connect-back, send
`SpawnSpec`, yield `Portal`, etc.).
Why it doesn't work
-------------------
CPython's `PyOS_AfterFork_Child()` (in
`Modules/posixmodule.c`) calls
`_PyInterpreterState_DeleteExceptMain()` (in
`Python/pystate.c`) as part of post-fork cleanup. That
function requires the current `PyThreadState` belong to
the **main** interpreter. When `os.fork()` is called
from within a sub-interpreter, the child wakes up with
its tstate still pointing at the (now-stale) subint, and
this check fails with `PyStatus_ERR("not main
interpreter")`, triggering a `fatal_error` goto and
aborting the child process.
CPython devs acknowledge the fragility with a
`// Ideally we could guarantee tstate is running main.`
comment right above the call site.
See
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
for the full annotated walkthrough + upstream-report
draft.
Why we keep this stub
---------------------
- Documents the attempt in-tree so the next person who
has this idea finds the reason it doesn't work rather
than rediscovering the same CPython-level dead end.
- If CPython ever lifts the restriction (e.g., via a
force-destroy primitive or a hook that swaps tstate
to main pre-fork), this submodule's git history holds
the structural sketch of what a working impl would
look like.
'''
if not _has_subints:
raise RuntimeError(
f'The {"subint_fork"!r} spawn backend requires '
f'Python 3.14+.\n'
f'Current runtime: {sys.version}'
)
raise NotImplementedError(
'The `subint_fork` spawn backend is blocked at the '
'CPython level — `os.fork()` from a non-main '
'sub-interpreter is refused by '
'`PyOS_AfterFork_Child()` → '
'`_PyInterpreterState_DeleteExceptMain()`, which '
'aborts the child with '
'`Fatal Python error: not main interpreter`.\n'
'\n'
'See '
'`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md` '
'for the full analysis + upstream-report draft.'
)
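
The two-stage guard at the end of this stub (feature gate first, then the hard `NotImplementedError`) relies on detecting whether the runtime ships sub-interpreter support. A minimal sketch of such a gate follows, under stated assumptions: `has_public_subints()` and `require_subints()` are hypothetical helper names, and probing with `importlib.util.find_spec()` is a substitute for the `try: import` gate the patch itself uses, chosen here because it avoids importing the module.

```python
import importlib.util
import sys
from typing import Optional


def has_public_subints() -> bool:
    '''
    Report whether the public `concurrent.interpreters`
    package can be located, without importing it.
    '''
    return (
        importlib.util.find_spec('concurrent.interpreters')
        is not None
    )


def require_subints(
    backend: str,
    available: Optional[bool] = None,
) -> None:
    '''
    Raise `RuntimeError` when sub-interpreter support is
    missing. `available` can be forced for testing; when
    left as `None` the runtime is probed.
    '''
    if available is None:
        available = has_public_subints()
    if not available:
        raise RuntimeError(
            f'The {backend!r} spawn backend requires '
            f'Python 3.14+.\n'
            f'Current runtime: {sys.version}'
        )
```

Keeping the probe separate from the raise makes the "unsupported runtime" path testable on any Python version.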


@ -0,0 +1,687 @@
# tractor: structured concurrent "actors".
# Copyright 2018-eternity Tyler Goodlet.
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
'''
Forkserver-style `os.fork()` primitives for the `subint`-hosted
actor model.
Background
----------
CPython refuses `os.fork()` from a non-main sub-interpreter:
`PyOS_AfterFork_Child()` calls
`_PyInterpreterState_DeleteExceptMain()`, which gates on the calling
thread's tstate belonging to the main interpreter and aborts
the forked child otherwise. The full walkthrough (with source
refs) lives in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
However, `os.fork()` from a regular `threading.Thread` attached
to the *main* interpreter (i.e. a worker thread that has
never entered a subint) works cleanly. Empirically validated
across four scenarios by
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` on
py3.14.
This submodule lifts the validated primitives out of the
smoke-test and into tractor proper, so they can eventually be
wired into a real "subint forkserver" spawn backend where:
- A dedicated main-interp worker thread owns all `os.fork()`
calls (never enters a subint).
- The tractor parent-actor's `trio.run()` lives in a
sub-interpreter on a different worker thread.
- When a spawn is requested, the trio-task signals the
forkserver thread; the forkserver forks; child re-enters
the same pattern (trio in a subint + forkserver on main).
This mirrors the stdlib `multiprocessing.forkserver` design
but keeps the forkserver in-process for faster spawn latency
and inherited parent state.
Status
------
**EXPERIMENTAL**: wired as the `'subint_forkserver'` entry
in `tractor.spawn._spawn._methods` and selectable via
`try_set_start_method('subint_forkserver')` or
`--spawn-backend=subint_forkserver`. Parent-side spawn,
child-side runtime bring-up and normal portal-RPC teardown
are validated by the backend-tier test in
`tests/spawn/test_subint_forkserver.py::test_subint_forkserver_spawn_basic`.
Still-open work (tracked on tractor #379):
- no cancellation / hard-kill stress coverage yet (counterpart
to `tests/test_subint_cancellation.py` for the plain
`subint` backend),
- child-side "subint-hosted root runtime" mode (the second
half of the envisioned arch): currently the forked child
runs plain `_trio_main` via `spawn_method='trio'`; the
subint-hosted variant is still the future step gated on
msgspec PEP 684 support,
- thread-hygiene audit of the two `threading.Thread`
primitives below, gated on the same msgspec unblock
(see TODO section further down).
TODO: cleanup gated on msgspec PEP 684 support
-----------------------------------------------
Both primitives below allocate a dedicated
`threading.Thread` rather than using
`trio.to_thread.run_sync()`. That's a cautious design
rooted in three distinct-but-entangled issues (GIL
starvation from legacy-config subints, tstate-recycling
destroy race on trio cache threads, fork-from-main-tstate
invariant). Some of those dissolve under PEP 684
isolated-mode subints; one requires empirical re-testing
to know.
Full analysis + audit plan for when we can revisit is in
`ai/conc-anal/subint_forkserver_thread_constraints_on_pep684_issue.md`.
Intent: file a follow-up GH issue linked to #379 once
[jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)
unblocks isolated-mode subints in tractor.
See also
--------
- `tractor.spawn._subint_fork`: the stub for the
fork-from-subint strategy that DIDN'T work (kept as
in-tree documentation of the attempt + CPython-level
block).
- `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`:
the CPython source walkthrough.
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`:
the standalone feasibility check (now delegates to
this module for the primitives it exercises).
'''
from __future__ import annotations
import os
import signal
import sys
import threading
from functools import partial
from typing import (
Any,
Callable,
TYPE_CHECKING,
)
import trio
from trio import TaskStatus
from tractor.log import get_logger
from tractor.msg import (
types as msgtypes,
pretty_struct,
)
from tractor.runtime._state import current_actor
from tractor.runtime._portal import Portal
from ._spawn import (
cancel_on_completion,
soft_kill,
)
if TYPE_CHECKING:
from tractor.discovery._addr import UnwrappedAddress
from tractor.ipc import (
_server,
)
from tractor.runtime._runtime import Actor
from tractor.runtime._supervise import ActorNursery
log = get_logger('tractor')
# Feature-gate: py3.14+ via the public `concurrent.interpreters`
# wrapper. Matches the gate in `tractor.spawn._subint` —
# see that module's docstring for why we require the public
# API's presence even though we reach into the private
# `_interpreters` C module for actual calls.
try:
from concurrent import interpreters as _public_interpreters # noqa: F401 # type: ignore
import _interpreters # type: ignore
_has_subints: bool = True
except ImportError:
_interpreters = None # type: ignore
_has_subints: bool = False
def _format_child_exit(
status: int,
) -> str:
'''
Render `os.waitpid()`-returned status as a short human
string (`'rc=0'` / `'signal=SIGABRT'` / etc.) for log
output.
'''
if os.WIFEXITED(status):
return f'rc={os.WEXITSTATUS(status)}'
elif os.WIFSIGNALED(status):
sig: int = os.WTERMSIG(status)
return f'signal={signal.Signals(sig).name}'
else:
return f'raw_status={status}'
def wait_child(
pid: int,
*,
expect_exit_ok: bool = True,
) -> tuple[bool, str]:
'''
`os.waitpid()` + classify the child's exit as
expected-or-not.
`expect_exit_ok=True`: expect a clean `rc=0`. `False`:
expect abnormal death (any signal or nonzero rc). Used
by the control-case smoke-test scenario where CPython
is meant to abort the child.
Returns `(ok, status_str)`: `ok` reflects whether the
observed outcome matches `expect_exit_ok`, `status_str`
is a short render of the actual status.
'''
_, status = os.waitpid(pid, 0)
exited_normally: bool = (
os.WIFEXITED(status)
and
os.WEXITSTATUS(status) == 0
)
ok: bool = (
exited_normally
if expect_exit_ok
else not exited_normally
)
return ok, _format_child_exit(status)
def fork_from_worker_thread(
child_target: Callable[[], int] | None = None,
*,
thread_name: str = 'subint-forkserver',
join_timeout: float = 10.0,
) -> int:
'''
`os.fork()` from a main-interp worker thread; return the
forked child's pid.
The calling context **must** be the main interpreter
(not a subinterpreter); that's the whole point of this
primitive. A regular `threading.Thread(target=...)`
spawned from main-interp code satisfies this
automatically because Python attaches the thread's
tstate to the *calling* interpreter, and our main
thread's calling interp is always main.
If `child_target` is provided, it runs IN the forked
child process before `os._exit` is called. The callable
should return an int used as the child's exit rc. If
`child_target` is None, the child `_exit(0)`s immediately
(useful for the baseline sanity case).
On the PARENT side, this function drives the worker
thread to completion (`fork()` returns near-instantly;
the thread is expected to exit promptly) and then
returns the forked child's pid. Raises `RuntimeError`
if the worker thread fails to return within
`join_timeout` seconds, which would be an unexpected
CPython pathology.
'''
if not _has_subints:
raise RuntimeError(
'subint-forkserver primitives require Python '
'3.14+ (public `concurrent.interpreters` module '
'not present on this runtime).'
)
# Use a pipe to shuttle the forked child's pid from the
# worker thread back to the caller.
rfd, wfd = os.pipe()
def _worker() -> None:
'''
Runs on the forkserver worker thread. Forks; child
runs `child_target` (if any) and exits; parent side
writes the child pid to the pipe so the main-thread
caller can retrieve it.
'''
pid: int = os.fork()
if pid == 0:
# CHILD: close the pid-pipe ends (we don't use
# them here), run the user callable if any, exit.
os.close(rfd)
os.close(wfd)
rc: int = 0
if child_target is not None:
try:
rc = child_target() or 0
except BaseException as err:
log.error(
f'subint-forkserver child_target '
f'raised:\n'
f'|_{type(err).__name__}: {err}'
)
rc = 2
os._exit(rc)
else:
# PARENT (still inside the worker thread):
# hand the child pid back to main via pipe.
os.write(wfd, pid.to_bytes(8, 'little'))
worker: threading.Thread = threading.Thread(
target=_worker,
name=thread_name,
daemon=False,
)
worker.start()
worker.join(timeout=join_timeout)
if worker.is_alive():
# Pipe cleanup best-effort before bail.
try:
os.close(rfd)
except OSError:
pass
try:
os.close(wfd)
except OSError:
pass
raise RuntimeError(
f'subint-forkserver worker thread '
f'{thread_name!r} did not return within '
f'{join_timeout}s — this is unexpected since '
f'`os.fork()` should return near-instantly on '
f'the parent side.'
)
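# Close our write end before reading so that a worker which
# died before writing (e.g. `os.fork()` raised) produces EOF
# here rather than blocking this read forever.
os.close(wfd)
pid_bytes: bytes = os.read(rfd, 8)
os.close(rfd)
if len(pid_bytes) != 8:
raise RuntimeError(
f'subint-forkserver worker thread '
f'{thread_name!r} exited without reporting a '
f'child pid.'
)
pid: int = int.from_bytes(pid_bytes, 'little')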
log.runtime(
f'subint-forkserver forked child\n'
f'(>\n'
f' |_pid={pid}\n'
)
return pid
def run_subint_in_worker_thread(
bootstrap: str,
*,
thread_name: str = 'subint-trio',
join_timeout: float = 10.0,
) -> None:
'''
Create a fresh legacy-config sub-interpreter and drive
the given `bootstrap` code string through
`_interpreters.exec()` on a dedicated worker thread.
Naming mirrors `fork_from_worker_thread()`:
"<action>_in_worker_thread", where the action here is "run a
subint", not "run trio" per se. Typical `bootstrap`
content does import `trio` + call `trio.run()`, but
nothing about this primitive requires trio; it's a
generic "host a subint on a worker thread" helper.
Intended mainly for use inside a fork-child (see
`tractor.spawn._subint_forkserver` module docstring) but
works anywhere.
See `tractor.spawn._subint.subint_proc` for the matching
pattern tractor uses at the sub-actor level.
Destroys the subint after the thread joins.
'''
if not _has_subints:
raise RuntimeError(
'subint-forkserver primitives require Python '
'3.14+.'
)
interp_id: int = _interpreters.create('legacy')
log.runtime(
f'Created child-side subint for trio.run()\n'
f'(>\n'
f' |_interp_id={interp_id}\n'
)
err: BaseException | None = None
def _drive() -> None:
nonlocal err
try:
_interpreters.exec(interp_id, bootstrap)
except BaseException as e:
err = e
worker: threading.Thread = threading.Thread(
target=_drive,
name=thread_name,
daemon=False,
)
worker.start()
worker.join(timeout=join_timeout)
try:
_interpreters.destroy(interp_id)
except _interpreters.InterpreterError as e:
log.warning(
f'Could not destroy child-side subint '
f'{interp_id}: {e}'
)
if worker.is_alive():
raise RuntimeError(
f'child-side subint trio-driver thread '
f'{thread_name!r} did not return within '
f'{join_timeout}s.'
)
if err is not None:
raise err
class _ForkedProc:
'''
Thin `trio.Process`-compatible shim around a raw OS pid
returned by `fork_from_worker_thread()`, exposing just
enough surface for the `soft_kill()` / hard-reap pattern
borrowed from `trio_proc()`.
Unlike `trio.Process`, we have no direct handles on the
child's std-streams (fork-without-exec inherits the
parent's FDs, but we don't marshal them into this
wrapper), so `.stdin`/`.stdout`/`.stderr` are all `None`,
which matches what `soft_kill()` handles via its
`is not None` guards.
'''
def __init__(self, pid: int):
self.pid: int = pid
self._returncode: int | None = None
# `soft_kill`/`hard_kill` check these for pipe
# teardown — all None since we didn't wire up pipes
# on the fork-without-exec path.
self.stdin = None
self.stdout = None
self.stderr = None
def poll(self) -> int | None:
'''
Non-blocking liveness probe. Returns `None` if the
child is still running, else its exit code (negative
for signal-death, matching `subprocess.Popen`
convention).
'''
if self._returncode is not None:
return self._returncode
try:
waited_pid, status = os.waitpid(self.pid, os.WNOHANG)
except ChildProcessError:
# already reaped (or never existed) — treat as
# clean exit for polling purposes.
self._returncode = 0
return 0
if waited_pid == 0:
return None
self._returncode = self._parse_status(status)
return self._returncode
@property
def returncode(self) -> int | None:
return self._returncode
async def wait(self) -> int:
'''
Async blocking wait for the child's exit, off-loaded
to a trio cache thread so we don't block the event
loop on `waitpid()`. Safe to call multiple times;
subsequent calls return the cached rc without
re-issuing the syscall.
'''
if self._returncode is not None:
return self._returncode
_, status = await trio.to_thread.run_sync(
os.waitpid,
self.pid,
0,
abandon_on_cancel=False,
)
self._returncode = self._parse_status(status)
return self._returncode
def kill(self) -> None:
'''
OS-level `SIGKILL` to the child. Swallows
`ProcessLookupError` (already dead).
'''
try:
os.kill(self.pid, signal.SIGKILL)
except ProcessLookupError:
pass
def _parse_status(self, status: int) -> int:
if os.WIFEXITED(status):
return os.WEXITSTATUS(status)
elif os.WIFSIGNALED(status):
# negative rc by `subprocess.Popen` convention
return -os.WTERMSIG(status)
return 0
def __repr__(self) -> str:
return (
f'<_ForkedProc pid={self.pid} '
f'returncode={self._returncode}>'
)
async def subint_forkserver_proc(
name: str,
actor_nursery: ActorNursery,
subactor: Actor,
errors: dict[tuple[str, str], Exception],
# passed through to actor main
bind_addrs: list[UnwrappedAddress],
parent_addr: UnwrappedAddress,
_runtime_vars: dict[str, Any],
*,
infect_asyncio: bool = False,
task_status: TaskStatus[Portal] = trio.TASK_STATUS_IGNORED,
proc_kwargs: dict[str, Any] = {},
) -> None:
'''
Spawn a subactor via `os.fork()` from a non-trio worker
thread (see `fork_from_worker_thread()`), with the forked
child running `tractor._child._actor_child_main()` and
connecting back via tractor's normal IPC handshake.
Supervision model mirrors `trio_proc()`: we manage a
real OS subprocess, so `Portal.cancel_actor()` +
`soft_kill()` on graceful teardown and `os.kill(SIGKILL)`
on hard-reap both apply directly (no
`_interpreters.destroy()` voodoo needed since the child
is in its own process).
The only real difference from `trio_proc` is the spawn
mechanism: fork from a known-clean main-interp worker
thread instead of `trio.lowlevel.open_process()`.
'''
if not _has_subints:
raise RuntimeError(
f'The {"subint_forkserver"!r} spawn backend '
f'requires Python 3.14+.\n'
f'Current runtime: {sys.version}'
)
uid: tuple[str, str] = subactor.aid.uid
loglevel: str | None = subactor.loglevel
# Closure captured into the fork-child's memory image.
# In the child this is the first post-fork Python code to
# run, on what was the fork-worker thread in the parent.
def _child_target() -> int:
# Lazy import so the parent doesn't pay for it on
# every spawn — it's module-level in `_child` but
# cheap enough to re-resolve here.
from tractor._child import _actor_child_main
# XXX, fork inherits the parent's entire memory
# image — including `tractor.runtime._state` globals
# that encode "this process is the root actor":
#
# - `_runtime_vars['_is_root']` → True in parent
# - pre-populated `_root_mailbox`, `_registry_addrs`
# - the parent's `_current_actor` singleton
#
# A fresh `exec`-based child would start with the
# `_state` module's defaults (all falsey / empty).
# Replicate that here so the new child-side `Actor`
# sees a "cold" runtime — otherwise `Actor.__init__`
# takes the `is_root_process() == True` branch and
# pre-populates `self.enable_modules`, which then
# trips the `assert not self.enable_modules` gate at
# the top of `Actor._from_parent()` on the subsequent
# parent→child `SpawnSpec` handshake.
from tractor.runtime import _state
_state._current_actor = None
_state._runtime_vars.update({
'_is_root': False,
'_root_mailbox': (None, None),
'_root_addrs': [],
'_registry_addrs': [],
'_debug_mode': False,
})
_actor_child_main(
uid=uid,
loglevel=loglevel,
parent_addr=parent_addr,
infect_asyncio=infect_asyncio,
# NOTE, from the child-side runtime's POV it's
# a regular trio actor — it uses `_trio_main`,
# receives `SpawnSpec` over IPC, etc. The
# `subint_forkserver` name is a property of HOW
# the parent spawned, not of what the child is.
spawn_method='trio',
)
return 0
cancelled_during_spawn: bool = False
proc: _ForkedProc | None = None
ipc_server: _server.Server = actor_nursery._actor.ipc_server
try:
try:
pid: int = await trio.to_thread.run_sync(
partial(
fork_from_worker_thread,
_child_target,
thread_name=(
f'subint-forkserver[{name}]'
),
),
abandon_on_cancel=False,
)
proc = _ForkedProc(pid)
log.runtime(
f'Forked subactor via forkserver\n'
f'(>\n'
f' |_{proc}\n'
)
event, chan = await ipc_server.wait_for_peer(uid)
except trio.Cancelled:
cancelled_during_spawn = True
raise
assert proc is not None
portal = Portal(chan)
actor_nursery._children[uid] = (
subactor,
proc,
portal,
)
sspec = msgtypes.SpawnSpec(
_parent_main_data=subactor._parent_main_data,
enable_modules=subactor.enable_modules,
reg_addrs=subactor.reg_addrs,
bind_addrs=bind_addrs,
_runtime_vars=_runtime_vars,
)
log.runtime(
f'Sending spawn spec to forkserver child\n'
f'{{}}=> {chan.aid.reprol()!r}\n'
f'\n'
f'{pretty_struct.pformat(sspec)}\n'
)
await chan.send(sspec)
curr_actor: Actor = current_actor()
curr_actor._actoruid2nursery[uid] = actor_nursery
task_status.started(portal)
with trio.CancelScope(shield=True):
await actor_nursery._join_procs.wait()
async with trio.open_nursery() as nursery:
if portal in actor_nursery._cancel_after_result_on_exit:
nursery.start_soon(
cancel_on_completion,
portal,
subactor,
errors,
)
# reuse `trio_proc`'s soft-kill dance — `proc`
# is our `_ForkedProc` shim which implements the
# same `.poll()` / `.wait()` / `.kill()` surface
# `soft_kill` expects.
await soft_kill(
proc,
_ForkedProc.wait,
portal,
)
nursery.cancel_scope.cancel()
finally:
# Hard reap: SIGKILL + waitpid. Cheap since we have
# the real OS pid, unlike `subint_proc` which has to
# fuss with `_interpreters.destroy()` races.
if proc is not None and proc.poll() is None:
log.cancel(
f'Hard killing forkserver subactor\n'
f'>x)\n'
f' |_{proc}\n'
)
with trio.CancelScope(shield=True):
proc.kill()
await proc.wait()
if not cancelled_during_spawn:
actor_nursery._children.pop(uid, None)
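
The core mechanism of this module (fork on a dedicated worker thread, hand the pid back over a pipe, then classify the `waitpid()` status) can be tried standalone. What follows is a simplified sketch under stated assumptions, not the patch's implementation: `fork_in_thread()` and `format_exit()` are hypothetical names, the child just exits with a fixed code instead of booting an actor runtime, and the `DeprecationWarning` that Python 3.12+ emits when forking a multi-threaded process is silenced purely for the demo. POSIX only.

```python
import os
import signal
import threading
import warnings


def format_exit(status: int) -> str:
    '''Render a raw `os.waitpid()` status as a short string.'''
    if os.WIFEXITED(status):
        return f'rc={os.WEXITSTATUS(status)}'
    if os.WIFSIGNALED(status):
        return f'signal={signal.Signals(os.WTERMSIG(status)).name}'
    return f'raw_status={status}'


def fork_in_thread(child_rc: int, join_timeout: float = 10.0) -> int:
    '''Fork from a worker thread; return the child's pid.'''
    rfd, wfd = os.pipe()

    def worker() -> None:
        # Silence the multi-threaded-fork warning (demo only).
        with warnings.catch_warnings():
            warnings.simplefilter('ignore', DeprecationWarning)
            pid = os.fork()
        if pid == 0:
            # CHILD: exit at once, skipping interpreter cleanup.
            os._exit(child_rc)
        # PARENT (worker thread): report the pid via the pipe.
        os.write(wfd, pid.to_bytes(8, 'little'))

    thread = threading.Thread(target=worker, name='fork-worker')
    thread.start()
    thread.join(timeout=join_timeout)
    # Closing the write end first turns "worker never wrote"
    # into an EOF on read instead of an indefinite block.
    os.close(wfd)
    try:
        if thread.is_alive():
            raise RuntimeError('fork worker did not return in time')
        data = os.read(rfd, 8)
    finally:
        os.close(rfd)
    if len(data) != 8:
        raise RuntimeError('fork worker exited without reporting a pid')
    return int.from_bytes(data, 'little')


if __name__ == '__main__':
    child_pid = fork_in_thread(child_rc=3)
    _, raw_status = os.waitpid(child_pid, 0)
    print(format_exit(raw_status))
```

The pipe keeps the hand-off independent of any shared Python object, which is the same reason the patch shuttles the pid through file descriptors rather than a closure variable.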