The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key.
off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:
- `threading.current_thread() is threading.main_thread()`
IS True post-fork — CPython re-designates the
fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread
But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.
Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
(+385 LOC): full writeup — TL;DR, symptom reproducer,
the "intentional cancel path" the bug defeats,
diagnostic evidence (`faulthandler` output +
`getsignal` probe), ruled-out hypotheses
(non-main-thread issue, wakeup-fd inheritance,
KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
`reason` + test docstring rewritten to match the
refined understanding — old wording blamed the
non-main-thread path, new wording points at the
`epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
`child_sigint='trio'` bullet updated: now notes trio's
handler is already correctly installed, so the flag may
end up a no-op / doc-only mode once the real root cause
is fixed
Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.
Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
`_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
Docstring block enumerates both:
- `'ipc'` (default, currently the only implemented mode):
no child-side SIGINT handler — `trio.run()` is on the
fork-inherited non-main thread where
`signal.set_wakeup_fd()` is main-thread-only, so
cancellation flows exclusively via the parent's
`Portal.cancel_actor()` IPC path. Known gap: orphan
children don't respond to SIGINT
(`test_orphaned_subactor_sigint_cleanup_DRAFT`)
- `'trio'` (scaffolded only): manual SIGINT → trio-cancel
bridge in the fork-child prelude so external Ctrl-C
reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
`proc_kwargs` (matches how `trio_proc` threads config to
`open_process`, keeps `start_actor(proc_kwargs=...)` as
the ergonomic entry point); validates membership + raises
`NotImplementedError` for `'trio'` at the backend-entry
guard
- `_child_target` grows a `match child_sigint:` arm that
slots in the future `'trio'` impl without restructuring
— today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
pointing at this config + the xfail'd orphan-SIGINT test
No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.
Deats,
- harness runs the failure-mode sequence out-of-process:
1. harness subprocess runs a fresh Python script
that calls `try_set_start_method('subint_forkserver')`
then opens a root actor + one `sleep_forever` subactor
2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
off harness `stdout` to confirm IPC handshake
completed
3. `SIGKILL` the parent, `proc.wait()` to reap the
zombie (otherwise `os.kill(pid, 0)` keeps reporting
it alive)
4. assert the child survived the parent-reap (i.e. was
actually orphaned, not reaped too) before moving on
5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
bytes-buffer to carry partial lines across calls,
`_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
orphan-reparenting semantics don't generalize to
other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
pids + `proc.kill()` + bounded `proc.wait()` to avoid
leaking orphans across the session
Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.
Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
`_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
live `_runtime_vars` is now initialised from
`dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
kwarg. When True, returns a fresh copy of
`_RUNTIME_VARS_DEFAULTS` instead of the live dict —
still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
atomic replacement of the live dict's contents via
`.clear()` + `.update()`, so existing references to the
same dict object remain valid. Accepts either the
historical dict form or the `RuntimeVars` struct
Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
`set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
`_trio_main` unconditionally overwrites it downstream,
so no explicit reset needed (noted in the XXX comment)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode
"this process is the root actor" — `_runtime_vars`'s
`_is_root=True`, pre-populated `_root_mailbox` +
`_registry_addrs`, and the parent's `_current_actor`
singleton.
A fresh `exec`-based child starts with those globals at
their module-level defaults (all falsey/empty). The
forkserver child needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes
the `is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.
Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with
a cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.
Found-via: `test_subint_forkserver_spawn_basic` hitting
the `enable_modules` assert on child-side runtime boot.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.
Deats,
- new `subint_forkserver_proc()` spawn target in
`_subint_forkserver`:
- mirrors `trio_proc()`'s supervision model — real OS
subprocess so `Portal.cancel_actor()` + `soft_kill()`
on graceful teardown, `os.kill(SIGKILL)` on hard-reap
(no `_interpreters.destroy()` race to fuss over bc the
child lives in its own process)
- only real diff from `trio_proc` is the spawn mechanism:
fork from a main-interp worker thread via
`fork_from_worker_thread()` (off-loaded to trio's
thread pool) instead of `trio.lowlevel.open_process()`
- child-side `_child_target` closure runs
`tractor._child._actor_child_main()` with
`spawn_method='trio'` — the child is a regular trio
actor, "subint_forkserver" names how the parent
spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
shim around a raw OS pid: `.poll()` via
`waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
cache thread, `.kill()` via `SIGKILL`, `.returncode`
cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
are `None` (fork-w/o-exec inherits parent FDs; we don't
marshal them) which matches `soft_kill()`'s `is not None`
guards
Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.
Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.
Deats,
- three reasons enumerated for the current constraints:
- class-A GIL-starvation — **fixed** by isolated mode:
subints don't share main's GIL so abandoned-thread
contention disappears
- destroy race / tstate-recycling from `subint_proc` —
**unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
cross-mode, so isolated doesn't obviously fix it;
needs empirical retest on py3.14 + isolated API
- fork-from-main-interp-tstate (the CPython-level
`_PyInterpreterState_DeleteExceptMain` gate) — the
narrow reason for using a dedicated thread; **probably
fixed** IF the destroy-race also resolves (bc trio's
cache threads never drove subints → clean main-interp
tstate)
- TL;DR table of which constraints unwind under each
resolution branch
- four-step audit plan for when `msgspec#563` lands:
- flip `_subint` to isolated mode
- empirical destroy-race retest
- audit `_subint_forkserver.py` — drop `non_trio`
qualifier / maybe inline primitives
- doc fallout — close the three `subint_*_issue.md`
siblings w/ post-mortem notes
Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.
Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
- `fork_from_worker_thread(child_target, thread_name)` —
spawn a main-interp `threading.Thread`, call `os.fork()`
from it, shuttle the child pid back to main via a pipe
- `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
create a fresh subint + drive `_interpreters.exec()` on
a dedicated worker thread running the `bootstrap` str
(typically imports `trio`, defines an async entry, calls
`trio.run()`)
- `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
pass/fail classification reusable from harness AND the
eventual real spawn path
- feature-gated py3.14+ via the public
`concurrent.interpreters` presence check; matches the gate
in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
(cross-refs `_subint_fork` stub + the two `conc-anal/`
docs) and status: EXPERIMENTAL, not yet registered in
`_spawn._methods`
Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).
Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.
Deats,
- four scenarios via `--scenario`:
- `control_subint_thread_fork` — the KNOWN-BROKEN case as a
harness sanity; if the child DOESN'T abort, our analysis
is wrong
- `main_thread_fork` — baseline sanity, must always succeed
- `worker_thread_fork` — architectural assertion: regular
`threading.Thread` attached to main interp calls
`os.fork()`; child should survive post-fork cleanup
- `full_architecture` — end-to-end: fork from a main-interp
worker thread, then in child create a subint driving a
worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
"child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
`os.waitpid()` of the parent + per-scenario status prints to
observe the child's fate
Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).
Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical finding: the WIP `subint_fork_proc` scaffold
landed in `cf0e3e6f` does *not* work on current CPython.
The `fork()` syscall succeeds in the parent, but the
CHILD aborts immediately during
`PyOS_AfterFork_Child()` →
`_PyInterpreterState_DeleteExceptMain()`, which gates
on the current tstate belonging to the main interp —
the child dies with `Fatal Python error: not main
interpreter`.
CPython devs acknowledge the fragility with an in-source
comment (`// Ideally we could guarantee tstate is running
main.`) but expose no user-facing hook to satisfy the
precondition — so the strategy is structurally dead until
upstream changes.
Rather than delete the scaffold, reshape it into a
documented dead-end so the next person with this idea
lands on the reason rather than rediscovering the same
CPython-level refusal.
Deats,
- Move `subint_fork_proc` out of `tractor.spawn._subint`
into a new `tractor.spawn._subint_fork` dedicated
module (153 LOC). Module + fn docstrings now describe
the blockage directly; the fn body is trimmed to a
`NotImplementedError` pointing at the analysis doc —
no more dead-code `bootstrap` sketch bloating
`_subint.py`.
- `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey`
+ the `_methods` dispatch so
`--spawn-backend=subint_fork` routes to a clean
`NotImplementedError` rather than "invalid backend";
comment calls out the blockage. Collapse the duplicate
py3.14 feature-gate in `try_set_start_method()` into a
combined `case 'subint' | 'subint_fork':` arm.
- New 337-line analysis:
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Annotated walkthrough from the user-visible fatal
error down to the specific `Modules/posixmodule.c` +
`Python/pystate.c` source lines enforcing the refusal,
plus an upstream-report draft.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Experimental third spawn backend: use a fresh
sub-interpreter purely as a trio-free launchpad from
which to `os.fork()` + exec back into
`python -m tractor._child`. Per issue #379's
"fork()-workaround/hacks" thread.
Intent is to sidestep both,
- the trio+fork hazards hitting `trio_proc` (python- trio/trio#1614 et
al.), since the forking interp is guaranteed trio-free.
- the shared-GIL abandoned-thread hazards hitting `subint_proc`
(`ai/conc-anal/subint_sigint_starvation_issue.md`), since we don't
*stay* in the subint — it only lives long enough to call `os.fork()`
Downstream of the fork+exec, all the existing `trio_proc` plumbing is
reused verbatim: `ipc_server.wait_for_peer()`, `SpawnSpec`, `Portal`
yield, soft-kill.
Status: NOT wired up beyond scaffolding. The fn raises
`NotImplementedError` immediately; the `bootstrap` fork/exec string
builder and the `# TODO: orchestrate driver thread` block are kept
in-tree as deliberate dead code so the next iteration starts from
a concrete shape rather than a blank page.
Docstring calls out three open questions that need
empirical validation before wiring this up:
1. Does CPython permit `os.fork()` from a non-main
legacy subint?
2. Can the child stay fork-without-exec and
`trio.run()` directly from within the launchpad
subint?
3. How do `signal.set_wakeup_fd()` handlers and other
process-global state interact when the forking
thread is inside a subint?
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.
Deats,
- Module-level `pytestmark` on full-file-hanging suites:
- `tests/test_cancellation.py`
- `tests/test_inter_peer_cancellation.py`
- `tests/test_pubsub.py`
- `tests/test_shm.py`
- Per-test decorator where only one test in the file
hangs:
- `tests/discovery/test_registrar.py
::test_stale_entry_is_deleted` — replaces the
inline `if start_method == 'subint': pytest.skip`
branch with a declarative skip.
- `tests/test_subint_cancellation.py
::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
place as breadcrumbs for later finer-grained unskips.
Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
(`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
`tests/conftest.py`, `tests/devx/conftest.py`, and
`tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
`@tractor_test(...)` on `test_cancel_infinite_streamer`
and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
`test_cancel_via_SIGINT`; use `start_method` in the
`'mp' in ...` check instead.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.
Deats,
- `pytest_configure()` registers the marker via
`addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
each collected item with `item.iter_markers(
name='skipon_spawn_backend')`, checks whether the
active `--spawn-backend` appears in `mark.args`, and
if so injects a concrete `pytest.mark.skip(
reason=...)`. `iter_markers()` makes the decorator
work at function, class, or module (`pytestmark =
[...]`) scope transparently.
- First matching mark wins; default reason is
`f'Borked on --spawn-backend={backend!r}'` if the
caller doesn't supply one.
Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.
Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
the grandchild persists in its abandoned driver thread. Often only
manifests under full-suite runs (earlier tests seed the
abandoned-thread pool).
- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
all go through teardown on the multierror. Bounded hard-kills run in
parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
abandoned driver threads simultaneously alive, an extreme pressure
multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
(GIL round-robin IS giving main brief slices) before finally
saturating with `EAGAIN`.
Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add a hard process-level wall-clock bound on the two
known-hanging subint-backend tests so an unattended
suite run can't wedge indefinitely in either of the
hang classes doc'd in `ai/conc-anal/`.
Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
`@pytest.mark.timeout(3, method='thread')`. The
`method='thread'` choice is deliberate —
`method='signal'` routes via `SIGALRM` which is
starved by the same GIL-hostage path that drops
`SIGINT` (see `subint_sigint_starvation_issue.md`),
so it'd never actually fire in the starvation case.
- `test_subint_non_checkpointing_child`: same
decorator, same reasoning (defense-in-depth over
the inner `trio.fail_after(15)`).
At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.
`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.
Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.
Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
+ shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
→ `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
`msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
an orphaned IPC channel after subint teardown — no clean EOF delivered
to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
to fix. Candidate fix: explicit parent-side channel abort in
`subint_proc`'s hard-kill teardown
Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
`trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
captures a stack dump. Kept un- skipped so the dump file is
inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
docstring with a "KNOWN ISSUE" block pointing at the analysis
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lock in the escape-hatch machinery added to `tractor.spawn._subint`
during the Phase B.2/B.3 bringup (issue #379) so future stdlib
regressions or our own refactors don't silently re-introduce the
mid-suite hangs.
Deats,
- `test_subint_happy_teardown`: baseline — spawn a subactor, one portal
RPC, clean teardown. If this breaks, something's wrong unrelated to
the hard-kill shields.
- `test_subint_non_checkpointing_child`: cancel a subactor stuck in
a non-checkpointing Python loop (`threading.Event.wait()` releases the
GIL but never inserts a trio checkpoint). Validates the bounded-shield
+ daemon-driver-thread combo abandons the thread after
`_HARD_KILL_TIMEOUT`.
Every test is wrapped in `trio.fail_after()` for a deterministic
per-test wall-clock ceiling (an unbounded audit would defeat itself) and
arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump
— pytest's stderr capture swallows `faulthandler` output by default.
Gated via `pytest.importorskip('concurrent.interpreters')` and
a module-level skip when `--spawn-backend` isn't `'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The private `_interpreters` C module ships since 3.13, but that vintage
wedges under our `threading.Thread` + multi-trio usage pattern
—> `_interpreters.exec()` silently never makes progress. 3.14 fixes it.
So gate on the presence of the public `concurrent.interpreters` wrapper
(3.14+ only) even tho we still call into the private module at runtime.
Deats,
- `try_set_start_method('subint')` error msg + `_subint` module
docstring/comments rewritten to document the 3.14 floor and why 3.13
can't work.
- `_subint._has_subints` gate now imports `concurrent.interpreters` (not
`_interpreters`) as the version sentinel.
Also, reshuffle `pyproject.toml` deps into
per-python-version `[tool.uv.dependency-groups]`:
- `subints` group: `msgspec>=0.21.0`, py>=3.14
- `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14
- `sync_pause` group: `greenback`, py>=3.13,<3.14
(was in `devx`; moved out bc no 3.14 yet)
Bump top-level `msgspec>=0.20.0` too.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
there" session has them on the shelf instead of reinventing from
scratch.
Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
`faulthandler.dump_traceback_later()`. Critical gotcha baked in:
dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
capture silently eats the output and you can spend an hour
convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
logging per-block `(threading.active_count(),
len(_interpreters.list_all()))` deltas; quickly rules out
leak-accumulation theories when a suite progressively worsens (if
counts don't grow, it's not a leak, look for a race on shared
cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
in by importing into a `conftest.py`. Kept as a factory (not a
bare fixture) so callers own `autouse` / `writer` wiring
Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
commit `26fb820`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Unbounded `trio.CancelScope(shield=True)` at the
soft-kill and thread-join sites can wedge the parent
trio loop indefinitely when a stuck subint ignores
portal-cancel (e.g. bc the IPC channel is already
broken).
Deats,
- add `_HARD_KILL_TIMEOUT` (3s) module-level const
- wrap both shield sites with
`trio.move_on_after()` so we abandon a stuck
subint after the deadline
- flip driver thread to `daemon=True` so proc-exit
also isn't blocked by a wedged subint
- pass `abandon_on_cancel=True` to
`trio.to_thread.run_sync(driver_thread.join)`
— load-bearing for `move_on_after` to actually
fire
- log warnings when either timeout triggers
- improve `InterpreterError` log msg to explain
the abandoned-thread scenario
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` session that produced
the `_subint.py` dedicated-thread fix (`26fb8206`).
Substantive bc the patch was entirely AI-generated;
raw log also preserves the CPython-internals
research informing Phase B.3 hard-kill work.
Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on
a cached worker thread — and when that thread is returned to the
cache after the subint's `trio.run()` exits, CPython still keeps
the subint's tstate attached to the (now idle) worker. Result: the
teardown `_interpreters.destroy(interp_id)` in the `finally` block
can block the parent's trio loop indefinitely, waiting for a tstate
release that only happens when the worker either picks up a new job
or exits.
Manifested as intermittent mid-suite hangs under
`--spawn-backend=subint` — caught by a
`faulthandler.dump_traceback_later()` showing the main thread stuck
in `_interpreters.destroy()` at `_subint.py:293` with only an idle
trio-cache worker as the other live thread.
Deats,
- drive the subint on a plain `threading.Thread` (not
`trio.to_thread`) so the OS thread truly exits after
`_interpreters.exec()` returns, releasing tstate and unblocking
destroy
- signal `subint_exited.set()` back to the parent trio loop from
the driver thread via `trio.from_thread.run_sync(...,
trio_token=...)` — capture the token at `subint_proc` entry
- swallow `trio.RunFinishedError` in that signal path for the case
where parent trio has already exited (proc teardown)
- in the teardown `finally`, off-load the sync
`driver_thread.join()` to `trio.to_thread.run_sync` (cache thread
w/ no subint tstate → safe) so we actually wait for the driver to
exit before `_interpreters.destroy()`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Expand the comment block above the `_interpreters`
import explaining *why* we use the private C mod
over `concurrent.interpreters`: the public API only
exposes PEP 734's `'isolated'` config which breaks
`msgspec` (missing PEP 684 slot). Add reference
links to PEP 734, PEP 684, cpython sources, and
the msgspec upstream tracker (jcrist/msgspec#563).
Also,
- update error msgs in both `_spawn.py` and
`_subint.py` to say "3.13+" (matching the actual
`_interpreters` availability) instead of "3.14+".
- tweak the mod docstring to reflect py3.13+
availability via the private C module.
Review: PR #444 (copilot-pull-request-reviewer)
https://github.com/goodboy/tractor/pull/444
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace the B.1 scaffold stub w/ a working spawn
flow driving PEP 734 sub-interpreters on dedicated
OS threads.
Deats,
- use private `_interpreters` C mod (not the public
`concurrent.interpreters` API) to get `'legacy'`
subint config — avoids PEP 684 C-ext compat
issues w/ `msgspec` and other deps missing the
`Py_mod_multiple_interpreters` slot
- bootstrap subint via code-string calling new
`_actor_child_main()` from `_child.py` (shared
entry for both CLI and subint backends)
- drive subint lifetime on an OS thread using
`trio.to_thread.run_sync(_interpreters.exec, ..)`
- full supervision lifecycle mirrors `trio_proc`:
`ipc_server.wait_for_peer()` → send `SpawnSpec`
→ yield `Portal` via `task_status.started()`
- graceful shutdown awaits the subint's inner
`trio.run()` completing; cancel path sends
`portal.cancel_actor()` then waits for thread
join before `_interpreters.destroy()`
Also,
- extract `_actor_child_main()` from `_child.py`
`__main__` block as callable entry shape bc the
subint needs it for code-string bootstrap
- add `"subint"` to the `_runtime.py` spawn-method
check so child accepts `SpawnSpec` over IPC
Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
wheeled yet
* on nixos the sdist build was failing due to lack of `g++` which
i don't care to figure out rn since we don't need `.devx` stuff
immediately for this subints prototype.
* [ ] we still need to adjust any dependent suites to skip.
Adjust `test_ringbuf` to skip on import failure.
Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
Land the scaffolding for a future sub-interpreter (PEP 734
`concurrent.interpreters`) actor spawn backend per issue #379. The
spawn flow itself is not yet implemented; `subint_proc()` raises a
placeholder `NotImplementedError` pointing at the tracking issue —
this commit only wires up the registry, the py-version gate, and
the harness.
Deats,
- bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and
list the `3.14` classifier — the new stdlib
`concurrent.interpreters` module only ships on 3.14
- extend `SpawnMethodKey = Literal[..., 'subint']`
- `try_set_start_method('subint')` grows a new `match` arm that
feature-detects the stdlib module and raises `RuntimeError` with
a clear banner on py<3.14
- `_methods` registers the new `subint_proc()` via the same
bottom-of-module late-import pattern used for `._trio` / `._mp`
Also,
- new `tractor/spawn/_subint.py` — top-level `try: from concurrent
import interpreters` guards `_has_subints: bool`; `subint_proc()`
signature mirrors `trio_proc`/`mp_proc` so the Phase B.2 impl can
drop in without touching the registry
- re-add `import sys` to `_spawn.py` (needed for the py-version msg
in the gate-error)
- `_testing.pytest.pytest_configure` wraps `try_set_start_method()`
in a `pytest.UsageError` handler so `--spawn-backend=subint` on
py<3.14 prints a clean banner instead of a traceback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rework section 3 from a worktree-only check into a
structured 3-step flow: detect active venv, interpret
results (Case A: active, B: none, C: worktree), then
run import + collection checks.
Deats,
- Case B prompts via `AskUserQuestion` when no venv
is detected, offering `uv sync` or manual activate
- add `uv run` fallback section for envs where venv
activation isn't practical
- new allowed-tools: `uv run python`, `uv run pytest`,
`uv pip show`, `AskUserQuestion`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New "Inspect last failures" section reads the pytest
`lastfailed` cache JSON directly — instant, no
collection overhead, and filters to `tests/`-prefixed
entries to avoid stale junk paths.
Also,
- add `jq` tool permission for `.pytest_cache/` files
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Group `.claude/` ignores per-skill instead of a
flat list: `ai.skillz` symlinks, `/open-wkt`,
`/code-review-changes`, `/pr-msg`, `/commit-msg`.
Add missing symlink entries (`yt-url-lookup` ->
`resolve-conflicts`, `inter-skill-review`). Drop
stale `Claude worktrees` section (already covered
by `.claude/wkts/`).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop unused `TYPE_CHECKING` imports (`Channel`,
`_server`), remove commented-out `import os` in
`_entry.py`, and use `get_runtime_vars()` accessor
instead of bare `_runtime_vars` in `_trio.py`.
Also,
- freshen `__init__.py` layout docstring for the
new per-backend submod structure
- update `_spawn.py` + `_trio.py` module docstrings
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Split the monolithic `spawn._spawn` into a slim
"core" + per-backend submodules so a future
`._subint` backend (per issue #379) can drop in
without piling more onto `_spawn.py`.
`._spawn` retains the cross-backend supervisor
machinery: `SpawnMethodKey`, `_methods` registry,
`_spawn_method`/`_ctx` state, `try_set_start_method()`,
the `new_proc()` dispatcher, and the shared helpers
`exhaust_portal()`, `cancel_on_completion()`,
`hard_kill()`, `soft_kill()`, `proc_waiter()`.
Deats,
- mv `trio_proc()` → new `spawn._trio`
- mv `mp_proc()` → new `spawn._mp`, reads `_ctx` and
`_spawn_method` via `from . import _spawn` for
late binding bc both get mutated by
`try_set_start_method()`
- `_methods` wires up the new submods via late
bottom-of-module imports to side-step circular
dep (both backend mods pull shared helpers from
`._spawn`)
- prune now-unused imports from `_spawn.py` — `sys`,
`is_root_process`, `current_actor`,
`is_main_process`, `_mp_main`, `ActorFailure`,
`pretty_struct`, `_pformat`
Also,
- `_testing.pytest.pytest_generate_tests()` now
drives the valid-backend set from
`typing.get_args(SpawnMethodKey)` so adding a
new backend (e.g. `'subint'`) doesn't require
touching the harness
- refresh `spawn/__init__.py` docstring for the
new layout
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` design session that produced the phased plan
(A: modularize `_spawn`, B: `_subint` backend, C: harness) and concrete
Phase A file-split for #379. Substantive bc the plan directly drives
upcoming impl.
Prompt-IO: ai/prompt-io/claude/20260417T034918Z_9703210_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Import and apply `cpu_scaling_factor()` from
`conftest`; bump base from 3.6 -> 4 and multiply
through so CI boxes with slow CPUs don't flake.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code