Session-spanning conversation covering the Phase B hardening of the subint spawn backend and an investigation into a proposed subint_fork follow-up, which turned out to be blocked at the CPython level. This log is a narrative capture of the substantive turns (not every message) and references the concrete code and docs the session produced. Per diff-ref mode, the actual code diffs are referenced via git log on each ref rather than duplicated inline.
Narrative of the substantive turns
Py3.13 hang / gate tightening
Diagnosed a reproducible hang of the subint backend under py3.13 (the test_spawning tests wedge after root-actor bringup). Root cause: py3.13’s vintage of the private _interpreters C module has a latent thread/subint-interaction issue whereby _interpreters.exec() silently fails to make progress under tractor’s multi-trio usage pattern, even though a minimal standalone threading.Thread + _interpreters.exec() reproducer works fine on the same Python. Empirically, py3.14 fixes it.
Fix (from this session): tighten the _has_subints gate in tractor.spawn._subint from “private module importable” to “public concurrent.interpreters present” — which is 3.14+ only. This leaves subint_proc() unchanged in behavior (we still call the private _interpreters.create('legacy') etc. under the hood) but refuses to engage on 3.13.
Also tightened the matching gate in tractor.spawn._spawn.try_set_start_method('subint') and revised the corresponding error messages from “3.13+” to “3.14+”, adding a sentence explaining why. The test module’s pytest.importorskip() target was switched from _interpreters to concurrent.interpreters to match.
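A minimal sketch of the tightened gate, assuming a function-style check. has_subints() and require_subints() are hypothetical names, and tractor’s real _has_subints may be a module-level flag with different error wording. It uses importlib.util.find_spec() so the subint modules are probed without being imported.

```python
import importlib.util
import sys


def has_subints() -> bool:
    """Hypothetical stand-in for tractor's _has_subints gate.

    The public concurrent.interpreters module only exists on py3.14+, so its
    presence doubles as the version check; the private _interpreters module
    is what actually creates the legacy-config subints.
    """
    if importlib.util.find_spec('concurrent.interpreters') is None:
        return False  # py3.13 and older: refuse even if _interpreters exists
    return importlib.util.find_spec('_interpreters') is not None


def require_subints(backend: str = 'subint') -> None:
    """Raise with a 3.14+ message, as a try_set_start_method()-style gate might."""
    if not has_subints():
        ver = f'{sys.version_info.major}.{sys.version_info.minor}'
        raise RuntimeError(
            f'The {backend!r} spawn backend requires Python 3.14+ (running {ver}). '
            'py3.13 ships the private _interpreters module, but subint_proc() '
            'wedges under multi-trio usage there.'
        )
```

On py3.13 has_subints() returns False even though _interpreters is importable, which is the whole point of keying on the public module.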
pytest-timeout dep + skipon_spawn_backend marker plumbing
Added pytest-timeout>=2.3 to the testing dep group with an inline comment pointing at the ai/conc-anal/*.md docs. Applied @pytest.mark.timeout(30, method='thread') to the three known-hanging subint tests cataloged in subint_sigint_starvation_issue.md. The method='thread' part is load-bearing: the signal method relies on SIGALRM, which suffers the same GIL-starvation path that drops SIGINT in the class-A hang pattern.
Separately, code-reviewed the user’s newly staged skipon_spawn_backend pytest marker implementation in tractor/_testing/pytest.py. Found four bugs:
- modmark.kwargs.get(reason) called .get() with the variable reason as the dict key instead of the string 'reason', so a user-supplied reason= was never picked up. (The user had already fixed this locally via .get('reason', reason) by the time my review happened; I preserved that fix.)
- The module-level pytestmark branch suppressed per-test marker handling: per-test markers lived in an else: branch, so they were only checked when no module-level pytestmark existed, instead of both scopes being iterated independently.
- mod_pytestmark.mark assumed a single MarkDecorator, which broke on the valid-pytest pytestmark = [mark, mark] list form.
- Typo: pytest.Makr → pytest.Mark.
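Bugs 1 and 3 are easy to reproduce without pytest. The sketch below uses a hypothetical Mark dataclass as a stand-in for pytest.Mark (only .args and .kwargs are modelled); the function names are illustrative, not the reviewed code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Mark:
    """Minimal stand-in for pytest.Mark: only .args and .kwargs are modelled."""
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


def reason_buggy(mark: Mark, reason: str = 'default reason') -> Any:
    # Bug 1: the *variable* reason is used as the dict key, so this looks up
    # mark.kwargs['default reason'] and a user-supplied reason= is never seen.
    return mark.kwargs.get(reason)


def reason_fixed(mark: Mark, reason: str = 'default reason') -> str:
    # Fix: look up the literal key 'reason', falling back to the default.
    return mark.kwargs.get('reason', reason)


def module_marks(pytestmark: Any) -> list:
    # Bug 3: a module's pytestmark may be a single mark OR a list of marks;
    # code that only reads pytestmark.mark breaks on the list form.
    return list(pytestmark) if isinstance(pytestmark, list) else [pytestmark]
```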
Refactored the hook to use item.iter_markers(name=...), which walks function, class, and module scopes uniformly and handles both pytestmark forms natively. The ~30 LOC result is about the same size as the nested conditionals it replaced, and all four bugs dissolved. Also updated the marker help string to reflect the variadic *start_methods + reason= surface.
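A sketch of that single-loop shape, assuming the active backend is exposed as a --spawn-backend CLI option (the flag appears elsewhere in this log, but how tractor’s conftest plumbs it is an assumption). The helper skip_reason_for() and its default reason string are hypothetical; item.iter_markers(), config.addinivalue_line(), config.getoption(), item.add_marker() and pytest.mark.skip are real pytest APIs.

```python
from typing import Any, Iterable, Optional

MARKER = 'skipon_spawn_backend'


def skip_reason_for(markers: Iterable[Any], backend: str) -> Optional[str]:
    """Return a skip reason if any marker lists `backend` among its args.

    `markers` is meant to be item.iter_markers(name=MARKER), which yields
    marks from function, class and module scope (closest first) and already
    flattens both the single-mark and list forms of a module's pytestmark.
    """
    for mark in markers:
        if backend in mark.args:
            return mark.kwargs.get('reason', f'skipped on {backend!r} spawn backend')
    return None


def pytest_configure(config) -> None:
    config.addinivalue_line(
        'markers',
        f'{MARKER}(*start_methods, reason=...): skip the test when the '
        'session runs under any of the listed spawn backends',
    )


def pytest_collection_modifyitems(config, items) -> None:
    import pytest  # deferred so skip_reason_for() can be exercised without pytest

    backend = config.getoption('--spawn-backend')  # assumed CLI option name
    for item in items:
        reason = skip_reason_for(item.iter_markers(name=MARKER), backend)
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Because iter_markers() yields the closest scope first, when several scopes list the same backend the function-level reason= wins over a class- or module-level one.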
subint_fork_proc prototype attempt
User’s hypothesis: the known trio + fork() issues (python-trio/trio#1614) could be sidestepped by using a sub-interpreter purely as a launchpad: os.fork() from a subint that has never imported trio, so the child starts in a trio-free context. The child then execv()s back into python -m tractor._child, and the downstream handshake is identical to trio_proc()’s.
Drafted the prototype at the bottom of tractor/spawn/_subint.py (it was later moved to its own submodule; see below): launchpad-subint creation, a bootstrap code string doing os.fork() + execv(), driver-thread orchestration, and the parent-side ipc_server.wait_for_peer() dance. Registered 'subint_fork' as a new SpawnMethodKey literal, added a case 'subint' | 'subint_fork': feature-gate arm to try_set_start_method(), and added an entry to the _methods dict.
CPython-level block discovered
User tested on py3.14 and saw:
Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
Python runtime state: initialized
Current thread 0x00007f6b71a456c0 [subint-fork-lau] (most recent call first):
File "<script>", line 2 in <module>
<script>:2: DeprecationWarning: This process (pid=802985) is multi-threaded, use of fork() may lead to deadlocks in the child.
Walked the CPython sources (local clone at ~/repos/cpython/):
- Modules/posixmodule.c:728 PyOS_AfterFork_Child(): post-fork child-side cleanup. Calls _PyInterpreterState_DeleteExceptMain(runtime) and does goto fatal_error when that returns an error status. The same function carries the self-acknowledgedly fragile comment // Ideally we could guarantee tstate is running main. (line 753).
- Python/pystate.c:1040 _PyInterpreterState_DeleteExceptMain(): the refusal. A hard PyStatus_ERR("not main interpreter") gate fires when tstate->interp != interpreters->main. Its docstring formally declares the precondition (“If there is a current interpreter state, it must be the main interpreter”), and XXX comments acknowledge further latent issues within.
Definitive answer to “Open Question 1” of the prototype docstring: no, CPython does not support os.fork() from a non-main sub-interpreter. Not because the fork syscall is blocked (it isn’t — the parent returns a valid pid), but because the child cannot survive CPython’s post-fork initialization. This is an enforced invariant, not an incidental limitation.
Revert: move to stub submod + doc the finding
Per user request:
- Reverted the prototype subint_fork_proc body to a NotImplementedError stub and MOVED it to its own submodule, tractor/spawn/_subint_fork.py (keeping _subint.py focused on the working subint_proc backend).
- Updated _spawn.py to import the stub from the new submodule path, and kept 'subint_fork' in SpawnMethodKey + _methods so that --spawn-backend=subint_fork routes to a clean NotImplementedError with a pointer to the analysis doc rather than an “invalid backend” error.
- Wrote ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md with the full annotated CPython walkthrough plus an upstream-report draft for the CPython issue tracker. The draft has a two-tier ask: ideally “make it work” (a pre-fork tstate-swap hook or a DeleteExceptFor(interp) variant), minimally “give us a clean RuntimeError in the parent instead of a Fatal Python error silently aborting the child”.
Design discussion — main-interp-thread forkserver workaround
User proposed: set up a “subint forking server” that fork()s on behalf of subint callers. Core insight: the CPython gate is on tstate->interp, not thread identity, so any thread whose tstate is main-interp can fork cleanly. A worker thread attached to main-interp (never entering a subint) satisfies the precondition.
Structurally, this is mp.forkserver (which tractor already has as mp_forkserver) but in-process: instead of a separate Python subproc as the fork server, we’d put the forkserver on a thread in the tractor parent process. Pros: faster spawn (no IPC marshalling to an external server and no separate Python startup), and already-imported modules are inherited for free. Cons: less crash isolation (a forkserver failure takes down the whole process).
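A minimal sketch of the in-process forkserver idea, assuming a plain queue.Queue for requests and one concurrent.futures.Future per reply. ThreadForkServer and its methods are hypothetical, not tractor code. The serving thread is an ordinary threading.Thread of the main interpreter that never enters a subint, which is the tstate->interp precondition described above. Here the callers are plain main-interpreter threads; subint-hosted callers could not share a queue.Queue and would need a cross-interpreter channel, which this sketch does not attempt.

```python
import os
import queue
import sys
import threading
from concurrent.futures import Future


class ThreadForkServer:
    """Hypothetical forkserver hosted on a main-interpreter thread."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        self._thread = threading.Thread(
            target=self._serve, name='forkserver', daemon=True)
        self._thread.start()

    def _serve(self) -> None:
        # None is the shutdown sentinel; anything else is (argv, future).
        while (req := self._requests.get()) is not None:
            argv, fut = req
            try:
                pid = os.fork()
            except OSError as exc:
                fut.set_exception(exc)
                continue
            if pid == 0:
                # Child: replace the process image immediately and never
                # return into the server loop, even if execv() fails.
                try:
                    os.execv(argv[0], argv)
                finally:
                    os._exit(127)
            fut.set_result(pid)

    def spawn(self, argv: list[str]) -> int:
        """Ask the server thread to fork+exec `argv`; return the child pid."""
        fut: Future = Future()
        self._requests.put((argv, fut))
        return fut.result()

    def close(self) -> None:
        self._requests.put(None)
        self._thread.join()
```

On py3.12+ CPython emits the same multi-threaded-fork DeprecationWarning seen in the log; exec’ing straight away in the child keeps the window for inherited-lock deadlocks small.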
Required tractor-side refactor: move the root actor’s trio.run() off the main interpreter’s main thread (so the main thread can run the forkserver loop). Nontrivial: roughly the same magnitude of work as “Phase C”.
The design would also not fully resolve the class-A GIL-starvation issue, because child actors’ trio still runs inside subints (legacy config, pending msgspec’s PEP 684 support). It would, however, mitigate SIGINT starvation specifically if signal handling moves to the forkserver thread.
Recommended pre-commitment: a standalone CPython-only smoke test validating the four assumptions the arch rests on, before any tractor-side work.
Smoke-test script drafted
Wrote ai/conc-anal/subint_fork_from_main_thread_smoketest.py: argparse-driven, with four scenarios: control_subint_thread_fork (reproduces the known-broken case), main_thread_fork (the baseline), worker_thread_fork (the architectural assertion), and full_architecture (end-to-end, with trio running in a subint inside the forked child). No tractor imports; pure CPython + _interpreters + trio. Bails cleanly on py<3.14. Pass/fail banners per scenario.
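A trimmed-down sketch of that harness shape, covering only the two scenarios that need nothing beyond the stdlib (main_thread_fork and worker_thread_fork). The control case aborts the child by design and full_architecture needs trio, so both are omitted. The fork-then-execv helper and the PASS/FAIL output format are assumptions, not the script’s actual code.

```python
from __future__ import annotations

import argparse
import os
import sys
import threading
from typing import Callable


def _fork_exec_and_wait() -> bool:
    """fork(), execv() a trivial interpreter in the child, reap it, report success."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, '-c', 'pass'])
        finally:
            os._exit(127)  # only reached if execv() itself failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status) == 0


def main_thread_fork() -> bool:
    """Baseline: fork from the main thread of the main interpreter."""
    return _fork_exec_and_wait()


def worker_thread_fork() -> bool:
    """Architectural assertion: a non-main thread of the main interpreter can fork."""
    results: list[bool] = []
    t = threading.Thread(target=lambda: results.append(_fork_exec_and_wait()))
    t.start()
    t.join()
    return results == [True]


SCENARIOS: dict[str, Callable[[], bool]] = {
    'main_thread_fork': main_thread_fork,
    'worker_thread_fork': worker_thread_fork,
}


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description='subint-fork smoke tests (sketch)')
    parser.add_argument('scenario', nargs='?', default='all',
                        choices=[*SCENARIOS, 'all'])
    args = parser.parse_args(argv)  # parse first so --help works on any version
    if sys.version_info < (3, 14):
        print('py3.14+ required (concurrent.interpreters); bailing', file=sys.stderr)
        return 2
    names = list(SCENARIOS) if args.scenario == 'all' else [args.scenario]
    ok = True
    for name in names:
        passed = SCENARIOS[name]()
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
        ok = ok and passed
    return 0 if ok else 1
```

Saved as a script, main() would be wired up through the usual if __name__ == '__main__': sys.exit(main()) guard and run with a scenario name as its only argument.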
User will validate on their py3.14 env next.
Per-code-artifact provenance
tractor/spawn/_subint_fork.py (new submod)
git show 797f57c -- tractor/spawn/_subint_fork.py
NotImplementedError stub for the subint-fork backend. Module docstring + fn docstring explain the attempt, the CPython block, and why the stub is kept in-tree. No runtime behavior beyond raising with a pointer at the conc-anal doc.
tractor/spawn/_spawn.py (modified)
git log 26fb820..HEAD -- tractor/spawn/_spawn.py
- Added 'subint_fork' to the SpawnMethodKey literal with a block comment explaining the CPython-level block.
- Generalized the case 'subint': arm to case 'subint' | 'subint_fork': since both use the same py3.14+ gate.
- Registered subint_fork_proc in _methods with a pointer-comment at the analysis doc.
tractor/spawn/_subint.py (modified across session)
git log 26fb820..HEAD -- tractor/spawn/_subint.py
- Tightened the _has_subints gate: it dual-requires the public concurrent.interpreters and the private _interpreters (testing for py3.14-or-newer via public-API presence, then using the private module for legacy-config subints because msgspec still blocks the public isolated mode per jcrist/msgspec#563).
- Updated the module docstring, the subint_proc() docstring, and the gate-error messages to reflect the 3.14+ requirement and the reason (py3.13 wedges under multi-trio usage even though the private module exists there).
tractor/_testing/pytest.py (modified)
git log 26fb820..HEAD -- tractor/_testing/pytest.py
- New skipon_spawn_backend(*start_methods, reason=...) pytest marker, expanded into pytest.mark.skip(reason=...) at collection time via pytest_collection_modifyitems().
- The implementation uses item.iter_markers(name=...), which walks function + class + module scopes uniformly and handles both the pytestmark = <single Mark> and pytestmark = [mark, ...] forms natively. It is a ~30-LOC single-loop refactor replacing a prior nested conditional that had four bugs (see the review narrative above).
- Added pytest.Config/pytest.Function/pytest.FixtureRequest type annotations on fixture signatures while touching the file.
pyproject.toml (modified)
git log 26fb820..HEAD -- pyproject.toml
Added pytest-timeout>=2.3 to testing dep group with comment pointing at the ai/conc-anal/ docs.
tests/discovery/test_registrar.py, tests/test_subint_cancellation.py, tests/test_cancellation.py (modified)
git log 26fb820..HEAD -- tests/
Applied @pytest.mark.timeout(30, method='thread') on the known-hanging subint tests. Extended comments to cross-reference the ai/conc-anal/*.md docs. method='thread' is documented inline as load-bearing (signal-method SIGALRM suffers the same GIL-starvation path that drops SIGINT).
ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md (new)
git show 797f57c -- ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md
Third sibling doc under conc-anal/. Structure: TL;DR, context (“what we tried”), symptom (the user’s exact Fatal Python error output), CPython source walkthrough with excerpted snippets from posixmodule.c + pystate.c, chain summary, definitive answer to Open Question 1, ## Upstream-report draft (for CPython issue tracker) section with a two-tier ask, references.
ai/conc-anal/subint_fork_from_main_thread_smoketest.py (new, THIS turn)
Zero-tractor-import smoke test for the proposed workaround architecture. Four argparse-driven scenarios covering the control case + baseline + arch-critical case + end-to-end. Pass/fail banners per scenario; clean --help output; py3.13 early-exit.
Non-code output (verbatim)
The strace signature that kicked off the CPython walkthrough
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 139801964688928
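Reading the strace: the "\2" in the write() line is the single byte CPython’s C-level signal handler writes to the fd registered with signal.set_wakeup_fd(), namely the signal number itself (SIGINT is signal 2). That fd 16 is trio’s wakeup socket is an assumption; the log does not say. EAGAIN on that non-blocking write means its buffer was already full, i.e. nothing had drained earlier wakeup bytes, which is consistent with the GIL-starvation pattern cataloged above. The sketch below (wakeup_byte_for() is a hypothetical name) only demonstrates the byte-per-signal mechanism on POSIX, and must run on the main thread.

```python
import signal
import socket


def wakeup_byte_for(signum: int = signal.SIGINT) -> bytes:
    """Deliver `signum` with a wakeup fd installed; return the byte written to it."""
    rsock, wsock = socket.socketpair()
    wsock.setblocking(False)  # set_wakeup_fd() rejects blocking fds on POSIX
    rsock.settimeout(1.0)
    # Install a no-op Python handler so SIGINT doesn't raise KeyboardInterrupt.
    old_handler = signal.signal(signum, lambda sig, frame: None)
    old_fd = signal.set_wakeup_fd(wsock.fileno())
    try:
        signal.raise_signal(signum)  # C-level handler writes the byte right away
        return rsock.recv(1)
    finally:
        signal.set_wakeup_fd(old_fd)
        signal.signal(signum, old_handler)
        rsock.close()
        wsock.close()
```

If the reading side never calls recv(), those bytes accumulate until the socket buffer fills, after which further writes fail with EAGAIN exactly as in the trace.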
Key user quotes framing the direction
ok actually we get this [fatal error] … see if you can take a look at what’s going on, in particular wrt to cpython’s sources. pretty sure there’s a local copy at ~/repos/cpython/
(Drove the CPython walkthrough that produced the definitive refusal chain.)
is there any reason we can’t just sidestep this “must fork from main thread in main subint” issue by simply ensuring a “subint forking server” is always setup prior to invoking trio in a non-main-thread subint …
(Drove the main-interp-thread-forkserver architectural discussion + smoke-test script design.)
CPython source tags for quick jump-back
Modules/posixmodule.c:728 PyOS_AfterFork_Child()
Modules/posixmodule.c:753 // Ideally we could guarantee tstate is running main.
Modules/posixmodule.c:778 status = _PyInterpreterState_DeleteExceptMain(runtime);
Python/pystate.c:1040 _PyInterpreterState_DeleteExceptMain()
Python/pystate.c:1044-1047 tstate->interp != main → PyStatus_ERR("not main interpreter")