tractor/ai/prompt-io/claude/20260422T200723Z_797f57c_pr...


Session-spanning conversation covering the Phase B hardening of the subint spawn-backend and an investigation into a proposed subint_fork follow-up which turned out to be blocked at the CPython level. This log is a narrative capture of the substantive turns (not every message) and references the concrete code + docs the session produced. Per diff-ref mode the actual code diffs are pointed at via git log on each ref rather than duplicated inline.

Narrative of the substantive turns

Py3.13 hang / gate tightening

Diagnosed a reproducible hang of the subint backend under py3.13 (test_spawning tests wedge after root-actor bringup). Root cause: py3.13's vintage of the private _interpreters C module has a latent thread/subint-interaction issue in which _interpreters.exec() silently fails to progress under tractor's multi-trio usage pattern, even though a minimal standalone threading.Thread + _interpreters.exec() reproducer works fine on the same Python. Empirically, py3.14 fixes it.

Fix (from this session): tighten the _has_subints gate in tractor.spawn._subint from “private module importable” to “public concurrent.interpreters present” — which is 3.14+ only. This leaves subint_proc() unchanged in behavior (we still call the private _interpreters.create('legacy') etc. under the hood) but refuses to engage on 3.13.

Also tightened the matching gate in tractor.spawn._spawn.try_set_start_method('subint') and rev'd the corresponding error messages from “3.13+” to “3.14+” with a sentence explaining why. The test-module pytest.importorskip target switched from _interpreters to concurrent.interpreters to match.
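The tightened gate can be sketched as below. This is a minimal illustration, not the code in tractor.spawn._subint: the function names has_subints() and require_subints() and the error wording are hypothetical, and only the two module names come from the session.

```python
import importlib.util
import sys


def has_subints() -> bool:
    """Dual gate: public API present (3.14+) AND private module present."""
    # find_spec() locates a module without executing it; for a dotted
    # name it imports only the parent package (`concurrent`, stdlib).
    public = importlib.util.find_spec('concurrent.interpreters')
    private = importlib.util.find_spec('_interpreters')
    return public is not None and private is not None


def require_subints(backend: str) -> None:
    """Refuse to engage a subint-based backend on unsupported Pythons."""
    if not has_subints():
        found = '.'.join(map(str, sys.version_info[:2]))
        raise RuntimeError(
            f'The {backend!r} spawn backend requires Python 3.14+ '
            f'(found {found}); py3.13 ships the private module '
            f'but wedges under multi-trio usage.'
        )
```

Probing with find_spec() rather than a try/except import keeps the check side-effect free, which matters if the gate is evaluated at module import time.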

pytest-timeout dep + skipon_spawn_backend marker plumbing

Added pytest-timeout>=2.3 to the testing dep group with an inline comment pointing at the ai/conc-anal/*.md docs. Applied @pytest.mark.timeout(30, method='thread') to the three known-hanging subint tests cataloged in subint_sigint_starvation_issue.md. The method='thread' is load-bearing: signal-method SIGALRM suffers the same GIL-starvation path that drops SIGINT in the class-A hang pattern.

Separately code-reviewed the user's newly-staged skipon_spawn_backend pytest marker implementation in tractor/_testing/pytest.py. Found four bugs:

  1. modmark.kwargs.get(reason) called .get() with the variable reason as the dict key instead of the string 'reason' — user-supplied reason= was never picked up. (User had already fixed this locally via .get('reason', reason) by the time my review happened — preserved that fix.)
  2. The module-level pytestmark branch suppressed per-test marker handling (per-test markers were handled in the else: arm of the module-level check rather than by independent iteration).
  3. mod_pytestmark.mark assumed a single MarkDecorator — broke on the valid-pytest pytestmark = [mark, mark] list form.
  4. Typo: pytest.Makr → pytest.Mark.

Refactored the hook to use item.iter_markers(name=...), which walks function + class + module scopes uniformly and handles both pytestmark forms natively. ~30 LOC replaced the original ~30 LOC of nested conditionals, and all four bugs dissolved. Also updated the marker help string to reflect the variadic *start_methods + reason= surface.
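A rough sketch of what a single-loop hook built on item.iter_markers() can look like follows. It is not the code in tractor/_testing/pytest.py: the helper name find_skip_reason() and the default reason string are hypothetical, and reading the active backend from the --spawn-backend option is an assumption about how the hook learns it.

```python
from typing import Optional

MARKER = 'skipon_spawn_backend'


def find_skip_reason(item, backend: str) -> Optional[str]:
    """Return a skip reason if any marker on `item` names `backend`.

    iter_markers() yields markers from function, class and module
    scope, so both `pytestmark = mark` and `pytestmark = [mark, ...]`
    are covered without special-casing.
    """
    for mark in item.iter_markers(name=MARKER):
        if backend in mark.args:
            # String key 'reason'; the fallback text is illustrative.
            return mark.kwargs.get(
                'reason',
                f'skipped on spawn backend {backend!r}',
            )
    return None


def pytest_collection_modifyitems(config, items) -> None:
    # Deferred import keeps the helper importable without pytest.
    import pytest

    backend = config.getoption('--spawn-backend')  # assumed option
    for item in items:
        reason = find_skip_reason(item, backend)
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Keeping the matching logic in a plain function makes it checkable against a stub object that only implements iter_markers(), without running a pytest session.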

subint_fork_proc prototype attempt

User's hypothesis: the known trio+fork() issues (python-trio/trio#1614) could be sidestepped by using a sub-interpreter purely as a launchpad: os.fork() from a subint that has never imported trio, so the child starts in a trio-free context. The child then execv()s back into python -m tractor._child, and the downstream handshake matches trio_proc() identically.

Drafted the prototype at the bottom of tractor/spawn/_subint.py (originally; later moved to its own submod, see below): launchpad-subint creation, bootstrap code-string with os.fork() + execv(), driver-thread orchestration, parent-side ipc_server.wait_for_peer() dance. Registered 'subint_fork' as a new SpawnMethodKey literal, added a case 'subint' | 'subint_fork': feature-gate arm in try_set_start_method(), and added an entry in the _methods dict.

CPython-level block discovered

User tested on py3.14 and saw:

Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
Python runtime state: initialized

Current thread 0x00007f6b71a456c0 [subint-fork-lau] (most recent call first):
  File "<script>", line 2 in <module>
<script>:2: DeprecationWarning: This process (pid=802985) is multi-threaded, use of fork() may lead to deadlocks in the child.

Walked CPython sources (local clone at ~/repos/cpython/):

  • Modules/posixmodule.c:728 PyOS_AfterFork_Child() — post-fork child-side cleanup. Calls _PyInterpreterState_DeleteExceptMain(runtime) with goto fatal_error on non-zero status. Has the // Ideally we could guarantee tstate is running main. self-acknowledging-fragile comment directly above.

  • Python/pystate.c:1040 _PyInterpreterState_DeleteExceptMain() — the refusal. Hard PyStatus_ERR("not main interpreter") gate when tstate->interp != interpreters->main. Docstring formally declares the precondition (“If there is a current interpreter state, it must be the main interpreter”). XXX comments acknowledge further latent issues within.

Definitive answer to “Open Question 1” of the prototype docstring: no, CPython does not support os.fork() from a non-main sub-interpreter. Not because the fork syscall is blocked (it isn't; the parent returns a valid pid), but because the child cannot survive CPython's post-fork initialization. This is an enforced invariant, not an incidental limitation.
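For reference, the failing shape can be reduced to roughly the following. This is a hedged sketch, not the reverted prototype: make_launchpad_script() and fork_from_subint() are hypothetical names, and the _interpreters calls are private CPython API whose signatures may change between releases.

```python
import textwrap


def make_launchpad_script(child_argv: list[str]) -> str:
    """Source text run inside the launchpad subint: fork, then exec."""
    return textwrap.dedent(f"""\
        import os
        pid = os.fork()
        if pid == 0:
            os.execv({child_argv[0]!r}, {child_argv!r})
    """)


def fork_from_subint(child_argv: list[str]) -> None:
    """Run the launchpad script in a fresh legacy-config subint.

    Per the walkthrough above, on py3.14 the forked child is expected
    to abort in PyOS_AfterFork_Child() before reaching execv().
    """
    import _interpreters  # private module; assumption: 3.13+ layout

    interp_id = _interpreters.create('legacy')
    try:
        _interpreters.exec(interp_id, make_launchpad_script(child_argv))
    finally:
        _interpreters.destroy(interp_id)


# Deliberately not invoked here. To reproduce, call from a driver
# thread, e.g.:
#   threading.Thread(
#       target=fork_from_subint,
#       args=([sys.executable, '-m', 'tractor._child'],),
#   ).start()
```

Nothing in the script itself is unusual; the refusal happens in CPython's child-side cleanup, which is why the parent still sees a valid pid.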

Revert: move to stub submod + doc the finding

Per user request:

  1. Reverted the working subint_fork_proc body to a NotImplementedError stub, MOVED to its own submod tractor/spawn/_subint_fork.py (keeps _subint.py focused on the working subint_proc backend).
  2. Updated _spawn.py to import the stub from the new submod path; kept 'subint_fork' in SpawnMethodKey + _methods so --spawn-backend=subint_fork routes to a clean NotImplementedError with pointer to the analysis doc rather than an “invalid backend” error.
  3. Wrote ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md with the full annotated CPython walkthrough + an upstream-report draft for the CPython issue tracker. Draft has a two-tier ask: ideally “make it work” (pre-fork tstate-swap hook or DeleteExceptFor(interp) variant), minimally “give us a clean RuntimeError in the parent instead of a Fatal Python error aborting the child silently”.

Design discussion — main-interp-thread forkserver workaround

User proposed: set up a “subint forking server” that fork()s on behalf of subint callers. Core insight: the CPython gate is on tstate->interp, not thread identity, so any thread whose tstate is main-interp can fork cleanly. A worker thread attached to main-interp (never entering a subint) satisfies the precondition.

Structurally this is mp.forkserver (which tractor already has as mp_forkserver) but in-process: instead of a separate Python subproc as the fork server, we'd put the forkserver on a thread in the tractor parent process. Pros: faster spawn (no IPC marshalling to external server + no separate Python startup), inherits already-imported modules for free. Cons: less crash isolation (forkserver failure takes the whole process).

Required tractor-side refactor: move the root actor's trio.run() off main-interp-main-thread (so main-thread can run the forkserver loop). Nontrivial; approximately the same magnitude as “Phase C”.

The design would also not fully resolve the class-A GIL-starvation issue because the child actors' trio still runs inside subints (legacy config, msgspec PEP 684 pending). It would mitigate SIGINT-starvation specifically if signal handling moves to the forkserver thread.

Recommended pre-commitment: a standalone CPython-only smoke test validating the four assumptions the arch rests on, before any tractor-side work.
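The core of the proposed in-process forkserver can be sketched with the standard library alone. This is an illustration of the idea, not tractor code and not the smoke-test script: forkserver_loop() and request_fork() are hypothetical names, and the sketch only covers a thread that stays in the main interpreter, with no subints involved.

```python
import os
import queue
import threading
from typing import Callable


def forkserver_loop(requests: queue.Queue) -> None:
    """Fork on behalf of callers from a main-interpreter thread."""
    while True:
        req = requests.get()
        if req is None:  # shutdown sentinel
            return
        child_main, reply = req
        pid = os.fork()
        if pid == 0:
            # Child: run the payload, then exit without unwinding
            # the parent's stack or flushing inherited buffers.
            code = 1
            try:
                code = child_main()
            finally:
                os._exit(code)
        reply.put(pid)  # parent side: hand the pid back


def request_fork(
    requests: queue.Queue,
    child_main: Callable[[], int],
) -> int:
    """Ask the forkserver thread to fork; returns the child's pid."""
    reply: queue.Queue = queue.Queue(maxsize=1)
    requests.put((child_main, reply))
    return reply.get()
```

A caller would start forkserver_loop() on a threading.Thread, call request_fork() with a payload returning an exit code, and reap the child with os.waitpid(). On recent Pythons, forking from a multi-threaded process emits a DeprecationWarning like the one in the fatal-error output above; whether the pattern holds once subints are also running in the process is exactly what the recommended smoke test is meant to establish.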

Smoke-test script drafted

Wrote ai/conc-anal/subint_fork_from_main_thread_smoketest.py: argparse-driven, four scenarios (control_subint_thread_fork reproducing the known-broken case, main_thread_fork baseline, worker_thread_fork the architectural assertion, full_architecture end-to-end with trio in a subint in the forked child). No tractor imports; pure CPython + _interpreters + trio. Bails cleanly on py<3.14. Pass/fail banners per scenario.

User will validate on their py3.14 env next.

Per-code-artifact provenance

tractor/spawn/_subint_fork.py (new submod)

git show 797f57c -- tractor/spawn/_subint_fork.py

NotImplementedError stub for the subint-fork backend. Module docstring + fn docstring explain the attempt, the CPython block, and why the stub is kept in-tree. No runtime behavior beyond raising with a pointer at the conc-anal doc.

tractor/spawn/_spawn.py (modified)

git log 26fb820..HEAD -- tractor/spawn/_spawn.py

  • Added 'subint_fork' to SpawnMethodKey literal with a block comment explaining the CPython-level block.
  • Generalized the case 'subint': arm to case 'subint' | 'subint_fork': since both use the same py3.14+ gate.
  • Registered subint_fork_proc in _methods with a pointer-comment at the analysis doc.

tractor/spawn/_subint.py (modified across session)

git log 26fb820..HEAD -- tractor/spawn/_subint.py

  • Tightened _has_subints gate: dual-requires public concurrent.interpreters + private _interpreters (tests for py3.14-or-newer on the public-API presence, then uses the private one for legacy-config subints because msgspec still blocks the public isolated mode per jcrist/msgspec#563).
  • Updated module docstring, subint_proc() docstring, and gate-error messages to reflect the 3.14+ requirement and the reason (py3.13 wedges under multi-trio usage even though the private module exists there).

tractor/_testing/pytest.py (modified)

git log 26fb820..HEAD -- tractor/_testing/pytest.py

  • New skipon_spawn_backend(*start_methods, reason=...) pytest marker expanded into pytest.mark.skip(reason=...) at collection time via pytest_collection_modifyitems().
  • Implementation uses item.iter_markers(name=...) which walks function + class + module scopes uniformly and handles both pytestmark = <single Mark> and pytestmark = [mark, ...] forms natively. ~30-LOC single-loop refactor replacing a prior nested conditional that had four bugs (see the skipon_spawn_backend review narrative above).
  • Added pytest.Config / pytest.Function / pytest.FixtureRequest type annotations on fixture signatures while touching the file.

pyproject.toml (modified)

git log 26fb820..HEAD -- pyproject.toml

Added pytest-timeout>=2.3 to testing dep group with comment pointing at the ai/conc-anal/ docs.

tests/discovery/test_registrar.py, tests/test_subint_cancellation.py, tests/test_cancellation.py (modified)

git log 26fb820..HEAD -- tests/

Applied @pytest.mark.timeout(30, method='thread') on known-hanging subint tests. Extended comments to cross-reference the ai/conc-anal/*.md docs. method='thread' is documented inline as load-bearing (signal-method SIGALRM suffers the same GIL-starvation path that drops SIGINT).

ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md (new)

git show 797f57c -- ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md

Third sibling doc under conc-anal/. Structure: TL;DR, context (“what we tried”), symptom (the user's exact Fatal Python error output), CPython source walkthrough with excerpted snippets from posixmodule.c + pystate.c, chain summary, definitive answer to Open Question 1, ## Upstream-report draft (for CPython issue tracker) section with a two-tier ask, references.

ai/conc-anal/subint_fork_from_main_thread_smoketest.py (new, THIS turn)

Zero-tractor-import smoke test for the proposed workaround architecture. Four argparse-driven scenarios covering the control case + baseline + arch-critical case + end-to-end. Pass/fail banners per scenario; clean --help output; py3.13 early-exit.

Non-code output (verbatim)

The strace signature that kicked off the CPython walkthrough

--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1)                      = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]})            = 139801964688928

Key user quotes framing the direction

ok actually we get this [fatal error] … see if you can take a look at what's going on, in particular wrt to cpython's sources. pretty sure there's a local copy at ~/repos/cpython/

(Drove the CPython walkthrough that produced the definitive refusal chain.)

is there any reason we can't just sidestep this “must fork from main thread in main subint” issue by simply ensuring a “subint forking server” is always setup prior to invoking trio in a non-main-thread subint …

(Drove the main-interp-thread-forkserver architectural discussion + smoke-test script design.)

CPython source tags for quick jump-back

Modules/posixmodule.c:728   PyOS_AfterFork_Child()
Modules/posixmodule.c:753   // Ideally we could guarantee tstate is running main.
Modules/posixmodule.c:778   status = _PyInterpreterState_DeleteExceptMain(runtime);

Python/pystate.c:1040       _PyInterpreterState_DeleteExceptMain()
Python/pystate.c:1044-1047  tstate->interp != main → PyStatus_ERR("not main interpreter")