---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-22T20:07:23Z
git_ref: 797f57c
diff_cmd: git log 26fb820..HEAD # all session commits since the destroy-race fix log
---

Session-spanning conversation covering the Phase B hardening
of the `subint` spawn-backend and an investigation into a
proposed `subint_fork` follow-up which turned out to be
blocked at the CPython level. This log is a narrative capture
of the substantive turns (not every message) and references
the concrete code + docs the session produced. Per diff-ref
mode the actual code diffs are pointed at via `git log` on
each ref rather than duplicated inline.

## Narrative of the substantive turns

### Py3.13 hang / gate tightening

Diagnosed a reproducible hang of the `subint` backend under
py3.13 (`test_spawning` tests wedge after root-actor bringup).
Root cause: py3.13's vintage of the private `_interpreters` C
module has a latent thread/subint-interaction issue in which
`_interpreters.exec()` silently fails to progress under
tractor's multi-trio usage pattern, even though a minimal
standalone `threading.Thread` + `_interpreters.exec()`
reproducer works fine on the same Python. Empirically
py3.14 fixes it.

Fix (from this session): tighten the `_has_subints` gate in
`tractor.spawn._subint` from "private module importable" to
"public `concurrent.interpreters` present", which is 3.14+
only. This leaves `subint_proc()` unchanged in behavior (we
still call the *private* `_interpreters.create('legacy')`
etc. under the hood) but refuses to engage on 3.13.

Also tightened the matching gate in
`tractor.spawn._spawn.try_set_start_method('subint')` and
rev'd the corresponding error messages from "3.13+" to
"3.14+" with a sentence explaining why. Test-module
`pytest.importorskip` switched from `_interpreters` →
`concurrent.interpreters` to match.

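The tightened gate can be sketched roughly as below. This is a minimal illustration, not the actual `tractor.spawn._subint` code: the helper name `has_subints()` is hypothetical, and only the two module names come from the notes above.

```python
import importlib.util
import sys


def has_subints() -> bool:
    '''
    Hypothetical stand-in for the `_has_subints` gate: require the
    public `concurrent.interpreters` module (py3.14+) AND the
    private `_interpreters` module that is used under the hood.

    '''
    try:
        public = importlib.util.find_spec('concurrent.interpreters')
        private = importlib.util.find_spec('_interpreters')
    except (ImportError, ValueError):
        return False
    return public is not None and private is not None


if __name__ == '__main__':
    ver: str = f'py{sys.version_info.major}.{sys.version_info.minor}'
    print(f'{ver}: subints usable = {has_subints()}')
```

Checking for the public module is only a version proxy here; the private module is still the one a legacy-config subint would be created through.
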
### `pytest-timeout` dep + `skipon_spawn_backend` marker plumbing

Added `pytest-timeout>=2.3` to the `testing` dep group with
an inline comment pointing at the `ai/conc-anal/*.md` docs.
Applied `@pytest.mark.timeout(30, method='thread')` to the
three known-hanging subint tests cataloged in
`subint_sigint_starvation_issue.md`. The `method='thread'`
is load-bearing: `signal`-method `SIGALRM` suffers the same
GIL-starvation path that drops `SIGINT` in the class-A hang
pattern.

Separately code-reviewed the user's newly-staged
`skipon_spawn_backend` pytest marker implementation in
`tractor/_testing/pytest.py`. Found four bugs:

1. `modmark.kwargs.get(reason)` called `.get()` with the
   *variable* `reason` as the dict key instead of the string
   `'reason'`, so a user-supplied `reason=` was never picked
   up. (User had already fixed this locally via
   `.get('reason', reason)` by the time my review happened;
   that fix was preserved.)
2. The module-level `pytestmark` branch suppressed per-test
   marker handling (the per-test branch was an `else:` rather
   than an independent iteration).
3. `mod_pytestmark.mark` assumed a single
   `MarkDecorator` and broke on the valid-pytest
   `pytestmark = [mark, mark]` list form.
4. Typo: `pytest.Makr` → `pytest.Mark`.

Refactored the hook to use `item.iter_markers(name=...)`
which walks function + class + module scopes uniformly and
handles both `pytestmark` forms natively. ~30 LOC replaced
the original ~30 LOC of nested conditionals, and all four
bugs dissolved. Also updated the marker help string to
reflect the variadic `*start_methods` + `reason=` surface.

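A rough sketch of what such a single-loop hook could look like follows. It is not the code in `tractor/_testing/pytest.py`: the helper `skip_reason()`, the default reason text, and reading the backend via `config.getoption('--spawn-backend')` are all assumptions made for illustration. Only `item.iter_markers()`, `item.add_marker()` and `pytest.mark.skip()` are real pytest APIs.

```python
from __future__ import annotations

from types import SimpleNamespace


def skip_reason(markers, backend: str) -> str | None:
    '''
    Return a skip reason if any `skipon_spawn_backend` marker names
    `backend`, else `None`.

    `markers` is whatever `item.iter_markers(name=...)` yields:
    objects with `.args` (the variadic start-methods) and `.kwargs`.

    '''
    for mark in markers:
        if backend in mark.args:
            default: str = f'unsupported on {backend!r} spawn backend'
            # string key 'reason', not a variable (bug 1 above)
            return mark.kwargs.get('reason', default)
    return None


def pytest_collection_modifyitems(config, items) -> None:
    # lazy import so this sketch loads without pytest installed
    import pytest

    # ASSUMPTION: the active backend is exposed via this CLI option.
    backend: str = config.getoption('--spawn-backend')
    for item in items:
        # one loop over function + class + module scoped markers
        reason = skip_reason(
            item.iter_markers(name='skipon_spawn_backend'),
            backend,
        )
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))


if __name__ == '__main__':
    # stand-in for a real `pytest.Mark`, which has the same attrs
    mark = SimpleNamespace(args=('subint',), kwargs={'reason': 'hangs'})
    print(skip_reason([mark], 'subint'))
```

Because `iter_markers()` already flattens every scope and both `pytestmark` forms into one stream of marks, the hook needs no module-vs-test branching at all.
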
### `subint_fork_proc` prototype attempt

User's hypothesis: the known trio+`fork()` issues
(python-trio/trio#1614) could be sidestepped by using a
sub-interpreter purely as a launchpad: `os.fork()` from a
subint that has never imported trio → child is in a
trio-free context. The child then `execv()`s back into
`python -m tractor._child`, so the downstream handshake
matches `trio_proc()` identically.

Drafted the prototype at the bottom of
`tractor/spawn/_subint.py` (originally; later moved to its
own submod, see below): launchpad-subint creation, bootstrap
code-string with `os.fork()` + `execv()`, driver-thread
orchestration, parent-side `ipc_server.wait_for_peer()`
dance. Registered `'subint_fork'` as a new `SpawnMethodKey`
literal, added a `case 'subint' | 'subint_fork':`
feature-gate arm in `try_set_start_method()`, and added an
entry in the `_methods` dict.

### CPython-level block discovered

User tested on py3.14 and saw:

```
Fatal Python error: _PyInterpreterState_DeleteExceptMain: not main interpreter
Python runtime state: initialized

Current thread 0x00007f6b71a456c0 [subint-fork-lau] (most recent call first):
  File "<script>", line 2 in <module>
<script>:2: DeprecationWarning: This process (pid=802985) is multi-threaded, use of fork() may lead to deadlocks in the child.
```

Walked CPython sources (local clone at `~/repos/cpython/`):

- **`Modules/posixmodule.c:728` `PyOS_AfterFork_Child()`**:
  post-fork child-side cleanup. Calls
  `_PyInterpreterState_DeleteExceptMain(runtime)` with
  `goto fatal_error` on non-zero status. Has the
  `// Ideally we could guarantee tstate is running main.`
  self-acknowledging-fragile comment directly above.

- **`Python/pystate.c:1040`
  `_PyInterpreterState_DeleteExceptMain()`**: the
  refusal. Hard `PyStatus_ERR("not main interpreter")` gate
  when `tstate->interp != interpreters->main`. Docstring
  formally declares the precondition ("If there is a
  current interpreter state, it *must* be the main
  interpreter"). `XXX` comments acknowledge further latent
  issues within.

Definitive answer to "Open Question 1" of the prototype
docstring: **no, CPython does not support `os.fork()` from
a non-main sub-interpreter**. Not because the fork syscall
is blocked (it isn't; the parent gets a valid pid back),
but because the child cannot survive CPython's post-fork
initialization. This is an enforced invariant, not an
incidental limitation.

### Revert: move to stub submod + doc the finding

Per user request:

1. Reverted the prototype `subint_fork_proc` body to a
   `NotImplementedError` stub, MOVED to its own submod
   `tractor/spawn/_subint_fork.py` (keeps `_subint.py`
   focused on the working `subint_proc` backend).
2. Updated `_spawn.py` to import the stub from the new
   submod path; kept `'subint_fork'` in `SpawnMethodKey` +
   `_methods` so `--spawn-backend=subint_fork` routes to a
   clean `NotImplementedError` with a pointer to the analysis
   doc rather than an "invalid backend" error.
3. Wrote
   `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`
   with the full annotated CPython walkthrough + an
   upstream-report draft for the CPython issue tracker.
   Draft has a two-tier ask: ideally "make it work"
   (pre-fork tstate-swap hook or `DeleteExceptFor(interp)`
   variant), minimally "give us a clean `RuntimeError` in
   the parent instead of a `Fatal Python error` aborting
   the child silently".

### Design discussion: main-interp-thread forkserver workaround

User proposed: set up a "subint forking server" that fork()s
on behalf of subint callers. Core insight: the CPython gate
is on `tstate->interp`, not thread identity, so **any thread
whose tstate is main-interp** can fork cleanly. A worker
thread attached to main-interp (never entering a subint)
satisfies the precondition.

Structurally this is `mp.forkserver` (which tractor already
has as `mp_forkserver`) but **in-process**: instead of a
separate Python subproc as the fork server, we'd put the
forkserver on a thread in the tractor parent process. Pros:
faster spawn (no IPC marshalling to external server + no
separate Python startup), inherits already-imported modules
for free. Cons: less crash isolation (a forkserver failure
takes down the whole process).

Required tractor-side refactor: move the root actor's
`trio.run()` off main-interp-main-thread (so main-thread can
run the forkserver loop). Nontrivial; approximately the same
magnitude as "Phase C".

The design would also not fully resolve the class-A
GIL-starvation issue because child actors' trio still runs
inside subints (legacy config, msgspec PEP 684 pending).
It would mitigate SIGINT-starvation specifically if signal
handling moves to the forkserver thread.

Recommended pre-commitment: a standalone CPython-only smoke
test validating the four assumptions the arch rests on,
before any tractor-side work.

### Smoke-test script drafted

Wrote `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`:
argparse-driven, four scenarios (`control_subint_thread_fork`
reproducing the known-broken case, `main_thread_fork`
baseline, `worker_thread_fork` the architectural assertion,
`full_architecture` end-to-end with trio in a subint in the
forked child). No `tractor` imports; pure CPython + `_interpreters`
+ `trio`. Bails cleanly on py<3.14. Pass/fail banners per
scenario.

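For orientation, the shape of the `worker_thread_fork` scenario might look like the sketch below. It is not taken from the smoke-test script: the function bodies, the thread name and the banner format are assumptions, and it needs a POSIX platform with `os.fork()`. It only demonstrates forking from a non-main thread whose tstate belongs to the main interpreter, not any subint behavior.

```python
import os
import threading


def fork_and_wait() -> int:
    '''
    Fork from the calling thread, exit the child immediately and
    return the child's exit code as seen by the parent.

    '''
    pid: int = os.fork()
    if pid == 0:
        # child: skip all cleanup (atexit, buffered IO) and exit
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


def worker_thread_fork() -> int:
    '''
    Run `fork_and_wait()` on a non-main thread that never enters a
    subint, so its tstate stays attached to the main interpreter.

    '''
    result: list[int] = []
    worker = threading.Thread(
        target=lambda: result.append(fork_and_wait()),
        name='forkserver-worker',
    )
    worker.start()
    worker.join()
    return result[0]


if __name__ == '__main__':
    rc: int = worker_thread_fork()
    banner: str = 'PASS' if rc == 0 else 'FAIL'
    print(f'worker_thread_fork: {banner} (rc={rc})')
```

On recent Pythons the fork may emit the same multi-threaded `DeprecationWarning` seen in the fatal-error output above; in this scenario it is only a warning and the child still exits cleanly.
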
User will validate on their py3.14 env next.

## Per-code-artifact provenance

### `tractor/spawn/_subint_fork.py` (new submod)

> `git show 797f57c -- tractor/spawn/_subint_fork.py`

`NotImplementedError` stub for the subint-fork backend. Module
docstring + fn docstring explain the attempt, the CPython
block, and why the stub is kept in-tree. No runtime behavior
beyond raising with a pointer at the conc-anal doc.

### `tractor/spawn/_spawn.py` (modified)

> `git log 26fb820..HEAD -- tractor/spawn/_spawn.py`

- Added `'subint_fork'` to `SpawnMethodKey` literal with a
  block comment explaining the CPython-level block.
- Generalized the `case 'subint':` arm to
  `case 'subint' | 'subint_fork':` since both use the same
  py3.14+ gate.
- Registered `subint_fork_proc` in `_methods` with a
  pointer-comment at the analysis doc.

### `tractor/spawn/_subint.py` (modified across session)

> `git log 26fb820..HEAD -- tractor/spawn/_subint.py`

- Tightened `_has_subints` gate: dual-requires public
  `concurrent.interpreters` + private `_interpreters`
  (tests for py3.14-or-newer on the public-API presence,
  then uses the private one for legacy-config subints
  because `msgspec` still blocks the public isolated mode
  per jcrist/msgspec#563).
- Updated module docstring, `subint_proc()` docstring, and
  gate-error messages to reflect the 3.14+ requirement and
  the reason (py3.13 wedges under multi-trio usage even
  though the private module exists there).

### `tractor/_testing/pytest.py` (modified)

> `git log 26fb820..HEAD -- tractor/_testing/pytest.py`

- New `skipon_spawn_backend(*start_methods, reason=...)`
  pytest marker expanded into `pytest.mark.skip(reason=...)`
  at collection time via
  `pytest_collection_modifyitems()`.
- Implementation uses `item.iter_markers(name=...)` which
  walks function + class + module scopes uniformly and
  handles both `pytestmark = <single Mark>` and
  `pytestmark = [mark, ...]` forms natively. ~30-LOC
  single-loop refactor replacing a prior nested
  conditional that had four bugs (see the
  `skipon_spawn_backend` review narrative above).
- Added `pytest.Config` / `pytest.Function` /
  `pytest.FixtureRequest` type annotations on fixture
  signatures while touching the file.

### `pyproject.toml` (modified)

> `git log 26fb820..HEAD -- pyproject.toml`

Added `pytest-timeout>=2.3` to `testing` dep group with
comment pointing at the `ai/conc-anal/` docs.

### `tests/discovery/test_registrar.py`, `tests/test_subint_cancellation.py`, `tests/test_cancellation.py` (modified)

> `git log 26fb820..HEAD -- tests/`

Applied `@pytest.mark.timeout(30, method='thread')` on
known-hanging subint tests. Extended comments to
cross-reference the `ai/conc-anal/*.md` docs. `method='thread'`
is documented inline as load-bearing (`signal`-method
SIGALRM suffers the same GIL-starvation path that drops
SIGINT).

### `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md` (new)

> `git show 797f57c -- ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`

Third sibling doc under `conc-anal/`. Structure: TL;DR,
context ("what we tried"), symptom (the user's exact
`Fatal Python error` output), CPython source walkthrough
with excerpted snippets from `posixmodule.c` +
`pystate.c`, chain summary, definitive answer to Open
Question 1, `## Upstream-report draft (for CPython issue
tracker)` section with a two-tier ask, references.

### `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` (new, THIS turn)

Zero-tractor-import smoke test for the proposed workaround
architecture. Four argparse-driven scenarios covering the
control case + baseline + arch-critical case + end-to-end.
Pass/fail banners per scenario; clean `--help` output;
py3.13 early-exit.

## Non-code output (verbatim)

### The `strace` signature that kicked off the CPython walkthrough

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(16, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigreturn({mask=[WINCH]}) = 139801964688928
```

### Key user quotes framing the direction

> ok actually we get this [fatal error] ... see if you can
> take a look at what's going on, in particular wrt to
> cpython's sources. pretty sure there's a local copy at
> ~/repos/cpython/

(Drove the CPython walkthrough that produced the
definitive refusal chain.)

> is there any reason we can't just sidestep this "must fork
> from main thread in main subint" issue by simply ensuring
> a "subint forking server" is always setup prior to
> invoking trio in a non-main-thread subint ...

(Drove the main-interp-thread-forkserver architectural
discussion + smoke-test script design.)

### CPython source tags for quick jump-back

```
Modules/posixmodule.c:728   PyOS_AfterFork_Child()
Modules/posixmodule.c:753   // Ideally we could guarantee tstate is running main.
Modules/posixmodule.c:778   status = _PyInterpreterState_DeleteExceptMain(runtime);

Python/pystate.c:1040       _PyInterpreterState_DeleteExceptMain()
Python/pystate.c:1044-1047  tstate->interp != main → PyStatus_ERR("not main interpreter")
```