Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.
Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.
Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.
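The mechanism in miniature (illustrative only, no
pytest/tractor involved; Linux; hangs by design):

```python
import os

r, w = os.pipe()              # stand-in for pytest's capture pipe
pid = os.fork()
if pid == 0:
    os.dup2(w, 1)             # child stdout -> pipe write-end
    os.close(r)
    for _ in range(2048):     # ~128KiB >> the 64KiB pipe buffer
        os.write(1, b'x' * 64)  # blocks once the buffer fills
    os._exit(0)               # never reached with no reader
os.close(w)
os.waitpid(pid, 0)            # parent never drains `r` -> hangs here
```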
Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.
Deats,
- `subint_forkserver_test_cancellation_leak_issue.md`:
new section "Update — VERY late: pytest capture
pipe IS the final gate" w/ DIAG timeline
showing `trio.run` fully returns, diagnosis of
pipe-fill mechanism, retrospective on the
earlier wrong ruling-out, and fix direction
(redirect subactor stdout/stderr to `/dev/null`
in fork-child prelude, conditional on
pytest-detection or opt-in flag; see the sketch
at the end of this msg)
- `tests/test_cancellation.py`: skip-mark reason
rewritten to describe the capture-pipe gate
specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
orphan-SIGINT test regresses to xfail. It
passed after the FD-hygiene fix, but the new
3s `move_on_after` bound on
`wait_for_no_more_peers()` in `async_main`'s
teardown added up to 3s of latency, pushing
orphan-subactor exit past the test's 10s poll
window. Real fix: faster orphan-side teardown
OR extend the poll window to 15s
No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.
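Sketch of the doc's fix direction (hypothetical
helper; nothing here landed in this commit, and
the env-var check is just one possible
pytest-detection heuristic):

```python
import os

def _maybe_mute_child_stdio() -> None:
    # fork-child prelude: point fds 1/2 at /dev/null so
    # subactor tracebacks can't block on an unread
    # capture pipe; `PYTEST_CURRENT_TEST` is set by
    # pytest during test runs
    if not os.environ.get('PYTEST_CURRENT_TEST'):
        return
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)  # stdout
    os.dup2(devnull, 2)  # stderr
    os.close(devnull)
```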
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically
`await ipc_server.wait_for_no_more_peers()` never
returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.
Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defense-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.
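In sketch form (names from this msg; the landed
diff may differ):

```python
import trio

async def _bounded_peer_clear(ipc_server, log) -> None:
    # local finally-block teardown must not hinge on
    # remote cooperation: give cancel-acks 3s, then
    # proceed and warn
    with trio.move_on_after(3.0) as cs:
        await ipc_server.wait_for_no_more_peers()
    if cs.cancelled_caught:
        log.warning(
            f'Timed out clearing peers after 3s; '
            f'{len(ipc_server._peers)} still connected'
        )
```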
Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).
Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.
Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Hypotheses ruled out as wrong: missing
tree-kill, stuck socket recv, capture-pipe fill
6. Remaining unknown: at least one more unbounded
wait in the teardown cascade above `async_main`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` /
`child_target RETURNED rc=N` /
`about to os._exit(rc)`) and `_trio_main`
boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.
Fresh-run results,
- **9 processes complete the full flow**:
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `os._exit(0)`. These are tree
LEAVES (errorers) plus their direct parents
(depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_main)`**:
hit "about to trio.run" but never see
"trio.run RETURNED NORMALLY". These are
root + top-level spawners + one intermediate
The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:
trio.run never returns
→ _trio_main finally never runs
→ _worker never reaches os._exit(rc)
→ process never dies
→ parent's _ForkedProc.wait() blocks
→ parent's nursery hangs
→ parent's async_main hangs
→ (recurse up)
The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
in `root_tn` — but we cancel it via
`_parent_chan_cs.cancel()` in `Actor.cancel()`,
which only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken in
this path.
2. `actor_nursery._join_procs.wait()` or similar
inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
exit — but pidfd_open watch didn't fire (race
between `pidfd_open` and the child exiting?).
Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.
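For reference, the basic pidfd-wait shape the
probe targets (a sketch under assumptions, not
the repo's actual `_ForkedProc.wait`):

```python
import os
import trio

async def pidfd_wait(pid: int) -> int:
    # a pidfd polls readable once the process exits
    # (including if it already exited pre-open while
    # still a zombie), so trio can park on it
    # cancellably instead of blocking in os.waitpid()
    pidfd = os.pidfd_open(pid)  # Linux >= 5.3, py >= 3.9
    try:
        await trio.lowlevel.wait_readable(pidfd)
    finally:
        os.close(pidfd)
    # reap; WNOHANG is safe since exit already signalled
    _, status = os.waitpid(pid, os.WNOHANG)
    return os.waitstatus_to_exitcode(status)
```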
Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.
Ruled out,
- **capture-pipe fill** (the "`-s` changes behavior"
observation): retested explicitly —
`test_nested_multierrors` hangs identically with
and without `-s`. The earlier observation was
likely a competing pytest process I had running
in another session, holding registry state
- **stuck peer-chan recv that cancel can't
break**: pivot from the prior pass. With
`handle_stream_from_peer` instrumented at ENTER
/ `except trio.Cancelled:` / finally: 40
ENTERs, ZERO `trio.Cancelled` hits. Cancel never
reaches those tasks at all — the recvs are
fine, nothing is telling them to stop
Actual deadlock shape: multi-level mutual wait.
root blocks on spawner.wait()
spawner blocks on grandchild.wait()
grandchild blocks on errorer.wait()
errorer Actor.cancel() ran, but proc
never exits
`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.
Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
shielded loop's recv is stuck in a way cancel
still can't break
2. `async_main`'s post-cancel unwind has other
tasks in `root_tn` awaiting something that
never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
`_child_target` never returns
Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
confirm whether `child_target()` returns /
`os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
equivalent level to spot divergence
No code changes — diagnosis-only.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:
1. **"Attempted fix (DID NOT work) — hypothesis
(3)"**: tried sync-closing peer channels' raw
socket fds from `_serve_ipc_eps`'s finally block
(iterate `server._peers`, `_chan._transport.
stream.socket.close()`). Theory was that sync
close would propagate as `EBADF` /
`ClosedResourceError` into the stuck
`recv_some()` and unblock it. Result: identical
hang. Either trio holds an internal fd
reference that survives external close, or the
stuck recv isn't even the root blocker. Either
way: ruled out, experiment reverted (reconstructed
below for the record), skip-mark restored.
2. **"Aside: `-s` flag changes behavior for peer-
intensive tests"**: noticed
`test_context_stream_semantics.py` under
`subint_forkserver` hangs with default
`--capture=fd` but passes with `-s`
(`--capture=no`). Working hypothesis: subactors
inherit pytest's capture pipe (fds 1,2 — which
`_close_inherited_fds` deliberately preserves);
verbose subactor logging fills the buffer,
writes block, deadlock. Fix direction (if
confirmed): redirect subactor stdout/stderr to
`/dev/null` or a file in `_actor_child_main`.
Not a blocker on the main investigation;
deserves its own mini-tracker.
Both sections are diagnosis-only — no code changes
in this commit.
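Roughly what the reverted experiment looked like
(reconstructed from the notes above; attribute
path per the msg, peer-map shape assumed):

```python
# in `_serve_ipc_eps`'s finally block (REVERTED):
for chans in server._peers.values():  # assumed: uid -> [Channel]
    for _chan in chans:
        try:
            # sync-close the raw socket, hoping EBADF /
            # ClosedResourceError propagates into the
            # stuck recv_some()
            _chan._transport.stream.socket.close()
        except OSError:
            pass  # already torn down
```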
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context; one possible conftest hook
shape for the custom marker is sketched below.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested-cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded and they're
inside the `service_tn` scope, so a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:
1. **5-zombie leak holding `:1616`** — turned out to
be a self-inflicted cleanup bug: `pkill`-ing a bg
pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
the SC graceful cancel cascade entirely. Codified
the real fix — SIGINT-first ladder w/ bounded
wait before SIGKILL — in e5e2afb5 (`run-tests`
SKILL) and
`feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
hangs indefinitely** — the actual backend bug,
and it's a deadlock not a leak.
Deats,
- new diagnosis: all 5 procs are kernel-`S` in
`do_epoll_wait`; pytest-main's trio-cache workers
are in `os.waitpid` waiting for children that are
themselves waiting on IPC that never arrives —
graceful `Portal.cancel_actor` cascade never
reaches its targets
- tree-structure evidence: asymmetric depth across
two identical `run_in_actor` calls — child 1
(3 threads) spawns both its grandchildren; child 2
(1 thread) never completes its first nursery
`run_in_actor`. Smells like a race on fork-
inherited state landing differently per spawn
ordering
- new hypothesis: `os.fork()` from a subactor
inherits the ROOT parent's IPC listener FDs
transitively. Grandchildren end up with three
overlapping FD sets (own + direct-parent + root),
so IPC routing becomes ambiguous. Predicts the
bug scales with fork depth — matches reality:
single-level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
reaches hard-kill path), `:1616` contention (fixed
by `reg_addr` fixture wiring), GIL starvation
(each subactor has its own OS process+GIL),
child-side KBI absorption (`_trio_main` only
catches KBI at `trio.run()` callsite, reached
only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
`closerange()`, (2) `FD_CLOEXEC` + audit,
(3) targeted FD cleanup via `actor.ipc_server`
handle, (4) `os.posix_spawn` w/ `file_actions`.
Vote: (3) — surgical, doesn't break the "no exec"
design of `subint_forkserver`. The blunt direction
(1) is sketched at the end of this msg for
contrast
- standalone repro added (`spawn_and_error(breadth=
2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
level-spawn tests under the backend via
`@pytest.mark.skipon_spawn_backend(...)` until
fix lands
Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.
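For contrast, the blunt direction (1) fits in a
few lines (hypothetical helper sketching direction
(1) only; the voted direction (3) is the more
surgical cleanup):

```python
import os

def _scrub_all_inherited_fds() -> None:
    # fork-child prelude: close everything above stdio
    # so transitively inherited IPC listener fds
    # (direct-parent's + root's) can't reach
    # grandchildren; blunt but depth-proof
    os.closerange(3, os.sysconf('SC_OPEN_MAX'))
```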
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.
Deats,
- TL;DR + ruled-out checks confirming the procs are
ours (not piker / other tractor-embedding apps) —
`/proc/$pid/cmdline` + cwd both resolve to this
repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
(plain `os.kill(SIGKILL)` to the direct child),
not tree-scoped — grandchildren spawned during a
multi-level cancel test get reparented to init and
inherit the registry listen socket
- proposed fix directions ranked: (1) put each
forkserver-spawned subactor in its own process-
group (`os.setpgrp()` in fork-child) + tree-kill
via `os.killpg(pgid, SIGKILL)` on teardown,
(2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
`/proc/<pid>/task/*/children` walk. Vote: (1) —
POSIX-standard, aligns w/ `start_new_session=True`
semantics in `subprocess.Popen` / trio's
`open_process`; kill side sketched at the end of
this msg
- inline reproducer + cleanup recipe scoped to
`$(pwd)/py314/bin/python.*pytest.*spawn-backend=
subint_forkserver` so cleanup doesn't false-flag
unrelated tractor procs (consistent w/
`run-tests` skill's zombie-check guidance)
Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.
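Direction (1)'s kill side in sketch form
(hypothetical helper; assumes the fork-child calls
`os.setpgrp()` first thing post-fork so its pid
doubles as its pgid):

```python
import os
import signal

def tree_kill(child_pid: int) -> None:
    # one killpg reaps the direct child AND any
    # grandchildren it spawned, even ones already
    # reparented to init, since they stay in the pgroup
    try:
        os.killpg(child_pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # whole group already exited
```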
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:
- `threading.current_thread() is threading.main_thread()`
IS True post-fork — CPython re-designates the
fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread
But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — is never reached.
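The wakeup-fd path that isn't firing, as a standalone
demo (illustrative; trio's real plumbing in
`_io_epoll.py` differs):

```python
import os
import select
import signal

r, w = os.pipe()
os.set_blocking(w, False)       # set_wakeup_fd requires non-blocking
signal.set_wakeup_fd(w)
signal.signal(signal.SIGINT, lambda *a: None)  # py-level handler
os.kill(os.getpid(), signal.SIGINT)
# healthy path: the C-level handler wrote the signal number to `w`,
# waking any selector parked on `r`; that's exactly the wake the
# wedged subactor never gets inside epoll_wait
ready, _, _ = select.select([r], [], [], 1.0)
assert ready and os.read(r, 1) == bytes([signal.SIGINT])
```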
Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
(+385 LOC): full writeup — TL;DR, symptom reproducer,
the "intentional cancel path" the bug defeats,
diagnostic evidence (`faulthandler` output +
`getsignal` probe), ruled-out hypotheses
(non-main-thread issue, wakeup-fd inheritance,
KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
`reason` + test docstring rewritten to match the
refined understanding — old wording blamed the
non-main-thread path, new wording points at the
`epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
`child_sigint='trio'` bullet updated: now notes trio's
handler is already correctly installed, so the flag may
end up a no-op / doc-only mode once the real root cause
is fixed
Closing the gap restores existing design intent (making
the already-designed "KBI-as-OS-cancel" behavior actually
fire); it's a bug fix, not a new feature.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.
Deats,
- three reasons enumerated for the current constraints:
- class-A GIL-starvation — **fixed** by isolated mode:
subints don't share main's GIL so abandoned-thread
contention disappears
- destroy race / tstate-recycling from `subint_proc` —
**unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
cross-mode, so isolated doesn't obviously fix it;
needs empirical retest on py3.14 + isolated API
- fork-from-main-interp-tstate (the CPython-level
`_PyInterpreterState_DeleteExceptMain` gate) — the
narrow reason for using a dedicated thread; **probably
fixed** IF the destroy-race also resolves
(because trio's cache threads never drove subints
→ a clean main-interp tstate)
- TL;DR table of which constraints unwind under each
resolution branch
- four-step audit plan for when `msgspec#563` lands:
- flip `_subint` to isolated mode
- empirical destroy-race retest
- audit `_subint_forkserver.py` — drop `non_trio`
qualifier / maybe inline primitives
- doc fallout — close the three `subint_*_issue.md`
siblings w/ post-mortem notes
Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly (rough shape sketched after this
list)
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
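Rough shape of the first test (hedged sketch; the
real body + helper signatures live in the repo):

```python
import pytest
import trio

from tractor.spawn._subint_forkserver import (
    fork_from_worker_thread,
    wait_child,
)

@pytest.mark.timeout(30, method='thread')  # GIL-hostage safety net
def test_fork_from_worker_thread_via_trio():
    def child_target() -> int:
        return 0  # trivial child payload

    async def main():
        # blocking fork + waitpid prims both off-loaded
        # to trio worker threads
        pid = await trio.to_thread.run_sync(
            fork_from_worker_thread, child_target, 'test-forker',
        )
        await trio.to_thread.run_sync(wait_child, pid, True)

    trio.run(main)
```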
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.
Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
- `fork_from_worker_thread(child_target, thread_name)` —
spawn a main-interp `threading.Thread`, call `os.fork()`
from it, shuttle the child pid back to main via a pipe
(core shape sketched after this list)
- `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
create a fresh subint + drive `_interpreters.exec()` on
a dedicated worker thread running the `bootstrap` str
(typically imports `trio`, defines an async entry, calls
`trio.run()`)
- `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
pass/fail classification reusable from harness AND the
eventual real spawn path
- feature-gated py3.14+ via the public
`concurrent.interpreters` presence check; matches the gate
in `tractor.spawn._subint`
- module docstring documents the CPython-block context
(cross-refs `_subint_fork` stub + the two `conc-anal/`
docs) and status: EXPERIMENTAL, not yet registered in
`_spawn._methods`
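Core shape of the first primitive, hedged (the real
module adds error handling, logging and the subint
handoff):

```python
import os
import threading

def fork_from_worker_thread(child_target, thread_name: str) -> int:
    # fork from a dedicated main-interp thread (NOT a trio
    # cache thread, keeping the main-interp tstate clean per
    # the CPython post-fork gate) + shuttle the pid back
    # over a pipe
    r, w = os.pipe()

    def _forker() -> None:
        pid = os.fork()
        if pid == 0:  # child: only this thread survives
            os.close(r)
            os._exit(child_target())
        os.write(w, pid.to_bytes(4, 'little'))

    t = threading.Thread(target=_forker, name=thread_name)
    t.start()
    pid = int.from_bytes(os.read(r, 4), 'little')
    t.join()
    os.close(r)
    os.close(w)
    return pid
```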
Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).
Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.
Deats,
- four scenarios via `--scenario`:
- `control_subint_thread_fork` — the KNOWN-BROKEN case as a
harness sanity; if the child DOESN'T abort, our analysis
is wrong
- `main_thread_fork` — baseline sanity, must always succeed
- `worker_thread_fork` — architectural assertion: regular
`threading.Thread` attached to main interp calls
`os.fork()`; child should survive post-fork cleanup
- `full_architecture` — end-to-end: fork from a main-interp
worker thread, then in child create a subint driving a
worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
"child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner;
the parent uses `os.waitpid()` + per-scenario status
prints to observe the child's fate
Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).
Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical finding: the WIP `subint_fork_proc` scaffold
landed in `cf0e3e6f` does *not* work on current CPython.
The `fork()` syscall succeeds in the parent, but the
CHILD aborts immediately during
`PyOS_AfterFork_Child()` →
`_PyInterpreterState_DeleteExceptMain()`, which gates
on the current tstate belonging to the main interp —
the child dies with `Fatal Python error: not main
interpreter`.
CPython devs acknowledge the fragility with an in-source
comment (`// Ideally we could guarantee tstate is running
main.`) but expose no user-facing hook to satisfy the
precondition — so the strategy is structurally dead until
upstream changes.
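For contrast, py3.14's isolated-mode subints (the
PEP 734 default) refuse at the `fork()` call itself
instead of aborting the child; a quick probe
(assumes the public `concurrent.interpreters` API;
the scaffold itself used legacy-mode `_interpreters`):

```python
from concurrent import interpreters  # py3.14+

interp = interpreters.create()  # isolated-mode
try:
    interp.exec('import os; os.fork()')
except interpreters.ExecutionFailed as exc:
    # underlying RuntimeError from the Py_RTFLAGS_FORK
    # gate in posixmodule.c
    print(exc)
```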
Rather than delete the scaffold, reshape it into a
documented dead-end so the next person with this idea
lands on the reason rather than rediscovering the same
CPython-level refusal.
Deats,
- Move `subint_fork_proc` out of `tractor.spawn._subint`
into a new `tractor.spawn._subint_fork` dedicated
module (153 LOC). Module + fn docstrings now describe
the blockage directly; the fn body is trimmed to a
`NotImplementedError` pointing at the analysis doc —
no more dead-code `bootstrap` sketch bloating
`_subint.py`.
- `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey`
+ the `_methods` dispatch so
`--spawn-backend=subint_fork` routes to a clean
`NotImplementedError` rather than "invalid backend";
comment calls out the blockage. Collapse the duplicate
py3.14 feature-gate in `try_set_start_method()` into a
combined `case 'subint' | 'subint_fork':` arm.
- New 337-line analysis:
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Annotated walkthrough from the user-visible fatal
error down to the specific `Modules/posixmodule.c` +
`Python/pystate.c` source lines enforcing the refusal,
plus an upstream-report draft.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.
Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
the grandchild persists in its abandoned driver thread. Often only
manifests under full-suite runs (earlier tests seed the
abandoned-thread pool).
- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
all go through teardown on the multierror. Bounded hard-kills run in
parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
abandoned driver threads simultaneously alive, an extreme pressure
multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
(GIL round-robin IS giving main brief slices) before finally
saturating with `EAGAIN`.
Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>]}` threads at the
moment of hang — the direct GIL-contender population.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.
Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
+ shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
→ `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on
`msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
an orphaned IPC channel after subint teardown — no clean EOF delivered
to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
to fix. Candidate fix: explicit parent-side channel abort in
`subint_proc`'s hard-kill teardown
Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
`trio.run(main)` in `dump_on_hang(seconds=20)` (likely shape
sketched below) so a future regression captures a stack dump.
Kept un-skipped so the dump file is inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
docstring with a "KNOWN ISSUE" block pointing at the analysis
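Likely shape of the `dump_on_hang` wrapper (hedged
sketch; the real helper may differ):

```python
import faulthandler
from contextlib import contextmanager

@contextmanager
def dump_on_hang(seconds: float = 20):
    # arm a watchdog: if the wrapped call outlives
    # `seconds`, dump every thread's stack to stderr
    faulthandler.dump_traceback_later(seconds, exit=False)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
```

Usage per the bullet above:
`with dump_on_hang(20): trio.run(main)`.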
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code