Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:
- `threading.current_thread() is threading.main_thread()`
IS True post-fork — CPython re-designates the
fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread
But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.
Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
(+385 LOC): full writeup — TL;DR, symptom reproducer,
the "intentional cancel path" the bug defeats,
diagnostic evidence (`faulthandler` output +
`getsignal` probe), ruled-out hypotheses
(non-main-thread issue, wakeup-fd inheritance,
KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
`reason` + test docstring rewritten to match the
refined understanding — old wording blamed the
non-main-thread path, new wording points at the
`epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
`child_sigint='trio'` bullet updated: now notes trio's
handler is already correctly installed, so the flag may
end up a no-op / doc-only mode once the real root cause
is fixed
Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.
Deats,
- three reasons enumerated for the current constraints:
- class-A GIL-starvation — **fixed** by isolated mode:
subints don't share main's GIL so abandoned-thread
contention disappears
- destroy race / tstate-recycling from `subint_proc` —
**unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
cross-mode, so isolated doesn't obviously fix it;
needs empirical retest on py3.14 + isolated API
- fork-from-main-interp-tstate (the CPython-level
`_PyInterpreterState_DeleteExceptMain` gate) — the
narrow reason for using a dedicated thread; **probably
fixed** IF the destroy-race also resolves (bc trio's
cache threads never drove subints → clean main-interp
tstate)
- TL;DR table of which constraints unwind under each
resolution branch
- four-step audit plan for when `msgspec#563` lands:
- flip `_subint` to isolated mode
- empirical destroy-race retest
- audit `_subint_forkserver.py` — drop `non_trio`
qualifier / maybe inline primitives
- doc fallout — close the three `subint_*_issue.md`
siblings w/ post-mortem notes
Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.
Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
- `fork_from_worker_thread(child_target, thread_name)` —
spawn a main-interp `threading.Thread`, call `os.fork()`
from it, shuttle the child pid back to main via a pipe
- `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
create a fresh subint + drive `_interpreters.exec()` on
a dedicated worker thread running the `bootstrap` str
(typically imports `trio`, defines an async entry, calls
`trio.run()`)
- `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
pass/fail classification reusable from harness AND the
eventual real spawn path
- feature-gated py3.14+ via the public
`concurrent.interpreters` presence check; matches the gate
in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
(cross-refs `_subint_fork` stub + the two `conc-anal/`
docs) and status: EXPERIMENTAL, not yet registered in
`_spawn._methods`
Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).
Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.
Deats,
- four scenarios via `--scenario`:
- `control_subint_thread_fork` — the KNOWN-BROKEN case as a
harness sanity; if the child DOESN'T abort, our analysis
is wrong
- `main_thread_fork` — baseline sanity, must always succeed
- `worker_thread_fork` — architectural assertion: regular
`threading.Thread` attached to main interp calls
`os.fork()`; child should survive post-fork cleanup
- `full_architecture` — end-to-end: fork from a main-interp
worker thread, then in child create a subint driving a
worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
"child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
`os.waitpid()` of the parent + per-scenario status prints to
observe the child's fate
Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).
Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical finding: the WIP `subint_fork_proc` scaffold
landed in `cf0e3e6f` does *not* work on current CPython.
The `fork()` syscall succeeds in the parent, but the
CHILD aborts immediately during
`PyOS_AfterFork_Child()` →
`_PyInterpreterState_DeleteExceptMain()`, which gates
on the current tstate belonging to the main interp —
the child dies with `Fatal Python error: not main
interpreter`.
CPython devs acknowledge the fragility with an in-source
comment (`// Ideally we could guarantee tstate is running
main.`) but expose no user-facing hook to satisfy the
precondition — so the strategy is structurally dead until
upstream changes.
Rather than delete the scaffold, reshape it into a
documented dead-end so the next person with this idea
lands on the reason rather than rediscovering the same
CPython-level refusal.
Deats,
- Move `subint_fork_proc` out of `tractor.spawn._subint`
into a new `tractor.spawn._subint_fork` dedicated
module (153 LOC). Module + fn docstrings now describe
the blockage directly; the fn body is trimmed to a
`NotImplementedError` pointing at the analysis doc —
no more dead-code `bootstrap` sketch bloating
`_subint.py`.
- `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey`
+ the `_methods` dispatch so
`--spawn-backend=subint_fork` routes to a clean
`NotImplementedError` rather than "invalid backend";
comment calls out the blockage. Collapse the duplicate
py3.14 feature-gate in `try_set_start_method()` into a
combined `case 'subint' | 'subint_fork':` arm.
- New 337-line analysis:
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Annotated walkthrough from the user-visible fatal
error down to the specific `Modules/posixmodule.c` +
`Python/pystate.c` source lines enforcing the refusal,
plus an upstream-report draft.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.
Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
the grandchild persists in its abandoned driver thread. Often only
manifests under full-suite runs (earlier tests seed the
abandoned-thread pool).
- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
all go through teardown on the multierror. Bounded hard-kills run in
parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
abandoned driver threads simultaneously alive, an extreme pressure
multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
(GIL round-robin IS giving main brief slices) before finally
saturating with `EAGAIN`.
Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.
`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.
Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.
Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
+ shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
→ `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
`msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
an orphaned IPC channel after subint teardown — no clean EOF delivered
to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
to fix. Candidate fix: explicit parent-side channel abort in
`subint_proc`'s hard-kill teardown
Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
`trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
captures a stack dump. Kept un- skipped so the dump file is
inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
docstring with a "KNOWN ISSUE" block pointing at the analysis
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` session that produced
the `_subint.py` dedicated-thread fix (`26fb8206`).
Substantive bc the patch was entirely AI-generated;
raw log also preserves the CPython-internals
research informing Phase B.3 hard-kill work.
Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace the B.1 scaffold stub w/ a working spawn
flow driving PEP 734 sub-interpreters on dedicated
OS threads.
Deats,
- use private `_interpreters` C mod (not the public
`concurrent.interpreters` API) to get `'legacy'`
subint config — avoids PEP 684 C-ext compat
issues w/ `msgspec` and other deps missing the
`Py_mod_multiple_interpreters` slot
- bootstrap subint via code-string calling new
`_actor_child_main()` from `_child.py` (shared
entry for both CLI and subint backends)
- drive subint lifetime on an OS thread using
`trio.to_thread.run_sync(_interpreters.exec, ..)`
- full supervision lifecycle mirrors `trio_proc`:
`ipc_server.wait_for_peer()` → send `SpawnSpec`
→ yield `Portal` via `task_status.started()`
- graceful shutdown awaits the subint's inner
`trio.run()` completing; cancel path sends
`portal.cancel_actor()` then waits for thread
join before `_interpreters.destroy()`
Also,
- extract `_actor_child_main()` from `_child.py`
`__main__` block as callable entry shape bc the
subint needs it for code-string bootstrap
- add `"subint"` to the `_runtime.py` spawn-method
check so child accepts `SpawnSpec` over IPC
Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` design session that produced the phased plan
(A: modularize `_spawn`, B: `_subint` backend, C: harness) and concrete
Phase A file-split for #379. Substantive bc the plan directly drives
upcoming impl.
Prompt-IO: ai/prompt-io/claude/20260417T034918Z_9703210_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace verbose inline code dumps in `.raw.md`
entries with terse summaries and `git diff`
cmd references. Add `diff_cmd` metadata to each
entry's YAML frontmatter so readers can reproduce
the actual output diff.
Also,
- rename `multiaddr_declare_eps.md_` -> `.md`
(drop trailing `_` suffix)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New locality-aware addr preference for multihomed
actors: UDS > local TCP > remote TCP. Uses
`ipaddress` + `socket.getaddrinfo()` to detect
whether a `TCPAddress` is on the local host.
Deats,
- `_is_local_addr()` checks loopback or
same-host IPs via interface enumeration
- `prefer_addr()` classifies an addr list into
three tiers and picks the latest entry from
the highest-priority non-empty tier
- `query_actor()` and `wait_for_actor()` now
call `prefer_addr()` instead of grabbing
`addrs[-1]` or a single pre-selected addr
Also,
- `Registrar.find_actor()` returns full
`list[UnwrappedAddress]|None` so callers can
apply transport preference
Prompt-IO: ai/prompt-io/claude/20260414T163300Z_befedc49_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Provide a service-table parsing API for downstream projects (like
`piker`) to declare per-actor transport bind addresses as a config map
of actor-name -> multiaddr strings (e.g. from a TOML `[network]`
section).
Deats,
- `EndpointsTable` type alias: input `dict[str, list[str|tuple]]`.
- `ParsedEndpoints` type alias: output `dict[str, list[Address]]`.
- `parse_endpoints()` iterates the table and delegates each entry to the
existing `tractor.discovery._discovery.wrap_address()` helper, which
handles maddr strings, raw `(host, port)` tuples, and pre-wrapped
`Address` objs.
- UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's
internal `/uds/` proto_key)
Also add new tests,
- 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts,
unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`),
empty table, empty actor list
- all 22 multiaddr tests pass rn.
Prompt-IO:
ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add 9 test variants (6 fns) covering all three
`tpt_bind_addrs` code paths in `open_root_actor()`:
- registrar w/ explicit bind (eq, subset, disjoint)
- non-registrar w/ explicit bind (same/diff
bindspace) using `daemon` fixture
- non-registrar default random bind (baseline)
- maddr string input parsing
- registrar merge produces union
- `open_nursery()` forwards `tpt_bind_addrs`
Fix type-mixing bug at `_root.py:446` where the
registrar merge path did `set(Address + tuple)`,
preventing dedup and causing double-bind `OSError`.
Wrap `uw_reg_addrs` before the set union so both
sides are `Address` objs.
Also,
- add prompt-io output log for this session
- stage original prompt input for tracking
Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock pops
in the exiting caller's task but resource cleanup
runs in the `run_ctx` task inside `service_tn`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
patch branch).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.
Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md