tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	eceed29d4a	Pin forkserver hang to pytest `--capture=fd` Sixth and final diagnostic pass — after all 4 cascade fixes landed (FD hygiene, pidfd wait, `_parent_chan_cs` wiring, bounded peer-clear), the actual last gate on `test_nested_multierrors[subint_forkserver]` turned out to be pytest's default `--capture=fd` stdout/stderr capture, not anything in the runtime cascade. Empirical result: `pytest -s` → test PASSES in 6.20s. Default `--capture=fd` → hangs forever. Mechanism: pytest replaces the parent's fds 1,2 with pipe write-ends it reads from. Fork children inherit those pipes (since `_close_inherited_fds` correctly preserves stdio). The error-propagation cascade in a multi-level cancel test generates 7+ actors each logging multiple `RemoteActorError` / `ExceptionGroup` tracebacks — enough output to fill Linux's 64KB pipe buffer. Writes block, subactors can't progress, processes don't exit, `_ForkedProc.wait` hangs. Self-critical aside: I earlier tested w/ and w/o `-s` and both hung, concluding "capture-pipe ruled out". That was wrong — at that time fixes 1-4 weren't all in place, so the test was failing at deeper levels long before reaching the "produce lots of output" phase. Once the cascade could actually tear down cleanly, enough output flowed to hit the pipe limit. Order-of- operations mistake: ruling something out based on a test that was failing for a different reason. Deats, - `subint_forkserver_test_cancellation_leak_issue .md`: new section "Update — VERY late: pytest capture pipe IS the final gate" w/ DIAG timeline showing `trio.run` fully returns, diagnosis of pipe-fill mechanism, retrospective on the earlier wrong ruling-out, and fix direction (redirect subactor stdout/stderr to `/dev/null` in fork-child prelude, conditional on pytest-detection or opt-in flag) - `tests/test_cancellation.py`: skip-mark reason rewritten to describe the capture-pipe gate specifically; cross-refs the new doc section - `tests/spawn/test_subint_forkserver.py`: the orphan-SIGINT test regresses back to xfail. Previously passed after the FD-hygiene fix, but the new `wait_for_no_more_peers( move_on_after=3.0)` bound in `async_main`'s teardown added up to 3s latency, pushing orphan-subactor exit past the test's 10s poll window. Real fix: faster orphan-side teardown OR extend poll window to 15s No runtime code changes in this commit — just test-mark adjustments + doc wrap-up. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 23:18:14 -04:00
Gud Boi	e312a68d8a	Bound peer-clear wait in `async_main` finally Fifth diagnostic pass pinpointed the hang to `async_main`'s finally block — every stuck actor reaches `FINALLY ENTER` but never `RETURNING`. Specifically `await ipc_server.wait_for_no_more_ peers()` never returns when a peer-channel handler is stuck: the `_no_more_peers` Event is set only when `server._peers` empties, and stuck handlers keep their channels registered. Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout that records the still- connected peer count. 3s is enough for any graceful cancel-ack round-trip; beyond that we're in bug territory and need to proceed with local teardown so the parent's `_ForkedProc.wait()` can unblock. Defensive-in-depth regardless of the underlying bug — a local finally shouldn't block on remote cooperation forever. Verified: with this fix, ALL 15 actors reach `async_main: RETURNING` (up from 10/15 before). Test still hangs past 45s though — there's at least one MORE unbounded wait downstream of `async_main`. Candidates enumerated in the doc update (`open_root_actor` finally / `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps` finally). Skip-mark stays on `test_nested_multierrors[subint_forkserver]`. Also updates `subint_forkserver_test_cancellation_leak_issue.md` with the new pinpoint + summary of the 6-item investigation win list: 1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed 2. pidfd-based `_ForkedProc.wait` — cancellable 3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable 4. `wait_for_no_more_peers` bound — THIS commit 5. Ruled-out hypotheses: tree-kill missing, stuck socket recv, capture-pipe fill (all wrong) 6. Remaining unknown: at least one more unbounded wait in the teardown cascade above `async_main` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 22:34:49 -04:00
Gud Boi	4d0555435b	Narrow forkserver hang to `async_main` outer tn Fourth diagnostic pass — instrument `_worker`'s fork-child branch (`pre child_target()` / `child_ target RETURNED rc=N` / `about to os._exit(rc)`) and `_trio_main` boundaries (`about to trio.run` / `trio.run RETURNED NORMALLY` / `FINALLY`). Test config: depth=1/breadth=2 = 1 root + 14 forked = 15 actors total. Fresh-run results, - 9 processes complete the full flow: `trio.run RETURNED NORMALLY` → `child_target RETURNED rc=0` → `os._exit(0)`. These are tree LEAVES (errorers) plus their direct parents (depth-0 spawners) — they actually exit - 5 processes stuck INSIDE `trio.run(trio_ main)`: hit "about to trio.run" but never see "trio.run RETURNED NORMALLY". These are root + top-level spawners + one intermediate The deadlock is in `async_main` itself, NOT the peer-channel loops. Specifically, the outer `async with root_tn:` in `async_main` never exits for the 5 stuck actors, so the cascade wedges: trio.run never returns → _trio_main finally never runs → _worker never reaches os._exit(rc) → process never dies → parent's _ForkedProc.wait() blocks → parent's nursery hangs → parent's async_main hangs → (recurse up) The precise new question: what task in the 5 stuck actors' `async_main` never completes? Candidates: 1. shielded parent-chan `process_messages` task in `root_tn` — but we cancel it via `_parent_chan_cs.cancel()` in `Actor.cancel()`, which only runs during `open_root_actor.__aexit__`, which itself runs only after `async_main`'s outer unwind — which doesn't happen. So the shield isn't broken in this path. 2. `actor_nursery._join_procs.wait()` or similar inline in the backend `*_proc` flow. 3. `_ForkedProc.wait()` on a grandchild that DID exit — but pidfd_open watch didn't fire (race between `pidfd_open` and the child exiting?). Most specific next probe: add DIAG around `_ForkedProc.wait()` enter/exit to see whether pidfd-based wait returns for every grandchild exit. If a stuck parent's `_ForkedProc.wait()` never returns despite its child exiting → pidfd mechanism has a race bug under nested forkserver. Asymmetry observed in the cascade tree: some d=0 spawners exit cleanly, others stick, even though they started identically. Not purely depth- determined — some race condition in nursery teardown when multiple siblings error simultaneously. No code changes — diagnosis-only. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 21:36:19 -04:00
Gud Boi	ab86f7613d	Refine `subint_forkserver` cancel-cascade diag Third diagnostic pass on `test_nested_multierrors[subint_forkserver]` hang. Two prior hypotheses ruled out + a new, more specific deadlock shape identified. Ruled out, - capture-pipe fill (`-s` flag changes test): retested explicitly — `test_nested_multierrors` hangs identically with and without `-s`. The earlier observation was likely a competing pytest process I had running in another session holding registry state - stuck peer-chan recv that cancel can't break: pivot from the prior pass. With `handle_stream_from_peer` instrumented at ENTER / `except trio.Cancelled:` / finally: 40 ENTERs, ZERO `trio.Cancelled` hits. Cancel never reaches those tasks at all — the recvs are fine, nothing is telling them to stop Actual deadlock shape: multi-level mutual wait. root blocks on spawner.wait() spawner blocks on grandchild.wait() grandchild blocks on errorer.wait() errorer Actor.cancel() ran, but proc never exits `Actor.cancel()` fired in 12 PIDs — but NOT in root + 2 direct spawners. Those 3 have peer handlers stuck because their own `Actor.cancel()` never runs, which only runs when the enclosing `tractor.open_nursery()` exits, which waits on `_ForkedProc.wait()` for the child pidfd to signal, which only signals when the child process fully exits. Refined question: why does an errorer process not exit after its `Actor.cancel()` completes? Three hypotheses (unverified): 1. `_parent_chan_cs.cancel()` fires but the shielded loop's recv is stuck in a way cancel still can't break 2. `async_main`'s post-cancel unwind has other tasks in `root_tn` awaiting something that never arrives (e.g. outbound IPC reply) 3. `os._exit(rc)` in `_worker` never runs because `_child_target` never returns Next-session probes (priority order): 1. instrument `_worker`'s fork-child branch — confirm whether `child_target()` returns / `os._exit(rc)` is reached for errorer PIDs 2. instrument `async_main`'s final unwind — see which await in teardown doesn't complete 3. compare under `trio_proc` backend at the equivalent level to spot divergence No code changes — diagnosis-only. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 21:23:11 -04:00
Gud Boi	7cd47ef7fb	Doc ruled-out fix + capture-pipe aside Two new sections in `subint_forkserver_test_cancellation_leak_issue.md` documenting continued investigation of the `test_nested_multierrors[subint_forkserver]` peer- channel-loop hang: 1. "Attempted fix (DID NOT work) — hypothesis (3)": tried sync-closing peer channels' raw socket fds from `_serve_ipc_eps`'s finally block (iterate `server._peers`, `_chan._transport. stream.socket.close()`). Theory was that sync close would propagate as `EBADF` / `ClosedResourceError` into the stuck `recv_some()` and unblock it. Result: identical hang. Either trio holds an internal fd reference that survives external close, or the stuck recv isn't even the root blocker. Either way: ruled out, experiment reverted, skip-mark restored. 2. "Aside: `-s` flag changes behavior for peer- intensive tests": noticed `test_context_stream_semantics.py` under `subint_forkserver` hangs with default `--capture=fd` but passes with `-s` (`--capture=no`). Working hypothesis: subactors inherit pytest's capture pipe (fds 1,2 — which `_close_inherited_fds` deliberately preserves); verbose subactor logging fills the buffer, writes block, deadlock. Fix direction (if confirmed): redirect subactor stdout/stderr to `/dev/null` or a file in `_actor_child_main`. Not a blocker on the main investigation; deserves its own mini-tracker. Both sections are diagnosis-only — no code changes in this commit. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	506617c695	Skip-mark + narrow `subint_forkserver` cancel hang Two-part stopgap for the still-hanging `test_nested_multierrors[subint_forkserver]`: 1. Skip-mark the test via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The reason string cross-refs the conc-anal doc for full context. 2. Update the conc-anal doc (`subint_forkserver_test_cancellation_leak_issue.md`) with the empirical state after the three nested- cancel fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan shield break) landed, narrowing the remaining hang from "everything broken" to "peer-channel loops don't exit on `service_tn` cancel". Deats from the DIAGDEBUG instrumentation pass, - 80 `process_messages` ENTERs, 75 EXITs → 5 stuck - ALL 40 `shield=True` ENTERs matched EXIT — the `_parent_chan_cs.cancel()` wiring from `57935804` works as intended for shielded loops. - the 5 stuck loops are all `shield=False` peer- channel handlers in `handle_stream_from_peer` (inbound connections handled by `stream_handler_tn`, which IS `service_tn` in the current config). - after `_parent_chan_cs.cancel()` fires, NEW shielded loops appear on the session reg_addr port — probably discovery-layer reconnection; doesn't block teardown but indicates the cascade has more moving parts than expected. The remaining unknown: why don't the 5 peer-channel loops exit when `service_tn.cancel_scope.cancel()` fires? They're not shielded, they're inside the service_tn scope, a standard cancel should propagate through. Some fork-config-specific divergence keeps them alive. Doc lists three follow-up experiments (stackscope dump, side-by-side `trio_proc` comparison, audit of the `tractor/ipc/_server.py:448` `except trio.Cancelled:` path). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	35da808905	Refine `subint_forkserver` nested-cancel hang diagnosis Major rewrite of `subint_forkserver_test_cancellation_leak_issue.md` after empirical investigation revealed the earlier "descendant-leak + missing tree-kill" diagnosis conflated two unrelated symptoms: 1. 5-zombie leak holding `:1616` — turned out to be a self-inflicted cleanup bug: `pkill`-ing a bg pytest task (SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel cascade entirely. Codified the real fix — SIGINT-first ladder w/ bounded wait before SIGKILL — in `e5e2afb5` (`run-tests` SKILL) and `feedback_sc_graceful_cancel_first.md`. 2. `test_nested_multierrors[subint_forkserver]` hangs indefinitely — the actual backend bug, and it's a deadlock not a leak. Deats, - new diagnosis: all 5 procs are kernel-`S` in `do_epoll_wait`; pytest-main's trio-cache workers are in `os.waitpid` waiting for children that are themselves waiting on IPC that never arrives — graceful `Portal.cancel_actor` cascade never reaches its targets - tree-structure evidence: asymmetric depth across two identical `run_in_actor` calls — child 1 (3 threads) spawns both its grandchildren; child 2 (1 thread) never completes its first nursery `run_in_actor`. Smells like a race on fork- inherited state landing differently per spawn ordering - new hypothesis: `os.fork()` from a subactor inherits the ROOT parent's IPC listener FDs transitively. Grandchildren end up with three overlapping FD sets (own + direct-parent + root), so IPC routing becomes ambiguous. Predicts bug scales with fork depth — matches reality: single- level spawn works, multi-level hangs - ruled out: `_ForkedProc.kill()` tree-kill (never reaches hard-kill path), `:1616` contention (fixed by `reg_addr` fixture wiring), GIL starvation (each subactor has its own OS process+GIL), child-side KBI absorption (`_trio_main` only catches KBI at `trio.run()` callsite, reached only on trio-loop exit) - four fix directions ranked: (1) blanket post-fork `closerange()`, (2) `FD_CLOEXEC` + audit, (3) targeted FD cleanup via `actor.ipc_server` handle, (4) `os.posix_spawn` w/ `file_actions`. Vote: (3) — surgical, doesn't break the "no exec" design of `subint_forkserver` - standalone repro added (`spawn_and_error(breadth= 2, depth=1)` under `trio.fail_after(20)`) - stopgap: skip `test_nested_multierrors` + multi- level-spawn tests under the backend via `@pytest.mark.skipon_spawn_backend(...)` until fix lands Killing the "tree-kill descendants" fix-direction section: it addressed a bug that didn't exist. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	e3f4f5a387	Add `subint_forkserver` test-cancellation leak doc New `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks exactly 5 `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr) — and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps) — `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped — grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process- group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc/<pid>/task//children` walk. Vote: (1) — POSIX-standard, aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.pytest.*spawn-backend= subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up — that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	a72deef709	Refine `subint_forkserver` orphan-SIGINT diagnosis Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is not "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork — CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` — the runtime's intentional "KBI-as-OS-cancel" path — never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup — TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding — old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	cf2e71d87f	Add `subint_forkserver` PEP 684 audit-plan doc Follow-up tracker companion to the module-docstring TODO added in `372a0f32`. Catalogs why `_subint_forkserver`'s two "non-trio thread" constraints (`fork_from_worker_thread()` + `run_subint_in_worker_thread()` both allocating dedicated `threading.Thread`s; test helper named `run_fork_in_non_trio_thread`) exist today, and which of them would dissolve once msgspec PEP 684 support ships (`msgspec#563`) and tractor flips to isolated-mode subints. Deats, - three reasons enumerated for the current constraints: - class-A GIL-starvation — fixed by isolated mode: subints don't share main's GIL so abandoned-thread contention disappears - destroy race / tstate-recycling from `subint_proc` — unclear: `_PyXI_Enter` + `_PyXI_Exit` are cross-mode, so isolated doesn't obviously fix it; needs empirical retest on py3.14 + isolated API - fork-from-main-interp-tstate (the CPython-level `_PyInterpreterState_DeleteExceptMain` gate) — the narrow reason for using a dedicated thread; probably fixed IF the destroy-race also resolves (bc trio's cache threads never drove subints → clean main-interp tstate) - TL;DR table of which constraints unwind under each resolution branch - four-step audit plan for when `msgspec#563` lands: - flip `_subint` to isolated mode - empirical destroy-race retest - audit `_subint_forkserver.py` — drop `non_trio` qualifier / maybe inline primitives - doc fallout — close the three `subint_*_issue.md` siblings w/ post-mortem notes Also, cross-refs the three sibling `conc-anal/` docs, PEPs 684 + 734, `msgspec#563`, and `tractor#379` (the overall subint spawn-backend tracking issue). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	25e400d526	Add trio-parent tests for `_subint_forkserver` New pytest module `tests/spawn/test_subint_forkserver.py` drives the forkserver primitives from inside a real `trio.run()` in the parent — the runtime shape tractor will actually use when we wire up a `subint_forkserver` spawn backend proper. Complements the standalone no-trio-in-parent `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`. Deats, - new test pkg `tests/spawn/` (+ empty `__init__.py`) - two tests, both `@pytest.mark.timeout(30, method='thread')` for the GIL-hostage safety reason doc'd in `ai/conc-anal/subint_sigint_starvation_issue.md`: - `test_fork_from_worker_thread_via_trio` — parent-side plumbing baseline. `trio.run()` off-loads forkserver prims via `trio.to_thread.run_sync()` + asserts the child reaps cleanly - `test_fork_and_run_trio_in_child` — end-to-end: forked child calls `run_subint_in_worker_thread()` with a bootstrap str that does `trio.run()` in a fresh subint - both tests wrap the inner `trio.run()` in a `dump_on_hang()` for post-mortem if the outer `pytest-timeout` fires - intentionally NOT using `--spawn-backend` — the tests drive the primitives directly rather than going through tractor's spawn-method registry (which the forkserver isn't plugged into yet) Also, rename `run_trio_in_subint()` → `run_subint_in_worker_thread()` for naming consistency with the sibling `fork_from_worker_thread()`. The action is really "host a subint on a worker thread", not specifically "run trio" — trio just happens to be the typical payload. Propagate the rename to the smoketest. Further, add a "TODO — cleanup gated on msgspec PEP 684 support" section to the `_subint_forkserver` module docstring: flags the dedicated-`threading.Thread` design as potentially-revisable once isolated-mode subints are viable in tractor. Cross-refs `msgspec#563` + `tractor#379` and points at an audit-plan conc-anal doc we'll add next. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	82332fbceb	Lift fork prims into `_subint_forkserver` mod The smoketest (prior commit) empirically validated the "fork-from-main-interp-worker-thread" arch on py3.14. Promote the validated primitives out of the `ai/conc-anal/` smoketest into `tractor.spawn._subint_forkserver` so they can eventually be wired into a real "subint forkserver" spawn backend. Deats, - new module `tractor/spawn/_subint_forkserver.py` (337 LOC): - `fork_from_worker_thread(child_target, thread_name)` — spawn a main-interp `threading.Thread`, call `os.fork()` from it, shuttle the child pid back to main via a pipe - `run_trio_in_subint(bootstrap, ...)` — post-fork helper: create a fresh subint + drive `_interpreters.exec()` on a dedicated worker thread running the `bootstrap` str (typically imports `trio`, defines an async entry, calls `trio.run()`) - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` + pass/fail classification reusable from harness AND the eventual real spawn path - feature-gated py3.14+ via the public `concurrent.interpreters` presence check; matches the gate in `tractor.spawn._subint` - module docstring doc's the CPython-block context (cross-refs `_subint_fork` stub + the two `conc-anal/` docs) and status: EXPERIMENTAL, not yet registered in `_spawn._methods` Also, refactor the smoketest `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to import the primitives from the new module rather than inline its own copies. Keeps the smoketest and the tractor-side impl in sync as the forkserver design evolves; the smoketest remains a zero-`tractor`-runtime CPython-level check (imports ONLY the three primitives, no runtime bring-up). Status: next step is to drive these from a parent-side `trio.run()` and hook the returned child pid into the normal actor-nursery/IPC flow — then register `subint_forkserver` as a `SpawnMethodKey` in `_spawn.py`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	de4f470b6c	Add CPython-level `subint_fork` workaround smoketest Standalone script to validate the "main-interp worker-thread forkserver + subint-hosted trio" arch proposed as a workaround to the CPython-level refusal doc'd in `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Deliberately NOT a `tractor` test — zero `tractor` imports. Uses `_interpreters` (private stdlib) + `os.fork()` directly so pass/fail is a property of CPython alone, independent of our runtime. Requires py3.14+. Deats, - four scenarios via `--scenario`: - `control_subint_thread_fork` — the KNOWN-BROKEN case as a harness sanity; if the child DOESN'T abort, our analysis is wrong - `main_thread_fork` — baseline sanity, must always succeed - `worker_thread_fork` — architectural assertion: regular `threading.Thread` attached to main interp calls `os.fork()`; child should survive post-fork cleanup - `full_architecture` — end-to-end: fork from a main-interp worker thread, then in child create a subint driving a worker thread running `trio.run()` - exit code 0 on EXPECTED outcome (for `control_*` that means "child aborted", not "child succeeded") - each scenario prints a self-contained pass/fail banner; use `os.waitpid()` of the parent + per-scenario status prints to observe the child's fate Also, log NLNet provenance for this session's three-sub-phase work (py3.13 gate tightening, `pytest-timeout` + marker refactor, `subint_fork` prototype → CPython-block finding). Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	0f48ed2eb9	Doc `subint_fork` as blocked by CPython post-fork Empirical finding: the WIP `subint_fork_proc` scaffold landed in `cf0e3e6f` does not work on current CPython. The `fork()` syscall succeeds in the parent, but the CHILD aborts immediately during `PyOS_AfterFork_Child()` → `_PyInterpreterState_DeleteExceptMain()`, which gates on the current tstate belonging to the main interp — the child dies with `Fatal Python error: not main interpreter`. CPython devs acknowledge the fragility with an in-source comment (`// Ideally we could guarantee tstate is running main.`) but expose no user-facing hook to satisfy the precondition — so the strategy is structurally dead until upstream changes. Rather than delete the scaffold, reshape it into a documented dead-end so the next person with this idea lands on the reason rather than rediscovering the same CPython-level refusal. Deats, - Move `subint_fork_proc` out of `tractor.spawn._subint` into a new `tractor.spawn._subint_fork` dedicated module (153 LOC). Module + fn docstrings now describe the blockage directly; the fn body is trimmed to a `NotImplementedError` pointing at the analysis doc — no more dead-code `bootstrap` sketch bloating `_subint.py`. - `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey` + the `_methods` dispatch so `--spawn-backend=subint_fork` routes to a clean `NotImplementedError` rather than "invalid backend"; comment calls out the blockage. Collapse the duplicate py3.14 feature-gate in `try_set_start_method()` into a combined `case 'subint' \| 'subint_fork':` arm. - New 337-line analysis: `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Annotated walkthrough from the user-visible fatal error down to the specific `Modules/posixmodule.c` + `Python/pystate.c` source lines enforcing the refusal, plus an upstream-report draft. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:06 -04:00
Gud Boi	f3cea714bc	Expand `subint` sigint-starvation hang catalog Add two more tests to the catalog in `conc-anal/subint_sigint_starvation_issue.md` — same signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe → silent SIGINT drop), different load patterns. Deats, - `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so the grandchild persists in its abandoned driver thread. Often only manifests under full-suite runs (earlier tests seed the abandoned-thread pool). - `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors all go through teardown on the multierror. Bounded hard-kills run in parallel — so the total budget is ~3s, not 3s × 25. Leaves 25 abandoned driver threads simultaneously alive, an extreme pressure multiplier. `strace` shows several successful `write(16, "\2", 1) = 1` (GIL round-robin IS giving main brief slices) before finally saturating with `EAGAIN`. Also include a `pstree -snapt <pid>` capture showing 16+ live `{subint-driver[<interp_id>}` threads at the moment of hang — the direct GIL-contender population. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	a65fded4c6	Add prompt-io log for `subint` hang-class docs Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint` backend hang classes + arm `dump_on_hang`"). Substantive bc the two new `ai/conc-anal/` docs were jointly authored — user framed the two-class split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang; claude drafted the prose and the test-side cross-linking comments. `.raw.md` is in diff-ref mode — per-file pointers via `git diff e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that already lives in `git log -p`. Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	4a3254583b	Doc `subint` backend hang classes + arm `dump_on_hang` Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch. Deats, two new `ai/conc-anal/` docs, - `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563) - `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown — no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown Cross-link the docs from their test reproducers, - `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un- skipped so the dump file is inspectable - `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	c041518bdb	Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	b8f243e98d	Impl min-viable `subint` spawn backend (B.2) Replace the B.1 scaffold stub w/ a working spawn flow driving PEP 734 sub-interpreters on dedicated OS threads. Deats, - use private `_interpreters` C mod (not the public `concurrent.interpreters` API) to get `'legacy'` subint config — avoids PEP 684 C-ext compat issues w/ `msgspec` and other deps missing the `Py_mod_multiple_interpreters` slot - bootstrap subint via code-string calling new `_actor_child_main()` from `_child.py` (shared entry for both CLI and subint backends) - drive subint lifetime on an OS thread using `trio.to_thread.run_sync(_interpreters.exec, ..)` - full supervision lifecycle mirrors `trio_proc`: `ipc_server.wait_for_peer()` → send `SpawnSpec` → yield `Portal` via `task_status.started()` - graceful shutdown awaits the subint's inner `trio.run()` completing; cancel path sends `portal.cancel_actor()` then waits for thread join before `_interpreters.destroy()` Also, - extract `_actor_child_main()` from `_child.py` `__main__` block as callable entry shape bc the subint needs it for code-string bootstrap - add `"subint"` to the `_runtime.py` spawn-method check so child accepts `SpawnSpec` over IPC Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	a7b1ee34ef	Restore fn-arg `_runtime_vars` in `trio_proc` teardown During the Phase A extraction of `trio_proc()` out of `spawn._spawn` into its own submod, the `debug.maybe_wait_for_debugger(child_in_debug=...)` call site in the hard-reap `finally` got refactored from the original `_runtime_vars.get('_debug_mode', ...)` (the fn parameter — the dict that was constructed by the parent for the child's `SpawnSpec`) to `get_runtime_vars().get(...)` (a global getter that returns the parent's live `_state`). Those are semantically different — the first asks "is the child we just spawned in debug mode?", the second asks "are we in debug mode?". Under mixed-debug-mode trees the swap can incorrectly skip (or unnecessarily delay) the debugger-lock wait during teardown. Revert to the fn-parameter lookup and add an inline `NOTE` comment calling out the distinction so it's harder to regress again. Deats, - `spawn/_trio.py`: `child_in_debug=get_runtime_vars().get(...)` → `child_in_debug=_runtime_vars.get(...)` at the `debug.maybe_wait_for_debugger(...)` call in the hard-reap block; add 4-line `NOTE` explaining the parent-vs-child distinction. - `spawn/__init__.py`: drop trailing whitespace after the `'mp_forkserver'` docstring bullet. - `ai/prompt-io/prompts/subints_spawner.md`: drop duplicated `with` in `"as with with subprocs"` prose (copilot grammar catch). Review: PR #444 (Copilot) https://github.com/goodboy/tractor/pull/444#pullrequestreview-4165928469 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:30:11 -04:00
Gud Boi	e0b8f23cbc	Add prompt-io files for "phase-A", fix typos caught by copilot	2026-04-17 18:26:41 -04:00
Gud Boi	b5b0504918	Add prompt-IO log for subint spawner design kickoff Log the `claude-opus-4-7` design session that produced the phased plan (A: modularize `_spawn`, B: `_subint` backend, C: harness) and concrete Phase A file-split for #379. Substantive bc the plan directly drives upcoming impl. Prompt-IO: ai/prompt-io/claude/20260417T034918Z_9703210_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-17 16:48:22 -04:00
Gud Boi	de78a6445b	Initial prompt to vibe subint support Bo	2026-04-17 16:48:18 -04:00
Gud Boi	3152f423d8	Condense `.raw.md` prompt-IO logs, add `diff_cmd` refs Replace verbose inline code dumps in `.raw.md` entries with terse summaries and `git diff` cmd references. Add `diff_cmd` metadata to each entry's YAML frontmatter so readers can reproduce the actual output diff. Also, - rename `multiaddr_declare_eps.md_` -> `.md` (drop trailing `_` suffix) (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-16 17:44:14 -04:00
Gud Boi	ccb013a615	Add `prefer_addr()` transport selection to `_api` New locality-aware addr preference for multihomed actors: UDS > local TCP > remote TCP. Uses `ipaddress` + `socket.getaddrinfo()` to detect whether a `TCPAddress` is on the local host. Deats, - `_is_local_addr()` checks loopback or same-host IPs via interface enumeration - `prefer_addr()` classifies an addr list into three tiers and picks the latest entry from the highest-priority non-empty tier - `query_actor()` and `wait_for_actor()` now call `prefer_addr()` instead of grabbing `addrs[-1]` or a single pre-selected addr Also, - `Registrar.find_actor()` returns full `list[UnwrappedAddress]\|None` so callers can apply transport preference Prompt-IO: ai/prompt-io/claude/20260414T163300Z_befedc49_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:14 -04:00
Gud Boi	e90241baaa	Add `parse_endpoints()` to `_multiaddr` Provide a service-table parsing API for downstream projects (like `piker`) to declare per-actor transport bind addresses as a config map of actor-name -> multiaddr strings (e.g. from a TOML `[network]` section). Deats, - `EndpointsTable` type alias: input `dict[str, list[str\|tuple]]`. - `ParsedEndpoints` type alias: output `dict[str, list[Address]]`. - `parse_endpoints()` iterates the table and delegates each entry to the existing `tractor.discovery._discovery.wrap_address()` helper, which handles maddr strings, raw `(host, port)` tuples, and pre-wrapped `Address` objs. - UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's internal `/uds/` proto_key) Also add new tests, - 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts, unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`), empty table, empty actor list - all 22 multiaddr tests pass rn. Prompt-IO: ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:14 -04:00
Gud Boi	7079a597c5	Add `test_tpt_bind_addrs.py` + fix type-mixing bug Add 9 test variants (6 fns) covering all three `tpt_bind_addrs` code paths in `open_root_actor()`: - registrar w/ explicit bind (eq, subset, disjoint) - non-registrar w/ explicit bind (same/diff bindspace) using `daemon` fixture - non-registrar default random bind (baseline) - maddr string input parsing - registrar merge produces union - `open_nursery()` forwards `tpt_bind_addrs` Fix type-mixing bug at `_root.py:446` where the registrar merge path did `set(Address + tuple)`, preventing dedup and causing double-bind `OSError`. Wrap `uw_reg_addrs` before the set union so both sides are `Address` objs. Also, - add prompt-io output log for this session - stage original prompt input for tracking Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-14 19:54:14 -04:00
Gud Boi	cd1cd03725	Add prompt-io log for `run_ctx` teardown analysis Documents the diagnostic session tracing why per-`ctx_key` locking alone doesn't close the `_Cache.run_ctx` teardown race — the lock pops in the exiting caller's task but resource cleanup runs in the `run_ctx` task inside `service_tn`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-09 14:42:42 -04:00
Gud Boi	cab366cd65	Add xfail test for `_Cache.run_ctx` teardown race Reproduce the piker `open_cached_client('kraken')` scenario: identical `ctx_key` callers share one cached resource, and a new task re-enters during `__aexit__` — hitting `assert not resources.get()` bc `values` was popped but `resources` wasn't yet. Deats, - `test_moc_reentry_during_teardown` uses an `in_aexit` event to deterministically land in the teardown window. - marked `xfail(raises=AssertionError)` against unpatched code (fix in `9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter patch branch). Also, add prompt-io log for the session. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md	2026-04-06 18:17:04 -04:00
Gud Boi	85f9c5df6f	Add per-`ctx_key` isolation tests for `maybe_open_context()` Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user tracking correctly tears down resources independently - exercises the fix from 02b2ef18 where a global `_Cache.users` counter caused stale cache hits when the same `acm_func` was called with different kwargs. Also, add a paired `acm_with_resource()` helper `@acm` that yields its `resource_id` for per-key testing in the above suite. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md	2026-04-06 14:37:47 -04:00

30 Commits (d6e70e9de48ce564df19f76401c5664708257fef)