tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	eceed29d4a	Pin forkserver hang to pytest `--capture=fd` Sixth and final diagnostic pass — after all 4 cascade fixes landed (FD hygiene, pidfd wait, `_parent_chan_cs` wiring, bounded peer-clear), the actual last gate on `test_nested_multierrors[subint_forkserver]` turned out to be pytest's default `--capture=fd` stdout/stderr capture, not anything in the runtime cascade. Empirical result: `pytest -s` → test PASSES in 6.20s. Default `--capture=fd` → hangs forever. Mechanism: pytest replaces the parent's fds 1,2 with pipe write-ends it reads from. Fork children inherit those pipes (since `_close_inherited_fds` correctly preserves stdio). The error-propagation cascade in a multi-level cancel test generates 7+ actors each logging multiple `RemoteActorError` / `ExceptionGroup` tracebacks — enough output to fill Linux's 64KB pipe buffer. Writes block, subactors can't progress, processes don't exit, `_ForkedProc.wait` hangs. Self-critical aside: I earlier tested w/ and w/o `-s` and both hung, concluding "capture-pipe ruled out". That was wrong — at that time fixes 1-4 weren't all in place, so the test was failing at deeper levels long before reaching the "produce lots of output" phase. Once the cascade could actually tear down cleanly, enough output flowed to hit the pipe limit. Order-of- operations mistake: ruling something out based on a test that was failing for a different reason. Deats, - `subint_forkserver_test_cancellation_leak_issue .md`: new section "Update — VERY late: pytest capture pipe IS the final gate" w/ DIAG timeline showing `trio.run` fully returns, diagnosis of pipe-fill mechanism, retrospective on the earlier wrong ruling-out, and fix direction (redirect subactor stdout/stderr to `/dev/null` in fork-child prelude, conditional on pytest-detection or opt-in flag) - `tests/test_cancellation.py`: skip-mark reason rewritten to describe the capture-pipe gate specifically; cross-refs the new doc section - `tests/spawn/test_subint_forkserver.py`: the orphan-SIGINT test regresses back to xfail. Previously passed after the FD-hygiene fix, but the new `wait_for_no_more_peers( move_on_after=3.0)` bound in `async_main`'s teardown added up to 3s latency, pushing orphan-subactor exit past the test's 10s poll window. Real fix: faster orphan-side teardown OR extend poll window to 15s No runtime code changes in this commit — just test-mark adjustments + doc wrap-up. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 23:18:14 -04:00
Gud Boi	e312a68d8a	Bound peer-clear wait in `async_main` finally Fifth diagnostic pass pinpointed the hang to `async_main`'s finally block — every stuck actor reaches `FINALLY ENTER` but never `RETURNING`. Specifically `await ipc_server.wait_for_no_more_ peers()` never returns when a peer-channel handler is stuck: the `_no_more_peers` Event is set only when `server._peers` empties, and stuck handlers keep their channels registered. Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout that records the still- connected peer count. 3s is enough for any graceful cancel-ack round-trip; beyond that we're in bug territory and need to proceed with local teardown so the parent's `_ForkedProc.wait()` can unblock. Defensive-in-depth regardless of the underlying bug — a local finally shouldn't block on remote cooperation forever. Verified: with this fix, ALL 15 actors reach `async_main: RETURNING` (up from 10/15 before). Test still hangs past 45s though — there's at least one MORE unbounded wait downstream of `async_main`. Candidates enumerated in the doc update (`open_root_actor` finally / `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps` finally). Skip-mark stays on `test_nested_multierrors[subint_forkserver]`. Also updates `subint_forkserver_test_cancellation_leak_issue.md` with the new pinpoint + summary of the 6-item investigation win list: 1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed 2. pidfd-based `_ForkedProc.wait` — cancellable 3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable 4. `wait_for_no_more_peers` bound — THIS commit 5. Ruled-out hypotheses: tree-kill missing, stuck socket recv, capture-pipe fill (all wrong) 6. Remaining unknown: at least one more unbounded wait in the teardown cascade above `async_main` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 22:34:49 -04:00
Gud Boi	4d0555435b	Narrow forkserver hang to `async_main` outer tn Fourth diagnostic pass — instrument `_worker`'s fork-child branch (`pre child_target()` / `child_ target RETURNED rc=N` / `about to os._exit(rc)`) and `_trio_main` boundaries (`about to trio.run` / `trio.run RETURNED NORMALLY` / `FINALLY`). Test config: depth=1/breadth=2 = 1 root + 14 forked = 15 actors total. Fresh-run results, - 9 processes complete the full flow: `trio.run RETURNED NORMALLY` → `child_target RETURNED rc=0` → `os._exit(0)`. These are tree LEAVES (errorers) plus their direct parents (depth-0 spawners) — they actually exit - 5 processes stuck INSIDE `trio.run(trio_ main)`: hit "about to trio.run" but never see "trio.run RETURNED NORMALLY". These are root + top-level spawners + one intermediate The deadlock is in `async_main` itself, NOT the peer-channel loops. Specifically, the outer `async with root_tn:` in `async_main` never exits for the 5 stuck actors, so the cascade wedges: trio.run never returns → _trio_main finally never runs → _worker never reaches os._exit(rc) → process never dies → parent's _ForkedProc.wait() blocks → parent's nursery hangs → parent's async_main hangs → (recurse up) The precise new question: what task in the 5 stuck actors' `async_main` never completes? Candidates: 1. shielded parent-chan `process_messages` task in `root_tn` — but we cancel it via `_parent_chan_cs.cancel()` in `Actor.cancel()`, which only runs during `open_root_actor.__aexit__`, which itself runs only after `async_main`'s outer unwind — which doesn't happen. So the shield isn't broken in this path. 2. `actor_nursery._join_procs.wait()` or similar inline in the backend `*_proc` flow. 3. `_ForkedProc.wait()` on a grandchild that DID exit — but pidfd_open watch didn't fire (race between `pidfd_open` and the child exiting?). Most specific next probe: add DIAG around `_ForkedProc.wait()` enter/exit to see whether pidfd-based wait returns for every grandchild exit. If a stuck parent's `_ForkedProc.wait()` never returns despite its child exiting → pidfd mechanism has a race bug under nested forkserver. Asymmetry observed in the cascade tree: some d=0 spawners exit cleanly, others stick, even though they started identically. Not purely depth- determined — some race condition in nursery teardown when multiple siblings error simultaneously. No code changes — diagnosis-only. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 21:36:19 -04:00
Gud Boi	ab86f7613d	Refine `subint_forkserver` cancel-cascade diag Third diagnostic pass on `test_nested_multierrors[subint_forkserver]` hang. Two prior hypotheses ruled out + a new, more specific deadlock shape identified. Ruled out, - capture-pipe fill (`-s` flag changes test): retested explicitly — `test_nested_multierrors` hangs identically with and without `-s`. The earlier observation was likely a competing pytest process I had running in another session holding registry state - stuck peer-chan recv that cancel can't break: pivot from the prior pass. With `handle_stream_from_peer` instrumented at ENTER / `except trio.Cancelled:` / finally: 40 ENTERs, ZERO `trio.Cancelled` hits. Cancel never reaches those tasks at all — the recvs are fine, nothing is telling them to stop Actual deadlock shape: multi-level mutual wait. root blocks on spawner.wait() spawner blocks on grandchild.wait() grandchild blocks on errorer.wait() errorer Actor.cancel() ran, but proc never exits `Actor.cancel()` fired in 12 PIDs — but NOT in root + 2 direct spawners. Those 3 have peer handlers stuck because their own `Actor.cancel()` never runs, which only runs when the enclosing `tractor.open_nursery()` exits, which waits on `_ForkedProc.wait()` for the child pidfd to signal, which only signals when the child process fully exits. Refined question: why does an errorer process not exit after its `Actor.cancel()` completes? Three hypotheses (unverified): 1. `_parent_chan_cs.cancel()` fires but the shielded loop's recv is stuck in a way cancel still can't break 2. `async_main`'s post-cancel unwind has other tasks in `root_tn` awaiting something that never arrives (e.g. outbound IPC reply) 3. `os._exit(rc)` in `_worker` never runs because `_child_target` never returns Next-session probes (priority order): 1. instrument `_worker`'s fork-child branch — confirm whether `child_target()` returns / `os._exit(rc)` is reached for errorer PIDs 2. instrument `async_main`'s final unwind — see which await in teardown doesn't complete 3. compare under `trio_proc` backend at the equivalent level to spot divergence No code changes — diagnosis-only. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 21:23:11 -04:00
Gud Boi	458a35cf09	Surface silent failures in `_subint_forkserver` Three places that previously swallowed exceptions silently now log via `log.exception()` so they surface in the runtime log when something weird happens — easier to track down sneaky failures in the fork-from-worker-thread / subint-bootstrap primitives. Deats, - `_close_inherited_fds()`: post-fork child's per-fd `os.close()` swallow now logs the fd that failed to close. The comment notes the expected failure modes (already-closed-via-listdir-race, otherwise-unclosable) — both still fine to ignore semantically, but worth flagging in the log. - `fork_from_worker_thread()` parent-side timeout branch: the `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close failure separately before raising the `worker thread didn't return` RuntimeError. - `run_subint_in_worker_thread._drive()`: when `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`, log the full call signature (interp_id + bootstrap) along with the captured exception, before stashing into `err` for the outer caller. Behavior unchanged — only adds observability. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	7cd47ef7fb	Doc ruled-out fix + capture-pipe aside Two new sections in `subint_forkserver_test_cancellation_leak_issue.md` documenting continued investigation of the `test_nested_multierrors[subint_forkserver]` peer- channel-loop hang: 1. "Attempted fix (DID NOT work) — hypothesis (3)": tried sync-closing peer channels' raw socket fds from `_serve_ipc_eps`'s finally block (iterate `server._peers`, `_chan._transport. stream.socket.close()`). Theory was that sync close would propagate as `EBADF` / `ClosedResourceError` into the stuck `recv_some()` and unblock it. Result: identical hang. Either trio holds an internal fd reference that survives external close, or the stuck recv isn't even the root blocker. Either way: ruled out, experiment reverted, skip-mark restored. 2. "Aside: `-s` flag changes behavior for peer- intensive tests": noticed `test_context_stream_semantics.py` under `subint_forkserver` hangs with default `--capture=fd` but passes with `-s` (`--capture=no`). Working hypothesis: subactors inherit pytest's capture pipe (fds 1,2 — which `_close_inherited_fds` deliberately preserves); verbose subactor logging fills the buffer, writes block, deadlock. Fix direction (if confirmed): redirect subactor stdout/stderr to `/dev/null` or a file in `_actor_child_main`. Not a blocker on the main investigation; deserves its own mini-tracker. Both sections are diagnosis-only — no code changes in this commit. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	76d12060aa	Claude-perms: ensure /commit-msg files can be written!	2026-04-23 18:48:34 -04:00
Gud Boi	506617c695	Skip-mark + narrow `subint_forkserver` cancel hang Two-part stopgap for the still-hanging `test_nested_multierrors[subint_forkserver]`: 1. Skip-mark the test via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The reason string cross-refs the conc-anal doc for full context. 2. Update the conc-anal doc (`subint_forkserver_test_cancellation_leak_issue.md`) with the empirical state after the three nested- cancel fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan shield break) landed, narrowing the remaining hang from "everything broken" to "peer-channel loops don't exit on `service_tn` cancel". Deats from the DIAGDEBUG instrumentation pass, - 80 `process_messages` ENTERs, 75 EXITs → 5 stuck - ALL 40 `shield=True` ENTERs matched EXIT — the `_parent_chan_cs.cancel()` wiring from `57935804` works as intended for shielded loops. - the 5 stuck loops are all `shield=False` peer- channel handlers in `handle_stream_from_peer` (inbound connections handled by `stream_handler_tn`, which IS `service_tn` in the current config). - after `_parent_chan_cs.cancel()` fires, NEW shielded loops appear on the session reg_addr port — probably discovery-layer reconnection; doesn't block teardown but indicates the cascade has more moving parts than expected. The remaining unknown: why don't the 5 peer-channel loops exit when `service_tn.cancel_scope.cancel()` fires? They're not shielded, they're inside the service_tn scope, a standard cancel should propagate through. Some fork-config-specific divergence keeps them alive. Doc lists three follow-up experiments (stackscope dump, side-by-side `trio_proc` comparison, audit of the `tractor/ipc/_server.py:448` `except trio.Cancelled:` path). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	8ac3dfeb85	Break parent-chan shield during teardown Completes the nested-cancel deadlock fix started in `0cd0b633` (fork-child FD scrub) and `fe540d02` (pidfd- cancellable wait). The remaining piece: the parent- channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally. Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable — parent's teardown closes the handler-task socket deterministically. But fork- based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock. Deats, - `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs` - `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield — no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend - inline comments on both sites explain the mutual- wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	c20b05e181	Use `pidfd` for cancellable `_ForkedProc.wait` Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall — outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT` — the test now PASSES after `0cd0b633` (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent — the cancellable `wait()` matters for the same nested- cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	9993db0193	Scrub inherited FDs in fork-child prelude Implements fix-direction (1)/blunt-close-all-FDs from `b71705bd` (`subint_forkserver` nested-cancel hang diag), targeting the multi-level cancel-cascade deadlock in `test_nested_multierrors[subint_forkserver]`. The diagnosis doc voted for surgical FD cleanup via `actor.ipc_server` handle as the cleanest approach, but going blunt is actually the right call: after `os.fork()`, the child immediately enters `_actor_child_main()` which opens its OWN IPC sockets / wakeup-fd / epoll-fd / etc. — none of the parent's FDs are needed. Closing everything except stdio is safe AND defends against future listener/IPC additions to the parent inheriting silently into children. Deats, - new `_close_inherited_fds(keep={0,1,2}) -> int` helper. Linux fast-path enumerates `/proc/self/fd`; POSIX fallback uses `RLIMIT_NOFILE` range. Matches the stdlib `subprocess._posixsubprocess.close_fds` strategy. Returns close-count for sanity logging - wire into `fork_from_worker_thread._worker()`'s post-fork child prelude — runs immediately after the pid-pipe `os.close(rfd/wfd)`, before the user `child_target` callable executes - docstring cross-refs the diagnosis doc + spells out the FD-inheritance-cascade mechanism and why the close-all approach is safe for our spawn shape Validation pending: re-run `test_nested_multierrors[subint_forkserver]` to confirm the deadlock is gone. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	35da808905	Refine `subint_forkserver` nested-cancel hang diagnosis Major rewrite of `subint_forkserver_test_cancellation_leak_issue.md` after empirical investigation revealed the earlier "descendant-leak + missing tree-kill" diagnosis conflated two unrelated symptoms: 1. 5-zombie leak holding `:1616` — turned out to be a self-inflicted cleanup bug: `pkill`-ing a bg pytest task (SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel cascade entirely. Codified the real fix — SIGINT-first ladder w/ bounded wait before SIGKILL — in `e5e2afb5` (`run-tests` SKILL) and `feedback_sc_graceful_cancel_first.md`. 2. `test_nested_multierrors[subint_forkserver]` hangs indefinitely — the actual backend bug, and it's a deadlock not a leak. Deats, - new diagnosis: all 5 procs are kernel-`S` in `do_epoll_wait`; pytest-main's trio-cache workers are in `os.waitpid` waiting for children that are themselves waiting on IPC that never arrives — graceful `Portal.cancel_actor` cascade never reaches its targets - tree-structure evidence: asymmetric depth across two identical `run_in_actor` calls — child 1 (3 threads) spawns both its grandchildren; child 2 (1 thread) never completes its first nursery `run_in_actor`. Smells like a race on fork- inherited state landing differently per spawn ordering - new hypothesis: `os.fork()` from a subactor inherits the ROOT parent's IPC listener FDs transitively. Grandchildren end up with three overlapping FD sets (own + direct-parent + root), so IPC routing becomes ambiguous. Predicts bug scales with fork depth — matches reality: single- level spawn works, multi-level hangs - ruled out: `_ForkedProc.kill()` tree-kill (never reaches hard-kill path), `:1616` contention (fixed by `reg_addr` fixture wiring), GIL starvation (each subactor has its own OS process+GIL), child-side KBI absorption (`_trio_main` only catches KBI at `trio.run()` callsite, reached only on trio-loop exit) - four fix directions ranked: (1) blanket post-fork `closerange()`, (2) `FD_CLOEXEC` + audit, (3) targeted FD cleanup via `actor.ipc_server` handle, (4) `os.posix_spawn` w/ `file_actions`. Vote: (3) — surgical, doesn't break the "no exec" design of `subint_forkserver` - standalone repro added (`spawn_and_error(breadth= 2, depth=1)` under `trio.fail_after(20)`) - stopgap: skip `test_nested_multierrors` + multi- level-spawn tests under the backend via `@pytest.mark.skipon_spawn_backend(...)` until fix lands Killing the "tree-kill descendants" fix-direction section: it addressed a bug that didn't exist. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	70d58c4bd2	Use SIGINT-first ladder in `run-tests` cleanup The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test- cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	1af2121057	Wire `reg_addr` through leaky cancel tests Stopgap companion to `d0121960` (`subint_forkserver` test-cancellation leak doc): five tests in `tests/test_cancellation.py` were running against the default `:1616` registry, so any leaked `subint-forkserv` descendant from a prior test holds the port and blows up every subsequent run with `TooSlowError` / "address in use". Thread the session-unique `reg_addr` fixture through so each run picks its own port — zombies can no longer poison other tests (they'll only cross-contaminate whatever happens to share their port, which is now nothing). Deats, - add `reg_addr: tuple` fixture param to: - `test_cancel_infinite_streamer` - `test_some_cancels_all` - `test_nested_multierrors` - `test_cancel_via_SIGINT` - `test_cancel_via_SIGINT_other_task` - explicitly pass `registry_addrs=[reg_addr]` to the two `open_nursery()` calls that previously had no kwargs at all (in `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`) - add bounded `@pytest.mark.timeout(7, method='thread')` to `test_nested_multierrors` so a hung run doesn't wedge the whole session Still doesn't close the real leak — the `subint_forkserver` backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren survive teardown regardless of registry port. This commit is just blast-radius containment until that fix lands. See `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	e3f4f5a387	Add `subint_forkserver` test-cancellation leak doc New `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks exactly 5 `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr) — and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps) — `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped — grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process- group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc/<pid>/task//children` walk. Vote: (1) — POSIX-standard, aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.pytest.*spawn-backend= subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up — that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	d093c31979	Add zombie-actor check to `run-tests` skill Fork-based backends (esp. `subint_forkserver`) can leak child actor processes on cancelled / SIGINT'd test runs; the zombies keep the tractor default registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`) bound, so every subsequent session can't bind and 50+ unrelated tests fail with the same `TooSlowError` / "address in use" signature. Document the pre-flight + post-cancel check as a mandatory step 4. Deats, - primary signal: `ss -tlnp \| grep ':1616'` for a bound TCP registry listener — the authoritative check since :1616 is unique to our runtime - `pgrep -af` scoped to `$(pwd)/py[0-9]/bin/python. _actor_child_main\|subint-forkserv` for leftover actor/forkserver procs — scoped deliberately so we don't false-flag legit long-running tractor- embedding apps like `piker` - `ls /tmp/registry@.sock` for stale UDS sockets - scoped cleanup recipe (SIGTERM + SIGKILL sweep using the same `$(pwd)/py` pattern, UDS `rm -f`, re-verify) plus a fallback for when a zombie holds :1616 but doesn't match the pattern: `ss -tlnp` → kill by PID - explicit false-positive warning calling out the `piker` case (`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a bare `pgrep` doesn't lead to nuking unrelated apps Goal: short-circuit the "spelunking into test code" rabbit-hole when the real cause is just a leaked PID from a prior session, without collateral damage to other tractor-embedding projects on the same box. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	1e357dcf08	Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg Also, some slight touchups in `.spawn._subint`.	2026-04-23 18:48:34 -04:00
Gud Boi	e31eb8d7c9	Label forkserver child as `subint_forkserver` Follow-up to `72d1b901` (was prev commit adding `debug_mode` for `subint_forkserver`): that commit wired the runtime-side `subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the `subint_forkserver_proc` child-target was still passing `spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines would report the subactor as plain `'trio'` instead of the actual parent-side spawn mechanism. Flip the label to `'subint_forkserver'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	8bcbe730bf	Enable `debug_mode` for `subint_forkserver` The `subint_forkserver` backend's child runtime is trio-native (uses `_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`), so `tractor.devx.debug._tty_lock` works in those subactors. Wire the runtime gates that historically hard-coded `_spawn_method == 'trio'` to recognize this third backend. Deats, - new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root` listing the spawn backends whose subactor runtime is trio-native (`'trio'`, `'subint_forkserver'`). Both the enable-site (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset key. off the same tuple — keep them in lockstep when adding backends - `open_root_actor`'s `RuntimeError` for unsupported backends now reports the full compatible-set + the rejected method instead of the stale "only `trio`" msg. - `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds `'subint_forkserver'` to the existing `('trio', 'subint')` tuple — fork child-side runtime receives the same SpawnSpec IPC handshake as the others. - `subint_forkserver_proc` child-target now passes `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so `Actor.pformat()` / log lines reflect the actual parent-side spawn mechanism rather than masquerading as plain `trio`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	5e85f184e0	Drop unneeded f-str prefixes	2026-04-23 18:48:34 -04:00
Gud Boi	f5f37b69e6	Shorten some timeouts in `subint_forkserver` suites	2026-04-23 18:48:34 -04:00
Gud Boi	a72deef709	Refine `subint_forkserver` orphan-SIGINT diagnosis Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is not "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork — CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` — the runtime's intentional "KBI-as-OS-cancel" path — never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup — TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding — old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	dcd5c1ff40	Scaffold `child_sigint` modes for forkserver Add configuration surface for future child-side SIGINT plumbing in `subint_forkserver_proc` without wiring up the actual trio-native SIGINT bridge — lifting one entry-guard clause will flip the `'trio'` branch live once the underlying fork-prelude plumbing is implemented. Deats, - new `ChildSigintMode = Literal['ipc', 'trio']` type + `_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default. Docstring block enumerates both: - `'ipc'` (default, currently the only implemented mode): no child-side SIGINT handler — `trio.run()` is on the fork-inherited non-main thread where `signal.set_wakeup_fd()` is main-thread-only, so cancellation flows exclusively via the parent's `Portal.cancel_actor()` IPC path. Known gap: orphan children don't respond to SIGINT (`test_orphaned_subactor_sigint_cleanup_DRAFT`) - `'trio'` (scaffolded only): manual SIGINT → trio-cancel bridge in the fork-child prelude so external Ctrl-C reaches stuck grandchildren even w/ a dead parent - `subint_forkserver_proc` pulls `child_sigint` out of `proc_kwargs` (matches how `trio_proc` threads config to `open_process`, keeps `start_actor(proc_kwargs=...)` as the ergonomic entry point); validates membership + raises `NotImplementedError` for `'trio'` at the backend-entry guard - `_child_target` grows a `match child_sigint:` arm that slots in the future `'trio'` impl without restructuring — today only the `'ipc'` case is reachable - module docstring "Still-open work" list grows a bullet pointing at this config + the xfail'd orphan-SIGINT test No behavioral change on the default path — `'ipc'` is the existing flow. Scaffolding only. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	76605d5609	Add DRAFT `subint_forkserver` orphan-SIGINT test Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT` documents an empirical SIGINT-delivery gap in the `subint_forkserver` backend: when the parent dies via `SIGKILL` (no IPC `Portal.cancel_actor()` possible) and `SIGINT` is sent to the orphan child, the child DOES NOT unwind — CPython's default `KeyboardInterrupt` is delivered to `threading.main_thread()`, whose tstate is dead in the post-fork child bc fork inherited the worker thread, not main. Trio running on the fork-inherited worker thread therefore never observes the signal. Marked `xfail(strict=True)` so the mark flips to XPASS→fail once the backend grows explicit SIGINT plumbing. Deats, - harness runs the failure-mode sequence out-of-process: 1. harness subprocess runs a fresh Python script that calls `try_set_start_method('subint_forkserver')` then opens a root actor + one `sleep_forever` subactor 2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers off harness `stdout` to confirm IPC handshake completed 3. `SIGKILL` the parent, `proc.wait()` to reap the zombie (otherwise `os.kill(pid, 0)` keeps reporting it alive) 4. assert the child survived the parent-reap (i.e. was actually orphaned, not reaped too) before moving on 5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)` every 100ms for up to 10s - supporting helpers: `_read_marker()` with per-proc bytes-buffer to carry partial lines across calls, `_process_alive()` liveness probe via `kill(pid, 0)` - Linux-only via `platform.system() != 'Linux'` skip — orphan-reparenting semantics don't generalize to other platforms - port offset (`reg_addr[1] + 17`) so the harness listener doesn't race concurrently-running backend tests - best-effort `finally:` cleanup: `SIGKILL` any still-alive pids + `proc.kill()` + bounded `proc.wait()` to avoid leaking orphans across the session Also, tier-4 header comment documents the cross-backend generalization path: applicable to any multi-process backend (`trio`, `mp_spawn`, `mp_forkserver`, `subint_forkserver`), NOT to plain `subint` (in-process subints have no orphan OS-child). Move path: lift harness into `tests/_orphan_harness.py`, parametrize on session `_spawn_method`, add `skipif _spawn_method == 'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	7804a9feac	Refactor `_runtime_vars` into pure get/set API Post-fork `_runtime_vars` reset in `subint_forkserver_proc` was previously done via direct mutation of `_state._runtime_vars` from an external module + an inline default dict duplicating the `_state.py`-internal defaults. Split the access surface into a pure getter + explicit setter so the reset call site becomes a one-liner composition. Deats `tractor/runtime/_state.py`, - extract initial values into a module-level `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the live `_runtime_vars` is now initialised from `dict(_RUNTIME_VARS_DEFAULTS)` - `get_runtime_vars()` grows a `clear_values: bool = False` kwarg. When True, returns a fresh copy of `_RUNTIME_VARS_DEFAULTS` instead of the live dict — still a pure read, never mutates anything - new `set_runtime_vars(rtvars: dict \| RuntimeVars)` — atomic replacement of the live dict's contents via `.clear()` + `.update()`, so existing references to the same dict object remain valid. Accepts either the historical dict form or the `RuntimeVars` struct Deats `tractor/spawn/_subint_forkserver.py`, - collapse the prior ad-hoc `.update({...})` block into `set_runtime_vars(get_runtime_vars(clear_values=True))` - drop the `_state._current_actor = None` line — `_trio_main` unconditionally overwrites it downstream, so no explicit reset needed (noted in the XXX comment) (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	63ab7c986b	Reset post-fork `_state` in forkserver child `os.fork()` inherits the parent's entire memory image, including `tractor.runtime._state` globals that encode "this process is the root actor" — `_runtime_vars`'s `_is_root=True`, pre-populated `_root_mailbox` + `_registry_addrs`, and the parent's `_current_actor` singleton. A fresh `exec`-based child starts with those globals at their module-level defaults (all falsey/empty). The forkserver child needs to match that shape BEFORE calling `_actor_child_main()`, otherwise `Actor.__init__()` takes the `is_root_process() == True` branch and pre-populates `self.enable_modules`, which then trips `assert not self.enable_modules` at the top of `Actor._from_parent()` on the subsequent parent→child `SpawnSpec` handshake. Fix: at the start of `_child_target`, null `_state._current_actor` and overwrite `_runtime_vars` with a cold-root blank (`_is_root=False`, empty mailbox/addrs, `_debug_mode=False`) before `_actor_child_main()` runs. Found-via: `test_subint_forkserver_spawn_basic` hitting the `enable_modules` assert on child-side runtime boot. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	26914fde75	Wire `subint_forkserver` as first-class backend Promote `_subint_forkserver` from primitives-only into a registered spawn backend: `'subint_forkserver'` is now a `SpawnMethodKey` literal, dispatched via `_methods` to the new `subint_forkserver_proc()` target, feature-gated under the existing `subint`-family py3.14+ case, and selectable via `--spawn-backend=subint_forkserver`. Deats, - new `subint_forkserver_proc()` spawn target in `_subint_forkserver`: - mirrors `trio_proc()`'s supervision model — real OS subprocess so `Portal.cancel_actor()` + `soft_kill()` on graceful teardown, `os.kill(SIGKILL)` on hard-reap (no `_interpreters.destroy()` race to fuss over bc the child lives in its own process) - only real diff from `trio_proc` is the spawn mechanism: fork from a main-interp worker thread via `fork_from_worker_thread()` (off-loaded to trio's thread pool) instead of `trio.lowlevel.open_process()` - child-side `_child_target` closure runs `tractor._child._actor_child_main()` with `spawn_method='trio'` — the child is a regular trio actor, "subint_forkserver" names how the parent spawned, not what the child runs - new `_ForkedProc` class — thin `trio.Process`-compatible shim around a raw OS pid: `.poll()` via `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio cache thread, `.kill()` via `SIGKILL`, `.returncode` cached for repeat calls. `.stdin`/`.stdout`/`.stderr` are `None` (fork-w/o-exec inherits parent FDs; we don't marshal them) which matches `soft_kill()`'s `is not None` guards Also, new backend-tier test `test_subint_forkserver_spawn_basic` drives the registered backend end-to-end via `open_root_actor` + `open_nursery` + `run_in_actor` w/ a trivial portal-RPC round-trip. Uses a `forkserver_spawn_method` fixture to flip `_spawn_method`/`_ctx` for the test's duration + restore on teardown (so other session-level tests don't observe the global flip). Test module docstring reworked to describe the three tiers now covered: (1) primitive-level, (2) parent-trio-driven primitives, (3) full registered backend. Status: still-open work (tracked on `tractor#379`) doc'd inline in the module docstring — no cancel/hard-kill stress coverage yet, child-side subint-hosted root runtime still future (gated on `msgspec#563`), thread-hygiene audit pending the same unblock. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	cf2e71d87f	Add `subint_forkserver` PEP 684 audit-plan doc Follow-up tracker companion to the module-docstring TODO added in `372a0f32`. Catalogs why `_subint_forkserver`'s two "non-trio thread" constraints (`fork_from_worker_thread()` + `run_subint_in_worker_thread()` both allocating dedicated `threading.Thread`s; test helper named `run_fork_in_non_trio_thread`) exist today, and which of them would dissolve once msgspec PEP 684 support ships (`msgspec#563`) and tractor flips to isolated-mode subints. Deats, - three reasons enumerated for the current constraints: - class-A GIL-starvation — fixed by isolated mode: subints don't share main's GIL so abandoned-thread contention disappears - destroy race / tstate-recycling from `subint_proc` — unclear: `_PyXI_Enter` + `_PyXI_Exit` are cross-mode, so isolated doesn't obviously fix it; needs empirical retest on py3.14 + isolated API - fork-from-main-interp-tstate (the CPython-level `_PyInterpreterState_DeleteExceptMain` gate) — the narrow reason for using a dedicated thread; probably fixed IF the destroy-race also resolves (bc trio's cache threads never drove subints → clean main-interp tstate) - TL;DR table of which constraints unwind under each resolution branch - four-step audit plan for when `msgspec#563` lands: - flip `_subint` to isolated mode - empirical destroy-race retest - audit `_subint_forkserver.py` — drop `non_trio` qualifier / maybe inline primitives - doc fallout — close the three `subint_*_issue.md` siblings w/ post-mortem notes Also, cross-refs the three sibling `conc-anal/` docs, PEPs 684 + 734, `msgspec#563`, and `tractor#379` (the overall subint spawn-backend tracking issue). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	25e400d526	Add trio-parent tests for `_subint_forkserver` New pytest module `tests/spawn/test_subint_forkserver.py` drives the forkserver primitives from inside a real `trio.run()` in the parent — the runtime shape tractor will actually use when we wire up a `subint_forkserver` spawn backend proper. Complements the standalone no-trio-in-parent `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`. Deats, - new test pkg `tests/spawn/` (+ empty `__init__.py`) - two tests, both `@pytest.mark.timeout(30, method='thread')` for the GIL-hostage safety reason doc'd in `ai/conc-anal/subint_sigint_starvation_issue.md`: - `test_fork_from_worker_thread_via_trio` — parent-side plumbing baseline. `trio.run()` off-loads forkserver prims via `trio.to_thread.run_sync()` + asserts the child reaps cleanly - `test_fork_and_run_trio_in_child` — end-to-end: forked child calls `run_subint_in_worker_thread()` with a bootstrap str that does `trio.run()` in a fresh subint - both tests wrap the inner `trio.run()` in a `dump_on_hang()` for post-mortem if the outer `pytest-timeout` fires - intentionally NOT using `--spawn-backend` — the tests drive the primitives directly rather than going through tractor's spawn-method registry (which the forkserver isn't plugged into yet) Also, rename `run_trio_in_subint()` → `run_subint_in_worker_thread()` for naming consistency with the sibling `fork_from_worker_thread()`. The action is really "host a subint on a worker thread", not specifically "run trio" — trio just happens to be the typical payload. Propagate the rename to the smoketest. Further, add a "TODO — cleanup gated on msgspec PEP 684 support" section to the `_subint_forkserver` module docstring: flags the dedicated-`threading.Thread` design as potentially-revisable once isolated-mode subints are viable in tractor. Cross-refs `msgspec#563` + `tractor#379` and points at an audit-plan conc-anal doc we'll add next. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	82332fbceb	Lift fork prims into `_subint_forkserver` mod The smoketest (prior commit) empirically validated the "fork-from-main-interp-worker-thread" arch on py3.14. Promote the validated primitives out of the `ai/conc-anal/` smoketest into `tractor.spawn._subint_forkserver` so they can eventually be wired into a real "subint forkserver" spawn backend. Deats, - new module `tractor/spawn/_subint_forkserver.py` (337 LOC): - `fork_from_worker_thread(child_target, thread_name)` — spawn a main-interp `threading.Thread`, call `os.fork()` from it, shuttle the child pid back to main via a pipe - `run_trio_in_subint(bootstrap, ...)` — post-fork helper: create a fresh subint + drive `_interpreters.exec()` on a dedicated worker thread running the `bootstrap` str (typically imports `trio`, defines an async entry, calls `trio.run()`) - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` + pass/fail classification reusable from harness AND the eventual real spawn path - feature-gated py3.14+ via the public `concurrent.interpreters` presence check; matches the gate in `tractor.spawn._subint` - module docstring doc's the CPython-block context (cross-refs `_subint_fork` stub + the two `conc-anal/` docs) and status: EXPERIMENTAL, not yet registered in `_spawn._methods` Also, refactor the smoketest `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to import the primitives from the new module rather than inline its own copies. Keeps the smoketest and the tractor-side impl in sync as the forkserver design evolves; the smoketest remains a zero-`tractor`-runtime CPython-level check (imports ONLY the three primitives, no runtime bring-up). Status: next step is to drive these from a parent-side `trio.run()` and hook the returned child pid into the normal actor-nursery/IPC flow — then register `subint_forkserver` as a `SpawnMethodKey` in `_spawn.py`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	de4f470b6c	Add CPython-level `subint_fork` workaround smoketest Standalone script to validate the "main-interp worker-thread forkserver + subint-hosted trio" arch proposed as a workaround to the CPython-level refusal doc'd in `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Deliberately NOT a `tractor` test — zero `tractor` imports. Uses `_interpreters` (private stdlib) + `os.fork()` directly so pass/fail is a property of CPython alone, independent of our runtime. Requires py3.14+. Deats, - four scenarios via `--scenario`: - `control_subint_thread_fork` — the KNOWN-BROKEN case as a harness sanity; if the child DOESN'T abort, our analysis is wrong - `main_thread_fork` — baseline sanity, must always succeed - `worker_thread_fork` — architectural assertion: regular `threading.Thread` attached to main interp calls `os.fork()`; child should survive post-fork cleanup - `full_architecture` — end-to-end: fork from a main-interp worker thread, then in child create a subint driving a worker thread running `trio.run()` - exit code 0 on EXPECTED outcome (for `control_*` that means "child aborted", not "child succeeded") - each scenario prints a self-contained pass/fail banner; use `os.waitpid()` of the parent + per-scenario status prints to observe the child's fate Also, log NLNet provenance for this session's three-sub-phase work (py3.13 gate tightening, `pytest-timeout` + marker refactor, `subint_fork` prototype → CPython-block finding). Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	0f48ed2eb9	Doc `subint_fork` as blocked by CPython post-fork Empirical finding: the WIP `subint_fork_proc` scaffold landed in `cf0e3e6f` does not work on current CPython. The `fork()` syscall succeeds in the parent, but the CHILD aborts immediately during `PyOS_AfterFork_Child()` → `_PyInterpreterState_DeleteExceptMain()`, which gates on the current tstate belonging to the main interp — the child dies with `Fatal Python error: not main interpreter`. CPython devs acknowledge the fragility with an in-source comment (`// Ideally we could guarantee tstate is running main.`) but expose no user-facing hook to satisfy the precondition — so the strategy is structurally dead until upstream changes. Rather than delete the scaffold, reshape it into a documented dead-end so the next person with this idea lands on the reason rather than rediscovering the same CPython-level refusal. Deats, - Move `subint_fork_proc` out of `tractor.spawn._subint` into a new `tractor.spawn._subint_fork` dedicated module (153 LOC). Module + fn docstrings now describe the blockage directly; the fn body is trimmed to a `NotImplementedError` pointing at the analysis doc — no more dead-code `bootstrap` sketch bloating `_subint.py`. - `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey` + the `_methods` dispatch so `--spawn-backend=subint_fork` routes to a clean `NotImplementedError` rather than "invalid backend"; comment calls out the blockage. Collapse the duplicate py3.14 feature-gate in `try_set_start_method()` into a combined `case 'subint' \| 'subint_fork':` arm. - New 337-line analysis: `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Annotated walkthrough from the user-visible fatal error down to the specific `Modules/posixmodule.c` + `Python/pystate.c` source lines enforcing the refusal, plus an upstream-report draft. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:06 -04:00
Gud Boi	eee79a0357	Add WIP `subint_fork_proc` backend scaffold Experimental third spawn backend: use a fresh sub-interpreter purely as a trio-free launchpad from which to `os.fork()` + exec back into `python -m tractor._child`. Per issue #379's "fork()-workaround/hacks" thread. Intent is to sidestep both, - the trio+fork hazards hitting `trio_proc` (python- trio/trio#1614 et al.), since the forking interp is guaranteed trio-free. - the shared-GIL abandoned-thread hazards hitting `subint_proc` (`ai/conc-anal/subint_sigint_starvation_issue.md`), since we don't stay in the subint — it only lives long enough to call `os.fork()` Downstream of the fork+exec, all the existing `trio_proc` plumbing is reused verbatim: `ipc_server.wait_for_peer()`, `SpawnSpec`, `Portal` yield, soft-kill. Status: NOT wired up beyond scaffolding. The fn raises `NotImplementedError` immediately; the `bootstrap` fork/exec string builder and the `# TODO: orchestrate driver thread` block are kept in-tree as deliberate dead code so the next iteration starts from a concrete shape rather than a blank page. Docstring calls out three open questions that need empirical validation before wiring this up: 1. Does CPython permit `os.fork()` from a non-main legacy subint? 2. Can the child stay fork-without-exec and `trio.run()` directly from within the launchpad subint? 3. How do `signal.set_wakeup_fd()` handlers and other process-global state interact when the forking thread is inside a subint? (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:06 -04:00
Gud Boi	4b2a0886c3	Mark `subint`-hanging tests with `skipon_spawn_backend` Adopt the `@pytest.mark.skipon_spawn_backend('subint', reason=...)` marker (`a617b521`) across the suites reproducing the `subint` GIL-contention / starvation hang classes doc'd in `ai/conc-anal/subint_*_issue.md`. Deats, - Module-level `pytestmark` on full-file-hanging suites: - `tests/test_cancellation.py` - `tests/test_inter_peer_cancellation.py` - `tests/test_pubsub.py` - `tests/test_shm.py` - Per-test decorator where only one test in the file hangs: - `tests/discovery/test_registrar.py ::test_stale_entry_is_deleted` — replaces the inline `if start_method == 'subint': pytest.skip` branch with a declarative skip. - `tests/test_subint_cancellation.py ::test_subint_non_checkpointing_child`. - A few per-test decorators are left commented-in- place as breadcrumbs for later finer-grained unskips. Also, some nearby tidying in the affected files: - Annotate loose fixture / test params (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in `tests/conftest.py`, `tests/devx/conftest.py`, and `tests/test_cancellation.py`. - Normalize `"""..."""` → `'''...'''` docstrings per repo convention on a few touched tests. - Add `timeout=6` / `timeout=10` to `@tractor_test(...)` on `test_cancel_infinite_streamer` and `test_some_cancels_all`. - Drop redundant `spawn_backend` param from `test_cancel_via_SIGINT`; use `start_method` in the `'mp' in ...` check instead. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	3b26b59dad	Add `skipon_spawn_backend` pytest marker A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...], reason='...')` marker for backend-specific known-hang / -borked cases — avoids scattering `@pytest.mark.skipif(lambda ...)` branches across tests that misbehave under a particular `--spawn-backend`. Deats, - `pytest_configure()` registers the marker via `addinivalue_line('markers', ...)`. - New `pytest_collection_modifyitems()` hook walks each collected item with `item.iter_markers( name='skipon_spawn_backend')`, checks whether the active `--spawn-backend` appears in `mark.args`, and if so injects a concrete `pytest.mark.skip( reason=...)`. `iter_markers()` makes the decorator work at function, class, or module (`pytestmark = [...]`) scope transparently. - First matching mark wins; default reason is `f'Borked on --spawn-backend={backend!r}'` if the caller doesn't supply one. Also, tighten type annotations on nearby `pytest` integration points — `pytest_configure`, `debug_mode`, `spawn_backend`, `tpt_protos`, `tpt_proto` — now taking typed `pytest.Config` / `pytest.FixtureRequest` params. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	f3cea714bc	Expand `subint` sigint-starvation hang catalog Add two more tests to the catalog in `conc-anal/subint_sigint_starvation_issue.md` — same signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe → silent SIGINT drop), different load patterns. Deats, - `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so the grandchild persists in its abandoned driver thread. Often only manifests under full-suite runs (earlier tests seed the abandoned-thread pool). - `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors all go through teardown on the multierror. Bounded hard-kills run in parallel — so the total budget is ~3s, not 3s × 25. Leaves 25 abandoned driver threads simultaneously alive, an extreme pressure multiplier. `strace` shows several successful `write(16, "\2", 1) = 1` (GIL round-robin IS giving main brief slices) before finally saturating with `EAGAIN`. Also include a `pstree -snapt <pid>` capture showing 16+ live `{subint-driver[<interp_id>}` threads at the moment of hang — the direct GIL-contender population. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	985ea76de5	Skip `test_stale_entry_is_deleted` hanger with `subint`s	2026-04-23 18:47:49 -04:00
Gud Boi	5998774535	Add global 200s `pytest-timeout`	2026-04-23 18:47:49 -04:00
Gud Boi	a6cbac954d	Bump lock-file for `pytest-timeout` + 3.13 gated wheel-deps	2026-04-23 18:47:49 -04:00
Gud Boi	189f4e3ffc	Wall-cap `subint` audit tests via `pytest-timeout` Add a hard process-level wall-clock bound on the two known-hanging subint-backend tests so an unattended suite run can't wedge indefinitely in either of the hang classes doc'd in `ai/conc-anal/`. Deats, - New `testing` dep: `pytest-timeout>=2.3`. - `test_stale_entry_is_deleted`: `@pytest.mark.timeout(3, method='thread')`. The `method='thread'` choice is deliberate — `method='signal'` routes via `SIGALRM` which is starved by the same GIL-hostage path that drops `SIGINT` (see `subint_sigint_starvation_issue.md`), so it'd never actually fire in the starvation case. - `test_subint_non_checkpointing_child`: same decorator, same reasoning (defense-in-depth over the inner `trio.fail_after(15)`). At timeout, `pytest-timeout` hard-kills the pytest process itself — that's the intended behavior here; the alternative is the suite never returning. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	a65fded4c6	Add prompt-io log for `subint` hang-class docs Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint` backend hang classes + arm `dump_on_hang`"). Substantive bc the two new `ai/conc-anal/` docs were jointly authored — user framed the two-class split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang; claude drafted the prose and the test-side cross-linking comments. `.raw.md` is in diff-ref mode — per-file pointers via `git diff e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that already lives in `git log -p`. Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	4a3254583b	Doc `subint` backend hang classes + arm `dump_on_hang` Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch. Deats, two new `ai/conc-anal/` docs, - `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563) - `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown — no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown Cross-link the docs from their test reproducers, - `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un- skipped so the dump file is inspectable - `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	2ed5e6a6e8	Add `subint` cancellation + hard-kill test audit Lock in the escape-hatch machinery added to `tractor.spawn._subint` during the Phase B.2/B.3 bringup (issue #379) so future stdlib regressions or our own refactors don't silently re-introduce the mid-suite hangs. Deats, - `test_subint_happy_teardown`: baseline — spawn a subactor, one portal RPC, clean teardown. If this breaks, something's wrong unrelated to the hard-kill shields. - `test_subint_non_checkpointing_child`: cancel a subactor stuck in a non-checkpointing Python loop (`threading.Event.wait()` releases the GIL but never inserts a trio checkpoint). Validates the bounded-shield + daemon-driver-thread combo abandons the thread after `_HARD_KILL_TIMEOUT`. Every test is wrapped in `trio.fail_after()` for a deterministic per-test wall-clock ceiling (an unbounded audit would defeat itself) and arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump — pytest's stderr capture swallows `faulthandler` output by default. Gated via `pytest.importorskip('concurrent.interpreters')` and a module-level skip when `--spawn-backend` isn't `'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	34d9d482e4	Raise `subint` floor to py3.14 and split dep-groups The private `_interpreters` C module ships since 3.13, but that vintage wedges under our `threading.Thread` + multi-trio usage pattern —> `_interpreters.exec()` silently never makes progress. 3.14 fixes it. So gate on the presence of the public `concurrent.interpreters` wrapper (3.14+ only) even tho we still call into the private module at runtime. Deats, - `try_set_start_method('subint')` error msg + `_subint` module docstring/comments rewritten to document the 3.14 floor and why 3.13 can't work. - `_subint._has_subints` gate now imports `concurrent.interpreters` (not `_interpreters`) as the version sentinel. Also, reshuffle `pyproject.toml` deps into per-python-version `[tool.uv.dependency-groups]`: - `subints` group: `msgspec>=0.21.0`, py>=3.14 - `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14 - `sync_pause` group: `greenback`, py>=3.13,<3.14 (was in `devx`; moved out bc no 3.14 yet) Bump top-level `msgspec>=0.20.0` too. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	09466a1e9d	Add `._debug_hangs` to `.devx` for hang triage Bottle up the diagnostic primitives that actually cracked the silent mid-suite hangs in the `subint` spawn-backend bringup (issue there" session has them on the shelf instead of reinventing from scratch. Deats, - `dump_on_hang(seconds, , path)` — context manager wrapping `faulthandler.dump_traceback_later()`. Critical gotcha baked in: dumps go to a file, not `sys.stderr`, bc pytest's stderr capture silently eats the output and you can spend an hour convinced you're looking at the wrong thing - `track_resource_deltas(label, , writer)` — context manager logging per-block `(threading.active_count(), len(_interpreters.list_all()))` deltas; quickly rules out leak-accumulation theories when a suite progressively worsens (if counts don't grow, it's not a leak, look for a race on shared cleanup instead) - `resource_delta_fixture(*, autouse, writer)` — factory returning a `pytest` fixture wrapping `track_resource_deltas` per-test; opt in by importing into a `conftest.py`. Kept as a factory (not a bare fixture) so callers own `autouse` / `writer` wiring Also, - export the three names from `tractor.devx` - dep-free on py<3.13 (swallows `ImportError` for `_interpreters`) - link back to the provenance in the module docstring (issue #379 / commit `26fb820`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	99541feec7	Bound subint teardown shields with hard-kill timeout Unbounded `trio.CancelScope(shield=True)` at the soft-kill and thread-join sites can wedge the parent trio loop indefinitely when a stuck subint ignores portal-cancel (e.g. bc the IPC channel is already broken). Deats, - add `_HARD_KILL_TIMEOUT` (3s) module-level const - wrap both shield sites with `trio.move_on_after()` so we abandon a stuck subint after the deadline - flip driver thread to `daemon=True` so proc-exit also isn't blocked by a wedged subint - pass `abandon_on_cancel=True` to `trio.to_thread.run_sync(driver_thread.join)` — load-bearing for `move_on_after` to actually fire - log warnings when either timeout triggers - improve `InterpreterError` log msg to explain the abandoned-thread scenario (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	c041518bdb	Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	31cbd11a5b	Fix subint destroy race via dedicated OS thread `trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on a cached worker thread — and when that thread is returned to the cache after the subint's `trio.run()` exits, CPython still keeps the subint's tstate attached to the (now idle) worker. Result: the teardown `_interpreters.destroy(interp_id)` in the `finally` block can block the parent's trio loop indefinitely, waiting for a tstate release that only happens when the worker either picks up a new job or exits. Manifested as intermittent mid-suite hangs under `--spawn-backend=subint` — caught by a `faulthandler.dump_traceback_later()` showing the main thread stuck in `_interpreters.destroy()` at `_subint.py:293` with only an idle trio-cache worker as the other live thread. Deats, - drive the subint on a plain `threading.Thread` (not `trio.to_thread`) so the OS thread truly exits after `_interpreters.exec()` returns, releasing tstate and unblocking destroy - signal `subint_exited.set()` back to the parent trio loop from the driver thread via `trio.from_thread.run_sync(..., trio_token=...)` — capture the token at `subint_proc` entry - swallow `trio.RunFinishedError` in that signal path for the case where parent trio has already exited (proc teardown) - in the teardown `finally`, off-load the sync `driver_thread.join()` to `trio.to_thread.run_sync` (cache thread w/ no subint tstate → safe) so we actually wait for the driver to exit before `_interpreters.destroy()` (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	8a8d01e076	Doc the `_interpreters` private-API choice in `_subint` Expand the comment block above the `_interpreters` import explaining why we use the private C mod over `concurrent.interpreters`: the public API only exposes PEP 734's `'isolated'` config which breaks `msgspec` (missing PEP 684 slot). Add reference links to PEP 734, PEP 684, cpython sources, and the msgspec upstream tracker (jcrist/msgspec#563). Also, - update error msgs in both `_spawn.py` and `_subint.py` to say "3.13+" (matching the actual `_interpreters` availability) instead of "3.14+". - tweak the mod docstring to reflect py3.13+ availability via the private C module. Review: PR #444 (copilot-pull-request-reviewer) https://github.com/goodboy/tractor/pull/444 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:47:49 -04:00
Gud Boi	03bf2b931e	Avoid skip `.ipc._ringbuf` import when no `cffi`	2026-04-23 18:47:49 -04:00

1 2 3 4 5 ...

2603 Commits (eceed29d4a6b04edb5b04c555666944aac3b79d9) All Branches Search

2603 Commits (eceed29d4a6b04edb5b04c555666944aac3b79d9)

All Branches