tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	a4d6318ca7	Claude-perms: ensure /commit-msg files can be written!	2026-04-23 16:48:06 -04:00
Gud Boi	fe89169f1c	Skip-mark + narrow `subint_forkserver` cancel hang Two-part stopgap for the still-hanging `test_nested_multierrors[subint_forkserver]`: 1. Skip-mark the test via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The reason string cross-refs the conc-anal doc for full context. 2. Update the conc-anal doc (`subint_forkserver_test_cancellation_leak_issue.md`) with the empirical state after the three nested-cancel fix commits (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan shield break) landed, narrowing the remaining hang from "everything broken" to "peer-channel loops don't exit on `service_tn` cancel". Deats from the DIAGDEBUG instrumentation pass, - 80 `process_messages` ENTERs, 75 EXITs → 5 stuck - ALL 40 `shield=True` ENTERs matched EXIT, so the `_parent_chan_cs.cancel()` wiring from `57935804` works as intended for shielded loops. - the 5 stuck loops are all `shield=False` peer-channel handlers in `handle_stream_from_peer` (inbound connections handled by `stream_handler_tn`, which IS `service_tn` in the current config). - after `_parent_chan_cs.cancel()` fires, NEW shielded loops appear on the session `reg_addr` port, probably discovery-layer reconnection; doesn't block teardown but indicates the cascade has more moving parts than expected. The remaining unknown: why don't the 5 peer-channel loops exit when `service_tn.cancel_scope.cancel()` fires? They're not shielded and they're inside the `service_tn` scope, so a standard cancel should propagate through. Some fork-config-specific divergence keeps them alive. Doc lists three follow-up experiments (stackscope dump, side-by-side `trio_proc` comparison, audit of the `tractor/ipc/_server.py:448` `except trio.Cancelled:` path). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 16:44:15 -04:00
Gud Boi	57935804e2	Break parent-chan shield during teardown Completes the nested-cancel deadlock fix started in `0cd0b633` (fork-child FD scrub) and `fe540d02` (pidfd-cancellable wait). The remaining piece: the parent-channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally. Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable: the parent's teardown closes the handler-task socket deterministically. But fork-based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock. Deats, - `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs` - `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield; no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend - inline comments on both sites explain the mutual-wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 16:27:38 -04:00
Gud Boi	fe540d0228	Use `pidfd` for cancellable `_ForkedProc.wait` Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall, so outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT`: the test now PASSES after `0cd0b633` (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent: the cancellable `wait()` matters for the same nested-cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 16:06:45 -04:00
Gud Boi	0cd0b633f1	Scrub inherited FDs in fork-child prelude Implements fix-direction (1)/blunt-close-all-FDs from `b71705bd` (`subint_forkserver` nested-cancel hang diag), targeting the multi-level cancel-cascade deadlock in `test_nested_multierrors[subint_forkserver]`. The diagnosis doc voted for surgical FD cleanup via `actor.ipc_server` handle as the cleanest approach, but going blunt is actually the right call: after `os.fork()`, the child immediately enters `_actor_child_main()` which opens its OWN IPC sockets / wakeup-fd / epoll-fd / etc. — none of the parent's FDs are needed. Closing everything except stdio is safe AND defends against future listener/IPC additions to the parent inheriting silently into children. Deats, - new `_close_inherited_fds(keep={0,1,2}) -> int` helper. Linux fast-path enumerates `/proc/self/fd`; POSIX fallback uses `RLIMIT_NOFILE` range. Matches the stdlib `subprocess._posixsubprocess.close_fds` strategy. Returns close-count for sanity logging - wire into `fork_from_worker_thread._worker()`'s post-fork child prelude — runs immediately after the pid-pipe `os.close(rfd/wfd)`, before the user `child_target` callable executes - docstring cross-refs the diagnosis doc + spells out the FD-inheritance-cascade mechanism and why the close-all approach is safe for our spawn shape Validation pending: re-run `test_nested_multierrors[subint_forkserver]` to confirm the deadlock is gone. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 15:30:39 -04:00
Gud Boi	b71705bdcd	Refine `subint_forkserver` nested-cancel hang diagnosis Major rewrite of `subint_forkserver_test_cancellation_leak_issue.md` after empirical investigation revealed the earlier "descendant-leak + missing tree-kill" diagnosis conflated two unrelated symptoms: 1. 5-zombie leak holding `:1616`, which turned out to be a self-inflicted cleanup bug: `pkill`-ing a bg pytest task (SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel cascade entirely. Codified the real fix (SIGINT-first ladder w/ bounded wait before SIGKILL) in `e5e2afb5` (`run-tests` SKILL) and `feedback_sc_graceful_cancel_first.md`. 2. `test_nested_multierrors[subint_forkserver]` hangs indefinitely: the actual backend bug, and it's a deadlock not a leak. Deats, - new diagnosis: all 5 procs are kernel-`S` in `do_epoll_wait`; pytest-main's trio-cache workers are in `os.waitpid` waiting for children that are themselves waiting on IPC that never arrives, so the graceful `Portal.cancel_actor` cascade never reaches its targets - tree-structure evidence: asymmetric depth across two identical `run_in_actor` calls; child 1 (3 threads) spawns both its grandchildren, child 2 (1 thread) never completes its first nursery `run_in_actor`. Smells like a race on fork-inherited state landing differently per spawn ordering - new hypothesis: `os.fork()` from a subactor inherits the ROOT parent's IPC listener FDs transitively. Grandchildren end up with three overlapping FD sets (own + direct-parent + root), so IPC routing becomes ambiguous. Predicts bug scales with fork depth, which matches reality: single-level spawn works, multi-level hangs - ruled out: `_ForkedProc.kill()` tree-kill (never reaches hard-kill path), `:1616` contention (fixed by `reg_addr` fixture wiring), GIL starvation (each subactor has its own OS process+GIL), child-side KBI absorption (`_trio_main` only catches KBI at `trio.run()` callsite, reached only on trio-loop exit) - four fix directions ranked: (1) blanket post-fork `closerange()`, (2) `FD_CLOEXEC` + audit, (3) targeted FD cleanup via `actor.ipc_server` handle, (4) `os.posix_spawn` w/ `file_actions`. Vote: (3), which is surgical and doesn't break the "no exec" design of `subint_forkserver` - standalone repro added (`spawn_and_error(breadth=2, depth=1)` under `trio.fail_after(20)`) - stopgap: skip `test_nested_multierrors` + multi-level-spawn tests under the backend via `@pytest.mark.skipon_spawn_backend(...)` until fix lands Killing the "tree-kill descendants" fix-direction section: it addressed a bug that didn't exist. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 15:21:41 -04:00
Gud Boi	e5e2afb5f4	Use SIGINT-first ladder in `run-tests` cleanup The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent; `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*`; no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test-cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 14:46:16 -04:00
Gud Boi	de6016763f	Wire `reg_addr` through leaky cancel tests Stopgap companion to `d0121960` (`subint_forkserver` test-cancellation leak doc): five tests in `tests/test_cancellation.py` were running against the default `:1616` registry, so any leaked `subint-forkserv` descendant from a prior test holds the port and blows up every subsequent run with `TooSlowError` / "address in use". Thread the session-unique `reg_addr` fixture through so each run picks its own port; zombies can no longer poison other tests (they'll only cross-contaminate whatever happens to share their port, which is now nothing). Deats, - add `reg_addr: tuple` fixture param to: - `test_cancel_infinite_streamer` - `test_some_cancels_all` - `test_nested_multierrors` - `test_cancel_via_SIGINT` - `test_cancel_via_SIGINT_other_task` - explicitly pass `registry_addrs=[reg_addr]` to the two `open_nursery()` calls that previously had no kwargs at all (in `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`) - add bounded `@pytest.mark.timeout(7, method='thread')` to `test_nested_multierrors` so a hung run doesn't wedge the whole session Still doesn't close the real leak: the `subint_forkserver` backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren survive teardown regardless of registry port. This commit is just blast-radius containment until that fix lands. See `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 14:37:48 -04:00
Gud Boi	d0121960b9	Add `subint_forkserver` test-cancellation leak doc New `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks exactly 5 `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr), and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps): `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped, so grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process-group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc/<pid>/task/*/children` walk. Vote: (1), since it's POSIX-standard and aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up; that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 13:58:42 -04:00
Gud Boi	aa7450974c	Add zombie-actor check to `run-tests` skill Fork-based backends (esp. `subint_forkserver`) can leak child actor processes on cancelled / SIGINT'd test runs; the zombies keep the tractor default registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`) bound, so every subsequent session can't bind and 50+ unrelated tests fail with the same `TooSlowError` / "address in use" signature. Document the pre-flight + post-cancel check as a mandatory step 4. Deats, - primary signal: `ss -tlnp | grep ':1616'` for a bound TCP registry listener, the authoritative check since :1616 is unique to our runtime - `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv` for leftover actor/forkserver procs, scoped deliberately so we don't false-flag legit long-running tractor-embedding apps like `piker` - `ls /tmp/registry@*.sock` for stale UDS sockets - scoped cleanup recipe (SIGTERM + SIGKILL sweep using the same `$(pwd)/py*` pattern, UDS `rm -f`, re-verify) plus a fallback for when a zombie holds :1616 but doesn't match the pattern: `ss -tlnp` → kill by PID - explicit false-positive warning calling out the `piker` case (`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a bare `pgrep` doesn't lead to nuking unrelated apps Goal: short-circuit the "spelunking into test code" rabbit-hole when the real cause is just a leaked PID from a prior session, without collateral damage to other tractor-embedding projects on the same box. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 13:43:57 -04:00
Gud Boi	6720c3bbb3	Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg Also, some slight touchups in `.spawn._subint`.	2026-04-23 11:50:19 -04:00
Gud Boi	4425023500	Label forkserver child as `subint_forkserver` Follow-up to `72d1b901` (was prev commit adding `debug_mode` for `subint_forkserver`): that commit wired the runtime-side `subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the `subint_forkserver_proc` child-target was still passing `spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines would report the subactor as plain `'trio'` instead of the actual parent-side spawn mechanism. Flip the label to `'subint_forkserver'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 11:43:06 -04:00
Gud Boi	72d1b9016a	Enable `debug_mode` for `subint_forkserver` The `subint_forkserver` backend's child runtime is trio-native (uses `_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`), so `tractor.devx.debug._tty_lock` works in those subactors. Wire the runtime gates that historically hard-coded `_spawn_method == 'trio'` to recognize this third backend. Deats, - new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root` listing the spawn backends whose subactor runtime is trio-native (`'trio'`, `'subint_forkserver'`). Both the enable-site (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset key off the same tuple; keep them in lockstep when adding backends - `open_root_actor`'s `RuntimeError` for unsupported backends now reports the full compatible-set + the rejected method instead of the stale "only `trio`" msg. - `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds `'subint_forkserver'` to the existing `('trio', 'subint')` tuple; the fork child-side runtime receives the same SpawnSpec IPC handshake as the others. - `subint_forkserver_proc` child-target now passes `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so `Actor.pformat()` / log lines reflect the actual parent-side spawn mechanism rather than masquerading as plain `trio`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 11:39:42 -04:00
Gud Boi	4227eabcba	Drop unneeded f-str prefixes	2026-04-23 11:04:10 -04:00
Gud Boi	af94080b21	Shorten some timeouts in `subint_forkserver` suites	2026-04-23 11:01:56 -04:00
Gud Boi	36cca96385	Refine `subint_forkserver` orphan-SIGINT diagnosis Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is not "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork, since CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events`: trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` (the runtime's intentional "KBI-as-OS-cancel" path) never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup with TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding: old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 09:31:32 -04:00
Gud Boi	8a3f94ace0	Scaffold `child_sigint` modes for forkserver Add configuration surface for future child-side SIGINT plumbing in `subint_forkserver_proc` without wiring up the actual trio-native SIGINT bridge — lifting one entry-guard clause will flip the `'trio'` branch live once the underlying fork-prelude plumbing is implemented. Deats, - new `ChildSigintMode = Literal['ipc', 'trio']` type + `_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default. Docstring block enumerates both: - `'ipc'` (default, currently the only implemented mode): no child-side SIGINT handler — `trio.run()` is on the fork-inherited non-main thread where `signal.set_wakeup_fd()` is main-thread-only, so cancellation flows exclusively via the parent's `Portal.cancel_actor()` IPC path. Known gap: orphan children don't respond to SIGINT (`test_orphaned_subactor_sigint_cleanup_DRAFT`) - `'trio'` (scaffolded only): manual SIGINT → trio-cancel bridge in the fork-child prelude so external Ctrl-C reaches stuck grandchildren even w/ a dead parent - `subint_forkserver_proc` pulls `child_sigint` out of `proc_kwargs` (matches how `trio_proc` threads config to `open_process`, keeps `start_actor(proc_kwargs=...)` as the ergonomic entry point); validates membership + raises `NotImplementedError` for `'trio'` at the backend-entry guard - `_child_target` grows a `match child_sigint:` arm that slots in the future `'trio'` impl without restructuring — today only the `'ipc'` case is reachable - module docstring "Still-open work" list grows a bullet pointing at this config + the xfail'd orphan-SIGINT test No behavioral change on the default path — `'ipc'` is the existing flow. Scaffolding only. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 20:08:30 -04:00
Gud Boi	253e7cbd1c	Add DRAFT `subint_forkserver` orphan-SIGINT test Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT` documents an empirical SIGINT-delivery gap in the `subint_forkserver` backend: when the parent dies via `SIGKILL` (no IPC `Portal.cancel_actor()` possible) and `SIGINT` is sent to the orphan child, the child DOES NOT unwind; CPython's default `KeyboardInterrupt` is delivered to `threading.main_thread()`, whose tstate is dead in the post-fork child bc fork inherited the worker thread, not main. Trio running on the fork-inherited worker thread therefore never observes the signal. Marked `xfail(strict=True)` so the mark flips to XPASS→fail once the backend grows explicit SIGINT plumbing. Deats, - harness runs the failure-mode sequence out-of-process: 1. harness subprocess runs a fresh Python script that calls `try_set_start_method('subint_forkserver')` then opens a root actor + one `sleep_forever` subactor 2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers off harness `stdout` to confirm IPC handshake completed 3. `SIGKILL` the parent, `proc.wait()` to reap the zombie (otherwise `os.kill(pid, 0)` keeps reporting it alive) 4. assert the child survived the parent-reap (i.e. was actually orphaned, not reaped too) before moving on 5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)` every 100ms for up to 10s - supporting helpers: `_read_marker()` with per-proc bytes-buffer to carry partial lines across calls, `_process_alive()` liveness probe via `kill(pid, 0)` - Linux-only via `platform.system() != 'Linux'` skip, since orphan-reparenting semantics don't generalize to other platforms - port offset (`reg_addr[1] + 17`) so the harness listener doesn't race concurrently-running backend tests - best-effort `finally:` cleanup: `SIGKILL` any still-alive pids + `proc.kill()` + bounded `proc.wait()` to avoid leaking orphans across the session Also, tier-4 header comment documents the cross-backend generalization path: applicable to any multi-process backend (`trio`, `mp_spawn`, `mp_forkserver`, `subint_forkserver`), NOT to plain `subint` (in-process subints have no orphan OS-child). Move path: lift harness into `tests/_orphan_harness.py`, parametrize on session `_spawn_method`, add `skipif _spawn_method == 'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 19:39:41 -04:00
Gud Boi	27cc02e83d	Refactor `_runtime_vars` into pure get/set API Post-fork `_runtime_vars` reset in `subint_forkserver_proc` was previously done via direct mutation of `_state._runtime_vars` from an external module + an inline default dict duplicating the `_state.py`-internal defaults. Split the access surface into a pure getter + explicit setter so the reset call site becomes a one-liner composition. Deats `tractor/runtime/_state.py`, - extract initial values into a module-level `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the live `_runtime_vars` is now initialised from `dict(_RUNTIME_VARS_DEFAULTS)` - `get_runtime_vars()` grows a `clear_values: bool = False` kwarg. When True, returns a fresh copy of `_RUNTIME_VARS_DEFAULTS` instead of the live dict; still a pure read, never mutates anything - new `set_runtime_vars(rtvars: dict | RuntimeVars)`: atomic replacement of the live dict's contents via `.clear()` + `.update()`, so existing references to the same dict object remain valid. Accepts either the historical dict form or the `RuntimeVars` struct Deats `tractor/spawn/_subint_forkserver.py`, - collapse the prior ad-hoc `.update({...})` block into `set_runtime_vars(get_runtime_vars(clear_values=True))` - drop the `_state._current_actor = None` line: `_trio_main` unconditionally overwrites it downstream, so no explicit reset needed (noted in the XXX comment) (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 19:10:06 -04:00
Gud Boi	66dda9e449	Reset post-fork `_state` in forkserver child `os.fork()` inherits the parent's entire memory image, including `tractor.runtime._state` globals that encode "this process is the root actor" — `_runtime_vars`'s `_is_root=True`, pre-populated `_root_mailbox` + `_registry_addrs`, and the parent's `_current_actor` singleton. A fresh `exec`-based child starts with those globals at their module-level defaults (all falsey/empty). The forkserver child needs to match that shape BEFORE calling `_actor_child_main()`, otherwise `Actor.__init__()` takes the `is_root_process() == True` branch and pre-populates `self.enable_modules`, which then trips `assert not self.enable_modules` at the top of `Actor._from_parent()` on the subsequent parent→child `SpawnSpec` handshake. Fix: at the start of `_child_target`, null `_state._current_actor` and overwrite `_runtime_vars` with a cold-root blank (`_is_root=False`, empty mailbox/addrs, `_debug_mode=False`) before `_actor_child_main()` runs. Found-via: `test_subint_forkserver_spawn_basic` hitting the `enable_modules` assert on child-side runtime boot. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 19:01:27 -04:00
Gud Boi	43bd6a6410	Wire `subint_forkserver` as first-class backend Promote `_subint_forkserver` from primitives-only into a registered spawn backend: `'subint_forkserver'` is now a `SpawnMethodKey` literal, dispatched via `_methods` to the new `subint_forkserver_proc()` target, feature-gated under the existing `subint`-family py3.14+ case, and selectable via `--spawn-backend=subint_forkserver`. Deats, - new `subint_forkserver_proc()` spawn target in `_subint_forkserver`: - mirrors `trio_proc()`'s supervision model: real OS subprocess so `Portal.cancel_actor()` + `soft_kill()` on graceful teardown, `os.kill(SIGKILL)` on hard-reap (no `_interpreters.destroy()` race to fuss over bc the child lives in its own process) - only real diff from `trio_proc` is the spawn mechanism: fork from a main-interp worker thread via `fork_from_worker_thread()` (off-loaded to trio's thread pool) instead of `trio.lowlevel.open_process()` - child-side `_child_target` closure runs `tractor._child._actor_child_main()` with `spawn_method='trio'`; the child is a regular trio actor, "subint_forkserver" names how the parent spawned, not what the child runs - new `_ForkedProc` class, a thin `trio.Process`-compatible shim around a raw OS pid: `.poll()` via `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio cache thread, `.kill()` via `SIGKILL`, `.returncode` cached for repeat calls. `.stdin`/`.stdout`/`.stderr` are `None` (fork-w/o-exec inherits parent FDs; we don't marshal them) which matches `soft_kill()`'s `is not None` guards Also, new backend-tier test `test_subint_forkserver_spawn_basic` drives the registered backend end-to-end via `open_root_actor` + `open_nursery` + `run_in_actor` w/ a trivial portal-RPC round-trip. Uses a `forkserver_spawn_method` fixture to flip `_spawn_method`/`_ctx` for the test's duration + restore on teardown (so other session-level tests don't observe the global flip). Test module docstring reworked to describe the three tiers now covered: (1) primitive-level, (2) parent-trio-driven primitives, (3) full registered backend. Status: still-open work (tracked on `tractor#379`) doc'd inline in the module docstring: no cancel/hard-kill stress coverage yet, child-side subint-hosted root runtime still future (gated on `msgspec#563`), thread-hygiene audit pending the same unblock. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 18:49:23 -04:00
Gud Boi	5bd5f957d3	Add `subint_forkserver` PEP 684 audit-plan doc Follow-up tracker companion to the module-docstring TODO added in `372a0f32`. Catalogs why `_subint_forkserver`'s two "non-trio thread" constraints (`fork_from_worker_thread()` + `run_subint_in_worker_thread()` both allocating dedicated `threading.Thread`s; test helper named `run_fork_in_non_trio_thread`) exist today, and which of them would dissolve once msgspec PEP 684 support ships (`msgspec#563`) and tractor flips to isolated-mode subints. Deats, - three reasons enumerated for the current constraints: - class-A GIL-starvation — fixed by isolated mode: subints don't share main's GIL so abandoned-thread contention disappears - destroy race / tstate-recycling from `subint_proc` — unclear: `_PyXI_Enter` + `_PyXI_Exit` are cross-mode, so isolated doesn't obviously fix it; needs empirical retest on py3.14 + isolated API - fork-from-main-interp-tstate (the CPython-level `_PyInterpreterState_DeleteExceptMain` gate) — the narrow reason for using a dedicated thread; probably fixed IF the destroy-race also resolves (bc trio's cache threads never drove subints → clean main-interp tstate) - TL;DR table of which constraints unwind under each resolution branch - four-step audit plan for when `msgspec#563` lands: - flip `_subint` to isolated mode - empirical destroy-race retest - audit `_subint_forkserver.py` — drop `non_trio` qualifier / maybe inline primitives - doc fallout — close the three `subint_*_issue.md` siblings w/ post-mortem notes Also, cross-refs the three sibling `conc-anal/` docs, PEPs 684 + 734, `msgspec#563`, and `tractor#379` (the overall subint spawn-backend tracking issue). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 18:18:30 -04:00
Gud Boi	1d4867e51c	Add trio-parent tests for `_subint_forkserver` New pytest module `tests/spawn/test_subint_forkserver.py` drives the forkserver primitives from inside a real `trio.run()` in the parent — the runtime shape tractor will actually use when we wire up a `subint_forkserver` spawn backend proper. Complements the standalone no-trio-in-parent `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`. Deats, - new test pkg `tests/spawn/` (+ empty `__init__.py`) - two tests, both `@pytest.mark.timeout(30, method='thread')` for the GIL-hostage safety reason doc'd in `ai/conc-anal/subint_sigint_starvation_issue.md`: - `test_fork_from_worker_thread_via_trio` — parent-side plumbing baseline. `trio.run()` off-loads forkserver prims via `trio.to_thread.run_sync()` + asserts the child reaps cleanly - `test_fork_and_run_trio_in_child` — end-to-end: forked child calls `run_subint_in_worker_thread()` with a bootstrap str that does `trio.run()` in a fresh subint - both tests wrap the inner `trio.run()` in a `dump_on_hang()` for post-mortem if the outer `pytest-timeout` fires - intentionally NOT using `--spawn-backend` — the tests drive the primitives directly rather than going through tractor's spawn-method registry (which the forkserver isn't plugged into yet) Also, rename `run_trio_in_subint()` → `run_subint_in_worker_thread()` for naming consistency with the sibling `fork_from_worker_thread()`. The action is really "host a subint on a worker thread", not specifically "run trio" — trio just happens to be the typical payload. Propagate the rename to the smoketest. Further, add a "TODO — cleanup gated on msgspec PEP 684 support" section to the `_subint_forkserver` module docstring: flags the dedicated-`threading.Thread` design as potentially-revisable once isolated-mode subints are viable in tractor. Cross-refs `msgspec#563` + `tractor#379` and points at an audit-plan conc-anal doc we'll add next. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 18:00:06 -04:00
Gud Boi	372a0f3247	Lift fork prims into `_subint_forkserver` mod The smoketest (prior commit) empirically validated the "fork-from-main-interp-worker-thread" arch on py3.14. Promote the validated primitives out of the `ai/conc-anal/` smoketest into `tractor.spawn._subint_forkserver` so they can eventually be wired into a real "subint forkserver" spawn backend. Deats, - new module `tractor/spawn/_subint_forkserver.py` (337 LOC): - `fork_from_worker_thread(child_target, thread_name)` — spawn a main-interp `threading.Thread`, call `os.fork()` from it, shuttle the child pid back to main via a pipe - `run_trio_in_subint(bootstrap, ...)` — post-fork helper: create a fresh subint + drive `_interpreters.exec()` on a dedicated worker thread running the `bootstrap` str (typically imports `trio`, defines an async entry, calls `trio.run()`) - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` + pass/fail classification reusable from harness AND the eventual real spawn path - feature-gated py3.14+ via the public `concurrent.interpreters` presence check; matches the gate in `tractor.spawn._subint` - module docstring doc's the CPython-block context (cross-refs `_subint_fork` stub + the two `conc-anal/` docs) and status: EXPERIMENTAL, not yet registered in `_spawn._methods` Also, refactor the smoketest `ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to import the primitives from the new module rather than inline its own copies. Keeps the smoketest and the tractor-side impl in sync as the forkserver design evolves; the smoketest remains a zero-`tractor`-runtime CPython-level check (imports ONLY the three primitives, no runtime bring-up). Status: next step is to drive these from a parent-side `trio.run()` and hook the returned child pid into the normal actor-nursery/IPC flow — then register `subint_forkserver` as a `SpawnMethodKey` in `_spawn.py`. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 17:03:15 -04:00
Gud Boi	37cfaa202b	Add CPython-level `subint_fork` workaround smoketest Standalone script to validate the "main-interp worker-thread forkserver + subint-hosted trio" arch proposed as a workaround to the CPython-level refusal doc'd in `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Deliberately NOT a `tractor` test — zero `tractor` imports. Uses `_interpreters` (private stdlib) + `os.fork()` directly so pass/fail is a property of CPython alone, independent of our runtime. Requires py3.14+. Deats, - four scenarios via `--scenario`: - `control_subint_thread_fork` — the KNOWN-BROKEN case as a harness sanity; if the child DOESN'T abort, our analysis is wrong - `main_thread_fork` — baseline sanity, must always succeed - `worker_thread_fork` — architectural assertion: regular `threading.Thread` attached to main interp calls `os.fork()`; child should survive post-fork cleanup - `full_architecture` — end-to-end: fork from a main-interp worker thread, then in child create a subint driving a worker thread running `trio.run()` - exit code 0 on EXPECTED outcome (for `control_*` that means "child aborted", not "child succeeded") - each scenario prints a self-contained pass/fail banner; use `os.waitpid()` of the parent + per-scenario status prints to observe the child's fate Also, log NLNet provenance for this session's three-sub-phase work (py3.13 gate tightening, `pytest-timeout` + marker refactor, `subint_fork` prototype → CPython-block finding). Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 16:40:52 -04:00
Gud Boi	797f57ce7b	Doc `subint_fork` as blocked by CPython post-fork Empirical finding: the WIP `subint_fork_proc` scaffold landed in `cf0e3e6f` does not work on current CPython. The `fork()` syscall succeeds in the parent, but the CHILD aborts immediately during `PyOS_AfterFork_Child()` → `_PyInterpreterState_DeleteExceptMain()`, which gates on the current tstate belonging to the main interp — the child dies with `Fatal Python error: not main interpreter`. CPython devs acknowledge the fragility with an in-source comment (`// Ideally we could guarantee tstate is running main.`) but expose no user-facing hook to satisfy the precondition — so the strategy is structurally dead until upstream changes. Rather than delete the scaffold, reshape it into a documented dead-end so the next person with this idea lands on the reason rather than rediscovering the same CPython-level refusal. Deats, - Move `subint_fork_proc` out of `tractor.spawn._subint` into a new `tractor.spawn._subint_fork` dedicated module (153 LOC). Module + fn docstrings now describe the blockage directly; the fn body is trimmed to a `NotImplementedError` pointing at the analysis doc — no more dead-code `bootstrap` sketch bloating `_subint.py`. - `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey` + the `_methods` dispatch so `--spawn-backend=subint_fork` routes to a clean `NotImplementedError` rather than "invalid backend"; comment calls out the blockage. Collapse the duplicate py3.14 feature-gate in `try_set_start_method()` into a combined `case 'subint' | 'subint_fork':` arm. - New 337-line analysis: `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`. Annotated walkthrough from the user-visible fatal error down to the specific `Modules/posixmodule.c` + `Python/pystate.c` source lines enforcing the refusal, plus an upstream-report draft. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 16:02:01 -04:00
Gud Boi	cf0e3e6f8b	Add WIP `subint_fork_proc` backend scaffold Experimental third spawn backend: use a fresh sub-interpreter purely as a trio-free launchpad from which to `os.fork()` + exec back into `python -m tractor._child`. Per issue #379's "fork()-workaround/hacks" thread. Intent is to sidestep both, - the trio+fork hazards hitting `trio_proc` (python-trio/trio#1614 et al.), since the forking interp is guaranteed trio-free. - the shared-GIL abandoned-thread hazards hitting `subint_proc` (`ai/conc-anal/subint_sigint_starvation_issue.md`), since we don't stay in the subint — it only lives long enough to call `os.fork()` Downstream of the fork+exec, all the existing `trio_proc` plumbing is reused verbatim: `ipc_server.wait_for_peer()`, `SpawnSpec`, `Portal` yield, soft-kill. Status: NOT wired up beyond scaffolding. The fn raises `NotImplementedError` immediately; the `bootstrap` fork/exec string builder and the `# TODO: orchestrate driver thread` block are kept in-tree as deliberate dead code so the next iteration starts from a concrete shape rather than a blank page. Docstring calls out three open questions that need empirical validation before wiring this up: 1. Does CPython permit `os.fork()` from a non-main legacy subint? 2. Can the child stay fork-without-exec and `trio.run()` directly from within the launchpad subint? 3. How do `signal.set_wakeup_fd()` handlers and other process-global state interact when the forking thread is inside a subint? (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-22 13:32:39 -04:00
Gud Boi	99d70337b7	Mark `subint`-hanging tests with `skipon_spawn_backend` Adopt the `@pytest.mark.skipon_spawn_backend('subint', reason=...)` marker (`a617b521`) across the suites reproducing the `subint` GIL-contention / starvation hang classes doc'd in `ai/conc-anal/subint_*_issue.md`. Deats, - Module-level `pytestmark` on full-file-hanging suites: - `tests/test_cancellation.py` - `tests/test_inter_peer_cancellation.py` - `tests/test_pubsub.py` - `tests/test_shm.py` - Per-test decorator where only one test in the file hangs: - `tests/discovery/test_registrar.py ::test_stale_entry_is_deleted` — replaces the inline `if start_method == 'subint': pytest.skip` branch with a declarative skip. - `tests/test_subint_cancellation.py ::test_subint_non_checkpointing_child`. - A few per-test decorators are left commented-in-place as breadcrumbs for later finer-grained unskips. Also, some nearby tidying in the affected files: - Annotate loose fixture / test params (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in `tests/conftest.py`, `tests/devx/conftest.py`, and `tests/test_cancellation.py`. - Normalize `"""..."""` → `'''...'''` docstrings per repo convention on a few touched tests. - Add `timeout=6` / `timeout=10` to `@tractor_test(...)` on `test_cancel_infinite_streamer` and `test_some_cancels_all`. - Drop redundant `spawn_backend` param from `test_cancel_via_SIGINT`; use `start_method` in the `'mp' in ...` check instead. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-21 21:33:15 -04:00
Gud Boi	a617b52140	Add `skipon_spawn_backend` pytest marker A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...], reason='...')` marker for backend-specific known-hang / -borked cases — avoids scattering `@pytest.mark.skipif(lambda ...)` branches across tests that misbehave under a particular `--spawn-backend`. Deats, - `pytest_configure()` registers the marker via `addinivalue_line('markers', ...)`. - New `pytest_collection_modifyitems()` hook walks each collected item with `item.iter_markers( name='skipon_spawn_backend')`, checks whether the active `--spawn-backend` appears in `mark.args`, and if so injects a concrete `pytest.mark.skip( reason=...)`. `iter_markers()` makes the decorator work at function, class, or module (`pytestmark = [...]`) scope transparently. - First matching mark wins; default reason is `f'Borked on --spawn-backend={backend!r}'` if the caller doesn't supply one. Also, tighten type annotations on nearby `pytest` integration points — `pytest_configure`, `debug_mode`, `spawn_backend`, `tpt_protos`, `tpt_proto` — now taking typed `pytest.Config` / `pytest.FixtureRequest` params. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-21 21:24:51 -04:00
Gud Boi	e3318a5483	Expand `subint` sigint-starvation hang catalog Add two more tests to the catalog in `conc-anal/subint_sigint_starvation_issue.md` — same signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe → silent SIGINT drop), different load patterns. Deats, - `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so the grandchild persists in its abandoned driver thread. Often only manifests under full-suite runs (earlier tests seed the abandoned-thread pool). - `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors all go through teardown on the multierror. Bounded hard-kills run in parallel — so the total budget is ~3s, not 3s × 25. Leaves 25 abandoned driver threads simultaneously alive, an extreme pressure multiplier. `strace` shows several successful `write(16, "\2", 1) = 1` (GIL round-robin IS giving main brief slices) before finally saturating with `EAGAIN`. Also include a `pstree -snapt <pid>` capture showing 16+ live `{subint-driver[<interp_id>}` threads at the moment of hang — the direct GIL-contender population. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-21 17:42:37 -04:00
Gud Boi	e2b1035ff0	Skip `test_stale_entry_is_deleted` hanger with `subint`s	2026-04-21 13:36:41 -04:00
Gud Boi	17f1ea5910	Add global 200s `pytest-timeout`	2026-04-21 13:33:34 -04:00
Gud Boi	9bd113154b	Bump lock-file for `pytest-timeout` + 3.13 gated wheel-deps	2026-04-20 20:57:26 -04:00
Gud Boi	5ea5fb211d	Wall-cap `subint` audit tests via `pytest-timeout` Add a hard process-level wall-clock bound on the two known-hanging subint-backend tests so an unattended suite run can't wedge indefinitely in either of the hang classes doc'd in `ai/conc-anal/`. Deats, - New `testing` dep: `pytest-timeout>=2.3`. - `test_stale_entry_is_deleted`: `@pytest.mark.timeout(3, method='thread')`. The `method='thread'` choice is deliberate — `method='signal'` routes via `SIGALRM` which is starved by the same GIL-hostage path that drops `SIGINT` (see `subint_sigint_starvation_issue.md`), so it'd never actually fire in the starvation case. - `test_subint_non_checkpointing_child`: same decorator, same reasoning (defense-in-depth over the inner `trio.fail_after(15)`). At timeout, `pytest-timeout` hard-kills the pytest process itself — that's the intended behavior here; the alternative is the suite never returning. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 20:45:56 -04:00
Gud Boi	489dc6d0cc	Add prompt-io log for `subint` hang-class docs Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint` backend hang classes + arm `dump_on_hang`"). Substantive bc the two new `ai/conc-anal/` docs were jointly authored — user framed the two-class split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang; claude drafted the prose and the test-side cross-linking comments. `.raw.md` is in diff-ref mode — per-file pointers via `git diff e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that already lives in `git log -p`. Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:41:07 -04:00
Gud Boi	35796ec8ae	Doc `subint` backend hang classes + arm `dump_on_hang` Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch. Deats, two new `ai/conc-anal/` docs, - `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un-Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563) - `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown — no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown Cross-link the docs from their test reproducers, - `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un-skipped so the dump file is inspectable - `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:41:07 -04:00
Gud Boi	baf7ec54ac	Add `subint` cancellation + hard-kill test audit Lock in the escape-hatch machinery added to `tractor.spawn._subint` during the Phase B.2/B.3 bringup (issue #379) so future stdlib regressions or our own refactors don't silently re-introduce the mid-suite hangs. Deats, - `test_subint_happy_teardown`: baseline — spawn a subactor, one portal RPC, clean teardown. If this breaks, something's wrong unrelated to the hard-kill shields. - `test_subint_non_checkpointing_child`: cancel a subactor stuck in a non-checkpointing Python loop (`threading.Event.wait()` releases the GIL but never inserts a trio checkpoint). Validates the bounded-shield + daemon-driver-thread combo abandons the thread after `_HARD_KILL_TIMEOUT`. Every test is wrapped in `trio.fail_after()` for a deterministic per-test wall-clock ceiling (an unbounded audit would defeat itself) and arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump — pytest's stderr capture swallows `faulthandler` output by default. Gated via `pytest.importorskip('concurrent.interpreters')` and a module-level skip when `--spawn-backend` isn't `'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:41:07 -04:00
Gud Boi	79390a4e3a	Raise `subint` floor to py3.14 and split dep-groups The private `_interpreters` C module ships since 3.13, but that vintage wedges under our `threading.Thread` + multi-trio usage pattern —> `_interpreters.exec()` silently never makes progress. 3.14 fixes it. So gate on the presence of the public `concurrent.interpreters` wrapper (3.14+ only) even tho we still call into the private module at runtime. Deats, - `try_set_start_method('subint')` error msg + `_subint` module docstring/comments rewritten to document the 3.14 floor and why 3.13 can't work. - `_subint._has_subints` gate now imports `concurrent.interpreters` (not `_interpreters`) as the version sentinel. Also, reshuffle `pyproject.toml` deps into per-python-version `[tool.uv.dependency-groups]`: - `subints` group: `msgspec>=0.21.0`, py>=3.14 - `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14 - `sync_pause` group: `greenback`, py>=3.13,<3.14 (was in `devx`; moved out bc no 3.14 yet) Bump top-level `msgspec>=0.20.0` too. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:41:03 -04:00
Gud Boi	dbe2e8bd82	Add `._debug_hangs` to `.devx` for hang triage Bottle up the diagnostic primitives that actually cracked the silent mid-suite hangs in the `subint` spawn-backend bringup (issue there" session has them on the shelf instead of reinventing from scratch. Deats, - `dump_on_hang(seconds, *, path)` — context manager wrapping `faulthandler.dump_traceback_later()`. Critical gotcha baked in: dumps go to a file, not `sys.stderr`, bc pytest's stderr capture silently eats the output and you can spend an hour convinced you're looking at the wrong thing - `track_resource_deltas(label, *, writer)` — context manager logging per-block `(threading.active_count(), len(_interpreters.list_all()))` deltas; quickly rules out leak-accumulation theories when a suite progressively worsens (if counts don't grow, it's not a leak, look for a race on shared cleanup instead) - `resource_delta_fixture(*, autouse, writer)` — factory returning a `pytest` fixture wrapping `track_resource_deltas` per-test; opt in by importing into a `conftest.py`. Kept as a factory (not a bare fixture) so callers own `autouse` / `writer` wiring Also, - export the three names from `tractor.devx` - dep-free on py<3.13 (swallows `ImportError` for `_interpreters`) - link back to the provenance in the module docstring (issue #379 / commit `26fb820`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:39:32 -04:00
Gud Boi	fe753f9656	Bound subint teardown shields with hard-kill timeout Unbounded `trio.CancelScope(shield=True)` at the soft-kill and thread-join sites can wedge the parent trio loop indefinitely when a stuck subint ignores portal-cancel (e.g. bc the IPC channel is already broken). Deats, - add `_HARD_KILL_TIMEOUT` (3s) module-level const - wrap both shield sites with `trio.move_on_after()` so we abandon a stuck subint after the deadline - flip driver thread to `daemon=True` so proc-exit also isn't blocked by a wedged subint - pass `abandon_on_cancel=True` to `trio.to_thread.run_sync(driver_thread.join)` — load-bearing for `move_on_after` to actually fire - log warnings when either timeout triggers - improve `InterpreterError` log msg to explain the abandoned-thread scenario (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:39:32 -04:00
Gud Boi	b48503085f	Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:39:32 -04:00
Gud Boi	3869a9b468	Fix subint destroy race via dedicated OS thread `trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on a cached worker thread — and when that thread is returned to the cache after the subint's `trio.run()` exits, CPython still keeps the subint's tstate attached to the (now idle) worker. Result: the teardown `_interpreters.destroy(interp_id)` in the `finally` block can block the parent's trio loop indefinitely, waiting for a tstate release that only happens when the worker either picks up a new job or exits. Manifested as intermittent mid-suite hangs under `--spawn-backend=subint` — caught by a `faulthandler.dump_traceback_later()` showing the main thread stuck in `_interpreters.destroy()` at `_subint.py:293` with only an idle trio-cache worker as the other live thread. Deats, - drive the subint on a plain `threading.Thread` (not `trio.to_thread`) so the OS thread truly exits after `_interpreters.exec()` returns, releasing tstate and unblocking destroy - signal `subint_exited.set()` back to the parent trio loop from the driver thread via `trio.from_thread.run_sync(..., trio_token=...)` — capture the token at `subint_proc` entry - swallow `trio.RunFinishedError` in that signal path for the case where parent trio has already exited (proc teardown) - in the teardown `finally`, off-load the sync `driver_thread.join()` to `trio.to_thread.run_sync` (cache thread w/ no subint tstate → safe) so we actually wait for the driver to exit before `_interpreters.destroy()` (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:39:32 -04:00
Gud Boi	c2d8e967aa	Doc the `_interpreters` private-API choice in `_subint` Expand the comment block above the `_interpreters` import explaining why we use the private C mod over `concurrent.interpreters`: the public API only exposes PEP 734's `'isolated'` config which breaks `msgspec` (missing PEP 684 slot). Add reference links to PEP 734, PEP 684, cpython sources, and the msgspec upstream tracker (jcrist/msgspec#563). Also, - update error msgs in both `_spawn.py` and `_subint.py` to say "3.13+" (matching the actual `_interpreters` availability) instead of "3.14+". - tweak the mod docstring to reflect py3.13+ availability via the private C module. Review: PR #444 (copilot-pull-request-reviewer) https://github.com/goodboy/tractor/pull/444 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:39:32 -04:00
Gud Boi	1037c05eaf	Avoid skip `.ipc._ringbuf` import when no `cffi`	2026-04-20 16:39:32 -04:00
Gud Boi	959fc48806	Impl min-viable `subint` spawn backend (B.2) Replace the B.1 scaffold stub w/ a working spawn flow driving PEP 734 sub-interpreters on dedicated OS threads. Deats, - use private `_interpreters` C mod (not the public `concurrent.interpreters` API) to get `'legacy'` subint config — avoids PEP 684 C-ext compat issues w/ `msgspec` and other deps missing the `Py_mod_multiple_interpreters` slot - bootstrap subint via code-string calling new `_actor_child_main()` from `_child.py` (shared entry for both CLI and subint backends) - drive subint lifetime on an OS thread using `trio.to_thread.run_sync(_interpreters.exec, ..)` - full supervision lifecycle mirrors `trio_proc`: `ipc_server.wait_for_peer()` → send `SpawnSpec` → yield `Portal` via `task_status.started()` - graceful shutdown awaits the subint's inner `trio.run()` completing; cancel path sends `portal.cancel_actor()` then waits for thread join before `_interpreters.destroy()` Also, - extract `_actor_child_main()` from `_child.py` `__main__` block as callable entry shape bc the subint needs it for code-string bootstrap - add `"subint"` to the `_runtime.py` spawn-method check so child accepts `SpawnSpec` over IPC Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:37:21 -04:00
Gud Boi	2e9dbc5b12	Handle py3.14+ incompats as test skips Since we're devving subints we require the 3.14+ stdlib API and a couple compiled libs don't support it yet, namely: - `cffi`, which we're only using for the `.ipc._linux` eventfd stuff (now factored into `hotbaud` anyway). - `greenback`, which requires `greenlet` which doesn't seem to be wheeled yet * on nixos the sdist build was failing due to lack of `g++` which i don't care to figure out rn since we don't need `.devx` stuff immediately for this subints prototype. * [ ] we still need to adjust any dependent suites to skip. Adjust `test_ringbuf` to skip on import failure. Also project wide, - pin us to py 3.13+ in prep for last-2-minor-version policy. - drop `msgspec>=0.20.0`, the first release with py3.14 support.	2026-04-20 16:30:46 -04:00
Gud Boi	7cee62ce42	Add `'subint'` spawn backend scaffold (#379) Land the scaffolding for a future sub-interpreter (PEP 734 `concurrent.interpreters`) actor spawn backend per issue #379. The spawn flow itself is not yet implemented; `subint_proc()` raises a placeholder `NotImplementedError` pointing at the tracking issue — this commit only wires up the registry, the py-version gate, and the harness. Deats, - bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and list the `3.14` classifier — the new stdlib `concurrent.interpreters` module only ships on 3.14 - extend `SpawnMethodKey = Literal[..., 'subint']` - `try_set_start_method('subint')` grows a new `match` arm that feature-detects the stdlib module and raises `RuntimeError` with a clear banner on py<3.14 - `_methods` registers the new `subint_proc()` via the same bottom-of-module late-import pattern used for `._trio` / `._mp` Also, - new `tractor/spawn/_subint.py` — top-level `try: from concurrent import interpreters` guards `_has_subints: bool`; `subint_proc()` signature mirrors `trio_proc`/`mp_proc` so the Phase B.2 impl can drop in without touching the registry - re-add `import sys` to `_spawn.py` (needed for the py-version msg in the gate-error) - `_testing.pytest.pytest_configure` wraps `try_set_start_method()` in a `pytest.UsageError` handler so `--spawn-backend=subint` on py<3.14 prints a clean banner instead of a traceback (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:09:43 -04:00
Gud Boi	3773ad8b77	Pin `xonsh` to GH `main` in editable mode	2026-04-20 16:09:43 -04:00
Gud Boi	73e83e1474	Bump `xonsh` to latest pre `0.23` release	2026-04-20 16:08:05 -04:00
Gud Boi	355beac082	Expand `/run-tests` venv pre-flight to cover all cases Rework section 3 from a worktree-only check into a structured 3-step flow: detect active venv, interpret results (Case A: active, B: none, C: worktree), then run import + collection checks. Deats, - Case B prompts via `AskUserQuestion` when no venv is detected, offering `uv sync` or manual activate - add `uv run` fallback section for envs where venv activation isn't practical - new allowed-tools: `uv run python`, `uv run pytest`, `uv pip show`, `AskUserQuestion` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-20 16:08:05 -04:00
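
Note on `dbe2e8bd82`: that commit describes `dump_on_hang()` as a context manager wrapping `faulthandler.dump_traceback_later()` which writes to a file because pytest's stderr capture swallows the output. A minimal stdlib-only sketch of that idea might look like the following. The default `seconds`, the default dump path, and the function body are assumptions made for illustration; this is not the actual code in `tractor.devx`.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def dump_on_hang(
    seconds: float = 30.0,
    *,
    path: Optional[str] = None,
) -> Iterator[str]:
    '''
    Arm a watchdog which dumps every thread's traceback to a file
    if the wrapped block is still running after `seconds`.

    Hypothetical re-creation: the default path below is an
    assumption, not tractor's actual default.

    '''
    if path is None:
        path = os.path.join(tempfile.gettempdir(), 'dump_on_hang.log')

    # Keep the file open for the whole block: `faulthandler` writes
    # directly to its file descriptor from a watchdog thread.
    with open(path, 'w') as dump_file:
        faulthandler.dump_traceback_later(
            seconds,
            repeat=False,
            file=dump_file,
            exit=False,
        )
        try:
            yield path
        finally:
            # Disarm before the file closes so a late dump can never
            # hit a closed descriptor.
            faulthandler.cancel_dump_traceback_later()


def simulate_hang(path: str) -> str:
    # A sleep longer than the deadline stands in for a real hang.
    with dump_on_hang(seconds=0.1, path=path):
        time.sleep(0.5)
    with open(path) as f:
        return f.read()
```

Writing to a named file rather than `sys.stderr` means the dump survives whatever output capturing the test runner applies, and the yielded path can be printed or attached to a failure report.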

2595 Commits (a4d6318ca7c1798ea383b03efc1b89d5dfdbf1dd) All Branches Search
