Compare commits

..

53 Commits

Author SHA1 Message Date
Gud Boi a4d6318ca7 Claude-perms: ensure /commit-msg files can be written! 2026-04-23 16:48:06 -04:00
Gud Boi fe89169f1c Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test
   matrix while the remaining bug is being chased.
   The reason string cross-refs the conc-anal doc
   for full context.

2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`) with the
   empirical state after the three nested- cancel fix commits
   (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
   shield break) landed, narrowing the remaining hang from "everything
   broken" to "peer-channel loops don't exit on `service_tn` cancel".

Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804`
  works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
  channel handlers in `handle_stream_from_peer`
  (inbound connections handled by
  `stream_handler_tn`, which IS `service_tn` in the
  current config).
- after `_parent_chan_cs.cancel()` fires, NEW
  shielded loops appear on the session reg_addr
  port — probably discovery-layer reconnection;
  doesn't block teardown but indicates the cascade
  has more moving parts than expected.

The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 16:44:15 -04:00
Gud Boi 57935804e2 Break parent-chan shield during teardown
Completes the nested-cancel deadlock fix started in
0cd0b633 (fork-child FD scrub) and fe540d02 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 16:27:38 -04:00
Gud Boi fe540d0228 Use `pidfd` for cancellable `_ForkedProc.wait`
Two coordinated improvements to the `subint_forkserver` backend:

1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
   abandon_on_cancel=False)` in `_ForkedProc.wait()`
   with `trio.lowlevel.wait_readable(pidfd)`. The
   prior version blocked a trio cache thread on a
   sync syscall — outer cancel scopes couldn't
   unwedge it when something downstream got stuck.
   Same pattern `trio.Process.wait()` and
   `proc_waiter` (the mp backend) already use.

2. Drop the `@pytest.mark.xfail(strict=True)` from
   `test_orphaned_subactor_sigint_cleanup_DRAFT` —
   the test now PASSES after 0cd0b633 (fork-child
   FD scrub). Same root cause as the nested-cancel
   hang: inherited IPC/trio FDs were poisoning the
   child's event loop. Closing them lets SIGINT
   propagation work as designed.

Deats,
- `_ForkedProc.__init__` opens a pidfd via
  `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
  then non-blocking `waitpid(WNOHANG)` to collect
  the exit status (correct since the pidfd signal
  IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
  where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
  semantics) + `__del__` belt-and-braces for
  unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
  `# NOTE` comment explaining the historical
  context + cross-ref to the conc-anal doc; test
  remains in place as a regression guard

The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 16:06:45 -04:00
Gud Boi 0cd0b633f1 Scrub inherited FDs in fork-child prelude
Implements fix-direction (1)/blunt-close-all-FDs from
b71705bd (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
  helper. Linux fast-path enumerates `/proc/self/fd`;
  POSIX fallback uses `RLIMIT_NOFILE` range. Matches
  the stdlib `subprocess._posixsubprocess.close_fds`
  strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
  post-fork child prelude — runs immediately after
  the pid-pipe `os.close(rfd/wfd)`, before the user
  `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
  out the FD-inheritance-cascade mechanism and why
  the close-all approach is safe for our spawn shape

Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 15:30:39 -04:00
Gud Boi b71705bdcd Refine `subint_forkserver` nested-cancel hang diagnosis
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to
   be a self-inflicted cleanup bug: `pkill`-ing a bg
   pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
   the SC graceful cancel cascade entirely. Codified
   the real fix — SIGINT-first ladder w/ bounded
   wait before SIGKILL — in e5e2afb5 (`run-tests`
   SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
   hangs indefinitely** — the actual backend bug,
   and it's a deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers
  are in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never
  reaches its targets
- tree-structure evidence: asymmetric depth across
  two identical `run_in_actor` calls — child 1
  (3 threads) spawns both its grandchildren; child 2
  (1 thread) never completes its first nursery
  `run_in_actor`. Smells like a race on fork-
  inherited state landing differently per spawn
  ordering
- new hypothesis: `os.fork()` from a subactor
  inherits the ROOT parent's IPC listener FDs
  transitively. Grandchildren end up with three
  overlapping FD sets (own + direct-parent + root),
  so IPC routing becomes ambiguous. Predicts bug
  scales with fork depth — matches reality: single-
  level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation
  (each subactor has its own OS process+GIL),
  child-side KBI absorption (`_trio_main` only
  catches KBI at `trio.run()` callsite, reached
  only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit,
  (3) targeted FD cleanup via `actor.ipc_server`
  handle, (4) `os.posix_spawn` w/ `file_actions`.
  Vote: (3) — surgical, doesn't break the "no exec"
  design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
  2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
  level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until
  fix lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 15:21:41 -04:00
Gud Boi e5e2afb5f4 Use SIGINT-first ladder in `run-tests` cleanup
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).

Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
  sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
  `pgrep` to poll for exit. Loop breaks early on
  clean exit
- step 3: `pkill -9` only if graceful timed out, w/
  a logged escalation msg so it's obvious when SC
  teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
  `:1616`-holding zombie that doesn't match the
  cmdline pattern (find PID via `ss -tlnp`, then
  `kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
  unchanged

Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 14:46:16 -04:00
Gud Boi de6016763f Wire `reg_addr` through leaky cancel tests
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 14:37:48 -04:00
Gud Boi d0121960b9 Add `subint_forkserver` test-cancellation leak doc
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 13:58:42 -04:00
Gud Boi aa7450974c Add zombie-actor check to `run-tests` skill
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so every subsequent session can't bind and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.

Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
  bound TCP registry listener — the authoritative
  check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
  _actor_child_main|subint-forkserv` for leftover
  actor/forkserver procs — scoped deliberately so we
  don't false-flag legit long-running tractor-
  embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
  using the same `$(pwd)/py*` pattern, UDS `rm -f`,
  re-verify) plus a fallback for when a zombie holds
  :1616 but doesn't match the pattern: `ss -tlnp` →
  kill by PID
- explicit false-positive warning calling out the
  `piker` case (`~/repos/piker/py*/bin/python3 -m
  tractor._child ...`) so a bare `pgrep` doesn't lead
  to nuking unrelated apps

Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 13:43:57 -04:00
Gud Boi 6720c3bbb3 Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg
Also, some slight touchups in `.spawn._subint`.
2026-04-23 11:50:19 -04:00
Gud Boi 4425023500 Label forkserver child as `subint_forkserver`
Follow-up to 72d1b901 (was prev commit adding `debug_mode` for
`subint_forkserver`): that commit wired the runtime-side
`subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the
`subint_forkserver_proc` child-target was still passing
`spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines
would report the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to `'subint_forkserver'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 11:43:06 -04:00
Gud Boi 72d1b9016a Enable `debug_mode` for `subint_forkserver`
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key.
  off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 11:39:42 -04:00
Gud Boi 4227eabcba Drop unneeded f-str prefixes 2026-04-23 11:04:10 -04:00
Gud Boi af94080b21 Shorten some timeouts in `subint_forkserver` suites 2026-04-23 11:01:56 -04:00
Gud Boi 36cca96385 Refine `subint_forkserver` orphan-SIGINT diagnosis
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:

- `threading.current_thread() is threading.main_thread()`
  IS True post-fork — CPython re-designates the
  fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
  subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread

But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.

Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
  (+385 LOC): full writeup — TL;DR, symptom reproducer,
  the "intentional cancel path" the bug defeats,
  diagnostic evidence (`faulthandler` output +
  `getsignal` probe), ruled-out hypotheses
  (non-main-thread issue, wakeup-fd inheritance,
  KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
  `reason` + test docstring rewritten to match the
  refined understanding — old wording blamed the
  non-main-thread path, new wording points at the
  `epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
  `child_sigint='trio'` bullet updated: now notes trio's
  handler is already correctly installed, so the flag may
  end up a no-op / doc-only mode once the real root cause
  is fixed

Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 09:31:32 -04:00
Gud Boi 8a3f94ace0 Scaffold `child_sigint` modes for forkserver
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.

Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
  `_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
  Docstring block enumerates both:
  - `'ipc'` (default, currently the only implemented mode):
    no child-side SIGINT handler — `trio.run()` is on the
    fork-inherited non-main thread where
    `signal.set_wakeup_fd()` is main-thread-only, so
    cancellation flows exclusively via the parent's
    `Portal.cancel_actor()` IPC path. Known gap: orphan
    children don't respond to SIGINT
    (`test_orphaned_subactor_sigint_cleanup_DRAFT`)
  - `'trio'` (scaffolded only): manual SIGINT → trio-cancel
    bridge in the fork-child prelude so external Ctrl-C
    reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
  `proc_kwargs` (matches how `trio_proc` threads config to
  `open_process`, keeps `start_actor(proc_kwargs=...)` as
  the ergonomic entry point); validates membership + raises
  `NotImplementedError` for `'trio'` at the backend-entry
  guard
- `_child_target` grows a `match child_sigint:` arm that
  slots in the future `'trio'` impl without restructuring
  — today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
  pointing at this config + the xfail'd orphan-SIGINT test

No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 20:08:30 -04:00
Gud Boi 253e7cbd1c Add DRAFT `subint_forkserver` orphan-SIGINT test
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.

Deats,
- harness runs the failure-mode sequence out-of-process:
  1. harness subprocess runs a fresh Python script
     that calls `try_set_start_method('subint_forkserver')`
     then opens a root actor + one `sleep_forever` subactor
  2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
     off harness `stdout` to confirm IPC handshake
     completed
  3. `SIGKILL` the parent, `proc.wait()` to reap the
     zombie (otherwise `os.kill(pid, 0)` keeps reporting
     it alive)
  4. assert the child survived the parent-reap (i.e. was
     actually orphaned, not reaped too) before moving on
  5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
     every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
  bytes-buffer to carry partial lines across calls,
  `_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
  orphan-reparenting semantics don't generalize to
  other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
  doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
  pids + `proc.kill()` + bounded `proc.wait()` to avoid
  leaking orphans across the session

Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 19:39:41 -04:00
Gud Boi 27cc02e83d Refactor `_runtime_vars` into pure get/set API
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.

Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
  live `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
  kwarg. When True, returns a fresh copy of
  `_RUNTIME_VARS_DEFAULTS` instead of the live dict —
  still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
  atomic replacement of the live dict's contents via
  `.clear()` + `.update()`, so existing references to the
  same dict object remain valid. Accepts either the
  historical dict form or the `RuntimeVars` struct

Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
  `set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
  `_trio_main` unconditionally overwrites it downstream,
  so no explicit reset needed (noted in the XXX comment)

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 19:10:06 -04:00
Gud Boi 66dda9e449 Reset post-fork `_state` in forkserver child
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode
"this process is the root actor" — `_runtime_vars`'s
`_is_root=True`, pre-populated `_root_mailbox` +
`_registry_addrs`, and the parent's `_current_actor`
singleton.

A fresh `exec`-based child starts with those globals at
their module-level defaults (all falsey/empty). The
forkserver child needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes
the `is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.

Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with
a cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.

Found-via: `test_subint_forkserver_spawn_basic` hitting
the `enable_modules` assert on child-side runtime boot.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 19:01:27 -04:00
Gud Boi 43bd6a6410 Wire `subint_forkserver` as first-class backend
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.

Deats,
- new `subint_forkserver_proc()` spawn target in
  `_subint_forkserver`:
  - mirrors `trio_proc()`'s supervision model — real OS
    subprocess so `Portal.cancel_actor()` + `soft_kill()`
    on graceful teardown, `os.kill(SIGKILL)` on hard-reap
    (no `_interpreters.destroy()` race to fuss over bc the
    child lives in its own process)
  - only real diff from `trio_proc` is the spawn mechanism:
    fork from a main-interp worker thread via
    `fork_from_worker_thread()` (off-loaded to trio's
    thread pool) instead of `trio.lowlevel.open_process()`
  - child-side `_child_target` closure runs
    `tractor._child._actor_child_main()` with
    `spawn_method='trio'` — the child is a regular trio
    actor, "subint_forkserver" names how the parent
    spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
  shim around a raw OS pid: `.poll()` via
  `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
  cache thread, `.kill()` via `SIGKILL`, `.returncode`
  cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
  are `None` (fork-w/o-exec inherits parent FDs; we don't
  marshal them) which matches `soft_kill()`'s `is not None`
  guards

Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.

Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:49:23 -04:00
Gud Boi 5bd5f957d3 Add `subint_forkserver` PEP 684 audit-plan doc
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.

Deats,
- three reasons enumerated for the current constraints:
  - class-A GIL-starvation — **fixed** by isolated mode:
    subints don't share main's GIL so abandoned-thread
    contention disappears
  - destroy race / tstate-recycling from `subint_proc` —
    **unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
    cross-mode, so isolated doesn't obviously fix it;
    needs empirical retest on py3.14 + isolated API
  - fork-from-main-interp-tstate (the CPython-level
    `_PyInterpreterState_DeleteExceptMain` gate) — the
    narrow reason for using a dedicated thread; **probably
    fixed** IF the destroy-race also resolves (bc trio's
    cache threads never drove subints → clean main-interp
    tstate)
- TL;DR table of which constraints unwind under each
  resolution branch
- four-step audit plan for when `msgspec#563` lands:
  - flip `_subint` to isolated mode
  - empirical destroy-race retest
  - audit `_subint_forkserver.py` — drop `non_trio`
    qualifier / maybe inline primitives
  - doc fallout — close the three `subint_*_issue.md`
    siblings w/ post-mortem notes

Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:18:30 -04:00
Gud Boi 1d4867e51c Add trio-parent tests for `_subint_forkserver`
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.

Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
  for the GIL-hostage safety reason doc'd in
  `ai/conc-anal/subint_sigint_starvation_issue.md`:
  - `test_fork_from_worker_thread_via_trio` — parent-side
    plumbing baseline. `trio.run()` off-loads forkserver
    prims via `trio.to_thread.run_sync()` + asserts the
    child reaps cleanly
  - `test_fork_and_run_trio_in_child` — end-to-end: forked
    child calls `run_subint_in_worker_thread()` with a
    bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
  `dump_on_hang()` for post-mortem if the outer
  `pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
  drive the primitives directly rather than going through
  tractor's spawn-method registry (which the forkserver
  isn't plugged into yet)

Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.

Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 18:00:06 -04:00
Gud Boi 372a0f3247 Lift fork prims into `_subint_forkserver` mod
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.

Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
  - `fork_from_worker_thread(child_target, thread_name)` —
    spawn a main-interp `threading.Thread`, call `os.fork()`
    from it, shuttle the child pid back to main via a pipe
  - `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
    create a fresh subint + drive `_interpreters.exec()` on
    a dedicated worker thread running the `bootstrap` str
    (typically imports `trio`, defines an async entry, calls
    `trio.run()`)
  - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
    pass/fail classification reusable from harness AND the
    eventual real spawn path
- feature-gated py3.14+ via the public
  `concurrent.interpreters` presence check; matches the gate
  in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
  (cross-refs `_subint_fork` stub + the two `conc-anal/`
  docs) and status: EXPERIMENTAL, not yet registered in
  `_spawn._methods`

Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).

Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 17:03:15 -04:00
Gud Boi 37cfaa202b Add CPython-level `subint_fork` workaround smoketest
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.

Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.

Deats,
- four scenarios via `--scenario`:
  - `control_subint_thread_fork` — the KNOWN-BROKEN case as a
    harness sanity; if the child DOESN'T abort, our analysis
    is wrong
  - `main_thread_fork` — baseline sanity, must always succeed
  - `worker_thread_fork` — architectural assertion: regular
    `threading.Thread` attached to main interp calls
    `os.fork()`; child should survive post-fork cleanup
  - `full_architecture` — end-to-end: fork from a main-interp
    worker thread, then in child create a subint driving a
    worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
  "child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
  `os.waitpid()` of the parent + per-scenario status prints to
  observe the child's fate

Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).

Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 16:40:52 -04:00
Gud Boi 797f57ce7b Doc `subint_fork` as blocked by CPython post-fork
Empirical finding: the WIP `subint_fork_proc` scaffold
landed in `cf0e3e6f` does *not* work on current CPython.
The `fork()` syscall succeeds in the parent, but the
CHILD aborts immediately during
`PyOS_AfterFork_Child()` →
`_PyInterpreterState_DeleteExceptMain()`, which gates
on the current tstate belonging to the main interp —
the child dies with `Fatal Python error: not main
interpreter`.

CPython devs acknowledge the fragility with an in-source
comment (`// Ideally we could guarantee tstate is running
main.`) but expose no user-facing hook to satisfy the
precondition — so the strategy is structurally dead until
upstream changes.

Rather than delete the scaffold, reshape it into a
documented dead-end so the next person with this idea
lands on the reason rather than rediscovering the same
CPython-level refusal.

Deats,
- Move `subint_fork_proc` out of `tractor.spawn._subint`
  into a new `tractor.spawn._subint_fork` dedicated
  module (153 LOC). Module + fn docstrings now describe
  the blockage directly; the fn body is trimmed to a
  `NotImplementedError` pointing at the analysis doc —
  no more dead-code `bootstrap` sketch bloating
  `_subint.py`.
- `_spawn.py`: keep `'subint_fork'` in `SpawnMethodKey`
  + the `_methods` dispatch so
  `--spawn-backend=subint_fork` routes to a clean
  `NotImplementedError` rather than "invalid backend";
  comment calls out the blockage. Collapse the duplicate
  py3.14 feature-gate in `try_set_start_method()` into a
  combined `case 'subint' | 'subint_fork':` arm.
- New 337-line analysis:
  `ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
  Annotated walkthrough from the user-visible fatal
  error down to the specific `Modules/posixmodule.c` +
  `Python/pystate.c` source lines enforcing the refusal,
  plus an upstream-report draft.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 16:02:01 -04:00
Gud Boi cf0e3e6f8b Add WIP `subint_fork_proc` backend scaffold
Experimental third spawn backend: use a fresh
sub-interpreter purely as a trio-free launchpad from
which to `os.fork()` + exec back into
`python -m tractor._child`. Per issue #379's
"fork()-workaround/hacks" thread.

Intent is to sidestep both,
- the trio+fork hazards hitting `trio_proc` (python- trio/trio#1614 et
  al.), since the forking interp is guaranteed trio-free.

- the shared-GIL abandoned-thread hazards hitting `subint_proc`
  (`ai/conc-anal/subint_sigint_starvation_issue.md`), since we don't
  *stay* in the subint — it only lives long enough to call `os.fork()`

Downstream of the fork+exec, all the existing `trio_proc` plumbing is
reused verbatim: `ipc_server.wait_for_peer()`, `SpawnSpec`, `Portal`
yield, soft-kill.

Status: NOT wired up beyond scaffolding. The fn raises
`NotImplementedError` immediately; the `bootstrap` fork/exec string
builder and the `# TODO: orchestrate driver thread` block are kept
in-tree as deliberate dead code so the next iteration starts from
a concrete shape rather than a blank page.

Docstring calls out three open questions that need
empirical validation before wiring this up:
1. Does CPython permit `os.fork()` from a non-main
   legacy subint?
2. Can the child stay fork-without-exec and
   `trio.run()` directly from within the launchpad
   subint?
3. How do `signal.set_wakeup_fd()` handlers and other
   process-global state interact when the forking
   thread is inside a subint?

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-22 13:32:39 -04:00
Gud Boi 99d70337b7 Mark `subint`-hanging tests with `skipon_spawn_backend`
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-21 21:33:15 -04:00
Gud Boi a617b52140 Add `skipon_spawn_backend` pytest marker
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.

Deats,
- `pytest_configure()` registers the marker via
  `addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
  each collected item with `item.iter_markers(
  name='skipon_spawn_backend')`, checks whether the
  active `--spawn-backend` appears in `mark.args`, and
  if so injects a concrete `pytest.mark.skip(
  reason=...)`. `iter_markers()` makes the decorator
  work at function, class, or module (`pytestmark =
  [...]`) scope transparently.
- First matching mark wins; default reason is
  `f'Borked on --spawn-backend={backend!r}'` if the
  caller doesn't supply one.

Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-21 21:24:51 -04:00
Gud Boi e3318a5483 Expand `subint` sigint-starvation hang catalog
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.

Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
  actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
  reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
  the grandchild persists in its abandoned driver thread. Often only
  manifests under full-suite runs (earlier tests seed the
  abandoned-thread pool).

- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
  all go through teardown on the multierror. Bounded hard-kills run in
  parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
  abandoned driver threads simultaneously alive, an extreme pressure
  multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
  (GIL round-robin IS giving main brief slices) before finally
  saturating with `EAGAIN`.

Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-21 17:42:37 -04:00
Gud Boi e2b1035ff0 Skip `test_stale_entry_is_deleted` hanger with `subint`s 2026-04-21 13:36:41 -04:00
Gud Boi 17f1ea5910 Add global 200s `pytest-timeout` 2026-04-21 13:33:34 -04:00
Gud Boi 9bd113154b Bump lock-file for `pytest-timeout` + 3.13 gated wheel-deps 2026-04-20 20:57:26 -04:00
Gud Boi 5ea5fb211d Wall-cap `subint` audit tests via `pytest-timeout`
Add a hard process-level wall-clock bound on the two
known-hanging subint-backend tests so an unattended
suite run can't wedge indefinitely in either of the
hang classes doc'd in `ai/conc-anal/`.

Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
  `@pytest.mark.timeout(3, method='thread')`. The
  `method='thread'` choice is deliberate —
  `method='signal'` routes via `SIGALRM` which is
  starved by the same GIL-hostage path that drops
  `SIGINT` (see `subint_sigint_starvation_issue.md`),
  so it'd never actually fire in the starvation case.
- `test_subint_non_checkpointing_child`: same
  decorator, same reasoning (defense-in-depth over
  the inner `trio.fail_after(15)`).

At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 20:45:56 -04:00
Gud Boi 489dc6d0cc Add prompt-io log for `subint` hang-class docs
Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.

`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.

Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:41:07 -04:00
Gud Boi 35796ec8ae Doc `subint` backend hang classes + arm `dump_on_hang`
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un- skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:41:07 -04:00
Gud Boi baf7ec54ac Add `subint` cancellation + hard-kill test audit
Lock in the escape-hatch machinery added to `tractor.spawn._subint`
during the Phase B.2/B.3 bringup (issue #379) so future stdlib
regressions or our own refactors don't silently re-introduce the
mid-suite hangs.

Deats,
- `test_subint_happy_teardown`: baseline — spawn a subactor, one portal
  RPC, clean teardown. If this breaks, something's wrong unrelated to
  the hard-kill shields.
- `test_subint_non_checkpointing_child`: cancel a subactor stuck in
  a non-checkpointing Python loop (`threading.Event.wait()` releases the
  GIL but never inserts a trio checkpoint). Validates the bounded-shield
  + daemon-driver-thread combo abandons the thread after
    `_HARD_KILL_TIMEOUT`.

Every test is wrapped in `trio.fail_after()` for a deterministic
per-test wall-clock ceiling (an unbounded audit would defeat itself) and
arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump
— pytest's stderr capture swallows `faulthandler` output by default.

Gated via `pytest.importorskip('concurrent.interpreters')` and
a module-level skip when `--spawn-backend` isn't `'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:41:07 -04:00
Gud Boi 79390a4e3a Raise `subint` floor to py3.14 and split dep-groups
The private `_interpreters` C module ships since 3.13, but that vintage
wedges under our `threading.Thread` + multi-trio usage pattern
—> `_interpreters.exec()` silently never makes progress. 3.14 fixes it.
So gate on the presence of the public `concurrent.interpreters` wrapper
(3.14+ only) even tho we still call into the private module at runtime.

Deats,
- `try_set_start_method('subint')` error msg + `_subint` module
  docstring/comments rewritten to document the 3.14 floor and why 3.13
  can't work.
- `_subint._has_subints` gate now imports `concurrent.interpreters` (not
  `_interpreters`) as the version sentinel.

Also, reshuffle `pyproject.toml` deps into
per-python-version `[tool.uv.dependency-groups]`:
- `subints` group: `msgspec>=0.21.0`, py>=3.14
- `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14
- `sync_pause` group: `greenback`, py>=3.13,<3.14
  (was in `devx`; moved out bc no 3.14 yet)

Bump top-level `msgspec>=0.20.0` too.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:41:03 -04:00
Gud Boi dbe2e8bd82 Add `._debug_hangs` to `.devx` for hang triage
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
there" session has them on the shelf instead of reinventing from
scratch.

Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
  `faulthandler.dump_traceback_later()`. Critical gotcha baked in:
  dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
  capture silently eats the output and you can spend an hour
  convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
  logging per-block `(threading.active_count(),
  len(_interpreters.list_all()))` deltas; quickly rules out
  leak-accumulation theories when a suite progressively worsens (if
  counts don't grow, it's not a leak, look for a race on shared
  cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
  a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
  in by importing into a `conftest.py`. Kept as a factory (not a
  bare fixture) so callers own `autouse` / `writer` wiring

Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
  commit `26fb820`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:39:32 -04:00
Gud Boi fe753f9656 Bound subint teardown shields with hard-kill timeout
Unbounded `trio.CancelScope(shield=True)` at the
soft-kill and thread-join sites can wedge the parent
trio loop indefinitely when a stuck subint ignores
portal-cancel (e.g. bc the IPC channel is already
broken).

Deats,
- add `_HARD_KILL_TIMEOUT` (3s) module-level const
- wrap both shield sites with
  `trio.move_on_after()` so we abandon a stuck
  subint after the deadline
- flip driver thread to `daemon=True` so proc-exit
  also isn't blocked by a wedged subint
- pass `abandon_on_cancel=True` to
  `trio.to_thread.run_sync(driver_thread.join)`
  — load-bearing for `move_on_after` to actually
  fire
- log warnings when either timeout triggers
- improve `InterpreterError` log msg to explain
  the abandoned-thread scenario

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:39:32 -04:00
Gud Boi b48503085f Add prompt-IO log for subint destroy-race fix
Log the `claude-opus-4-7` session that produced
the `_subint.py` dedicated-thread fix (`26fb8206`).
Substantive bc the patch was entirely AI-generated;
raw log also preserves the CPython-internals
research informing Phase B.3 hard-kill work.

Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:39:32 -04:00
Gud Boi 3869a9b468 Fix subint destroy race via dedicated OS thread
`trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on
a cached worker thread — and when that thread is returned to the
cache after the subint's `trio.run()` exits, CPython still keeps
the subint's tstate attached to the (now idle) worker. Result: the
teardown `_interpreters.destroy(interp_id)` in the `finally` block
can block the parent's trio loop indefinitely, waiting for a tstate
release that only happens when the worker either picks up a new job
or exits.

Manifested as intermittent mid-suite hangs under
`--spawn-backend=subint` — caught by a
`faulthandler.dump_traceback_later()` showing the main thread stuck
in `_interpreters.destroy()` at `_subint.py:293` with only an idle
trio-cache worker as the other live thread.

Deats,
- drive the subint on a plain `threading.Thread` (not
  `trio.to_thread`) so the OS thread truly exits after
  `_interpreters.exec()` returns, releasing tstate and unblocking
  destroy
- signal `subint_exited.set()` back to the parent trio loop from
  the driver thread via `trio.from_thread.run_sync(...,
  trio_token=...)` — capture the token at `subint_proc` entry
- swallow `trio.RunFinishedError` in that signal path for the case
  where parent trio has already exited (proc teardown)
- in the teardown `finally`, off-load the sync
  `driver_thread.join()` to `trio.to_thread.run_sync` (cache thread
  w/ no subint tstate → safe) so we actually wait for the driver to
  exit before `_interpreters.destroy()`

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:39:32 -04:00
Gud Boi c2d8e967aa Doc the `_interpreters` private-API choice in `_subint`
Expand the comment block above the `_interpreters`
import explaining *why* we use the private C mod
over `concurrent.interpreters`: the public API only
exposes PEP 734's `'isolated'` config which breaks
`msgspec` (missing PEP 684 slot). Add reference
links to PEP 734, PEP 684, cpython sources, and
the msgspec upstream tracker (jcrist/msgspec#563).

Also,
- update error msgs in both `_spawn.py` and
  `_subint.py` to say "3.13+" (matching the actual
  `_interpreters` availability) instead of "3.14+".
- tweak the mod docstring to reflect py3.13+
  availability via the private C module.

Review: PR #444 (copilot-pull-request-reviewer)
https://github.com/goodboy/tractor/pull/444

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:39:32 -04:00
Gud Boi 1037c05eaf Avoid skip `.ipc._ringbuf` import when no `cffi` 2026-04-20 16:39:32 -04:00
Gud Boi 959fc48806 Impl min-viable `subint` spawn backend (B.2)
Replace the B.1 scaffold stub w/ a working spawn
flow driving PEP 734 sub-interpreters on dedicated
OS threads.

Deats,
- use private `_interpreters` C mod (not the public
  `concurrent.interpreters` API) to get `'legacy'`
  subint config — avoids PEP 684 C-ext compat
  issues w/ `msgspec` and other deps missing the
  `Py_mod_multiple_interpreters` slot
- bootstrap subint via code-string calling new
  `_actor_child_main()` from `_child.py` (shared
  entry for both CLI and subint backends)
- drive subint lifetime on an OS thread using
  `trio.to_thread.run_sync(_interpreters.exec, ..)`
- full supervision lifecycle mirrors `trio_proc`:
  `ipc_server.wait_for_peer()` → send `SpawnSpec`
  → yield `Portal` via `task_status.started()`
- graceful shutdown awaits the subint's inner
  `trio.run()` completing; cancel path sends
  `portal.cancel_actor()` then waits for thread
  join before `_interpreters.destroy()`

Also,
- extract `_actor_child_main()` from `_child.py`
  `__main__` block as callable entry shape bc the
  subint needs it for code-string bootstrap
- add `"subint"` to the `_runtime.py` spawn-method
  check so child accepts `SpawnSpec` over IPC

Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:37:21 -04:00
Gud Boi 2e9dbc5b12 Handle py3.14+ incompats as test skips
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
  stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
  wheeled yet
  * on nixos the sdist build was failing due to lack of `g++` which
    i don't care to figure out rn since we don't need `.devx` stuff
    immediately for this subints prototype.
  * [ ] we still need to adjust any dependent suites to skip.

Adjust `test_ringbuf` to skip on import failure.

Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
2026-04-20 16:30:46 -04:00
Gud Boi 7cee62ce42 Add `'subint'` spawn backend scaffold (#379)
Land the scaffolding for a future sub-interpreter (PEP 734
`concurrent.interpreters`) actor spawn backend per issue #379. The
spawn flow itself is not yet implemented; `subint_proc()` raises a
placeholder `NotImplementedError` pointing at the tracking issue —
this commit only wires up the registry, the py-version gate, and
the harness.

Deats,
- bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and
  list the `3.14` classifier — the new stdlib
  `concurrent.interpreters` module only ships on 3.14
- extend `SpawnMethodKey = Literal[..., 'subint']`
- `try_set_start_method('subint')` grows a new `match` arm that
  feature-detects the stdlib module and raises `RuntimeError` with
  a clear banner on py<3.14
- `_methods` registers the new `subint_proc()` via the same
  bottom-of-module late-import pattern used for `._trio` / `._mp`

Also,
- new `tractor/spawn/_subint.py` — top-level `try: from concurrent
  import interpreters` guards `_has_subints: bool`; `subint_proc()`
  signature mirrors `trio_proc`/`mp_proc` so the Phase B.2 impl can
  drop in without touching the registry
- re-add `import sys` to `_spawn.py` (needed for the py-version msg
  in the gate-error)
- `_testing.pytest.pytest_configure` wraps `try_set_start_method()`
  in a `pytest.UsageError` handler so `--spawn-backend=subint` on
  py<3.14 prints a clean banner instead of a traceback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:09:43 -04:00
Gud Boi 3773ad8b77 Pin `xonsh` to GH `main` in editable mode 2026-04-20 16:09:43 -04:00
Gud Boi 73e83e1474 Bump `xonsh` to latest pre `0.23` release 2026-04-20 16:08:05 -04:00
Gud Boi 355beac082 Expand `/run-tests` venv pre-flight to cover all cases
Rework section 3 from a worktree-only check into a
structured 3-step flow: detect active venv, interpret
results (Case A: active, B: none, C: worktree), then
run import + collection checks.

Deats,
- Case B prompts via `AskUserQuestion` when no venv
  is detected, offering `uv sync` or manual activate
- add `uv run` fallback section for envs where venv
  activation isn't practical
- new allowed-tools: `uv run python`, `uv run pytest`,
  `uv pip show`, `AskUserQuestion`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:08:05 -04:00
Gud Boi 263249029b Add `lastfailed` cache inspection to `/run-tests` skill
New "Inspect last failures" section reads the pytest
`lastfailed` cache JSON directly — instant, no
collection overhead, and filters to `tests/`-prefixed
entries to avoid stale junk paths.

Also,
- add `jq` tool permission for `.pytest_cache/` files

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:08:05 -04:00
Gud Boi 292179d8f3 Reorganize `.gitignore` by skill/purpose
Group `.claude/` ignores per-skill instead of a
flat list: `ai.skillz` symlinks, `/open-wkt`,
`/code-review-changes`, `/pr-msg`, `/commit-msg`.
Add missing symlink entries (`yt-url-lookup` ->
`resolve-conflicts`, `inter-skill-review`). Drop
stale `Claude worktrees` section (already covered
by `.claude/wkts/`).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-20 16:08:05 -04:00
Gud Boi 9a360fbbcd Ignore notes & snippets subdirs in `git` 2026-04-20 16:08:05 -04:00
10 changed files with 25 additions and 636 deletions

View File

@ -229,69 +229,3 @@ Unlike asyncio, trio allows checkpoints in
that does `await` can itself be cancelled (e.g.
by nursery shutdown). Watch for cleanup code that
assumes it will run to completion.
### Unbounded waits in cleanup paths
Any `await <event>.wait()` in a teardown path is
a latent deadlock unless the event's setter is
GUARANTEED to fire. If the setter depends on
external state (peer disconnects, child process
exit, subsequent task completion) that itself
depends on the current task's progress, you have
a mutual wait.
Rule: **bound every `await X.wait()` in cleanup
paths with `trio.move_on_after()`** unless you
can prove the setter is unconditionally reachable
from the state at the await site. Concrete recent
example: `ipc_server.wait_for_no_more_peers()` in
`async_main`'s finally (see
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
"probe iteration 3") — it was unbounded, and when
one peer-handler was stuck the wait-for-no-more-
peers event never fired, deadlocking the whole
actor-tree teardown cascade.
### The capture-pipe-fill hang pattern (grep this first)
When investigating any hang in the test suite
**especially under fork-based backends**, first
check whether the hang reproduces under `pytest
-s` (`--capture=no`). If `-s` makes it go away
you're not looking at a trio concurrency bug —
you're looking at a Linux pipe-buffer fill.
Mechanism: pytest replaces fds 1,2 with pipe
write-ends. Fork-child subactors inherit those
fds. High-volume error-log tracebacks (cancel
cascade spew) fill the 64KB pipe buffer. Child
`write()` blocks. Child can't exit. Parent's
`waitpid`/pidfd wait blocks. Deadlock cascades up
the tree.
Pre-existing guards in `tests/conftest.py` encode
this knowledge — grep these BEFORE blaming
concurrency:
```python
# tests/conftest.py:258
if loglevel in ('trace', 'debug'):
# XXX: too much logging will lock up the subproc (smh)
loglevel: str = 'info'
# tests/conftest.py:316
# can lock up on the `_io.BufferedReader` and hang..
stderr: str = proc.stderr.read().decode()
```
Full post-mortem +
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
for the canonical reproduction. Cost several
investigation sessions before catching it —
because the capture-pipe symptom was masked by
deeper cascade-deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume → capture-pipe finally
surfaced. Grep-note for future-self: **if a
multi-subproc tractor test hangs, `pytest -s`
first, conc-anal second.**

View File

@ -451,73 +451,3 @@ by your changes — note them and move on.
**Rule of thumb**: if a test fails with `TooSlowError`,
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
touch the relevant code path, it's flaky — skip it.
## 9. The pytest-capture hang pattern (CHECK THIS FIRST)
**Symptom:** a tractor test hangs indefinitely under
default `pytest` but passes instantly when you add
`-s` (`--capture=no`).
**Cause:** tractor subactors (especially under fork-
based backends) inherit pytest's stdout/stderr
capture pipes via fds 1,2. Under high-volume error
logging (e.g. multi-level cancel cascade, nested
`run_in_actor` failures, anything triggering
`RemoteActorError` + `ExceptionGroup` traceback
spew), the **64KB Linux pipe buffer fills** faster
than pytest drains it. Subactor writes block → can't
finish exit → parent's `waitpid`/pidfd wait blocks →
deadlock cascades up the tree.
**Pre-existing guards in the tractor harness** that
encode this same knowledge — grep these FIRST
before spelunking:
- `tests/conftest.py:258-260` (in the `daemon`
fixture): `# XXX: too much logging will lock up
the subproc (smh)` — downgrades `trace`/`debug`
loglevel to `info` to prevent the hang.
- `tests/conftest.py:316`: `# can lock up on the
_io.BufferedReader and hang..` — noted on the
`proc.stderr.read()` post-SIGINT.
**Debug recipe (in priority order):**
1. **Try `-s` first.** If the hang disappears with
`pytest -s`, you've confirmed it's capture-pipe
fill. Skip spelunking.
2. **Lower the loglevel.** Default `--ll=error` on
this project; if you've bumped it to `debug` /
`info`, try dropping back. Each log level
multiplies pipe-pressure under fault cascades.
3. **If you MUST use default capture + high log
volume**, redirect subactor stdout/stderr in the
child prelude (e.g.
`tractor.spawn._subint_forkserver._child_target`
post-`_close_inherited_fds`) to `/dev/null` or a
file.
**Signature tells you it's THIS bug (vs. a real
code hang):**
- Multi-actor test under fork-based backend
(`subint_forkserver`, eventually `trio_proc` too
under enough log volume).
- Multiple `RemoteActorError` / `ExceptionGroup`
tracebacks in the error path.
- Test passes with `-s` in the 5-10s range, hangs
past pytest-timeout (usually 30+ s) without `-s`.
- Subactor processes visible via `pgrep -af
subint-forkserv` or similar after the hang —
they're alive but blocked on `write()` to an
inherited stdout fd.
**Historical reference:** this deadlock cost a
multi-session investigation (4 genuine cascade
fixes landed along the way) that only surfaced the
capture-pipe issue AFTER the deeper fixes let the
tree actually tear down enough to produce pipe-
filling log volume. Full post-mortem in
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
Lesson codified here so future-me grep-finds the
workaround before digging.

View File

@ -395,432 +395,6 @@ Candidate follow-up experiments:
re-raise means it should still exit. Unless
something higher up swallows it.
### Attempted fix (DID NOT work) — hypothesis (3)
Tried: in `_serve_ipc_eps` finally, after closing
listeners, also iterate `server._peers` and
sync-close each peer channel's underlying stream
socket fd:
```python
for _uid, _chans in list(server._peers.items()):
for _chan in _chans:
try:
_stream = _chan._transport.stream if _chan._transport else None
if _stream is not None:
_stream.socket.close() # sync fd close
except (AttributeError, OSError):
pass
```
Theory: closing the socket fd from outside the stuck
recv task would make the recv see EBADF /
ClosedResourceError and unblock.
Result: `test_nested_multierrors[subint_forkserver]`
still hangs identically. Either:
- The sync `socket.close()` doesn't propagate into
trio's in-flight `recv_some()` the way I expected
(trio may hold an internal reference that keeps the
fd open even after an external close), or
- The stuck recv isn't even the root blocker and the
peer handlers never reach the finally for some
reason I haven't understood yet.
Either way, the sync-close hypothesis is **ruled
out**. Reverted the experiment, restored the skip-
mark on the test.
### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior
Tested explicitly: both with and without `-s`, the
test hangs identically. So the capture-pipe-fill
hypothesis is **ruled out** for this test.
The earlier `test_context_stream_semantics.py` `-s`
observation was most likely caused by a competing
pytest run in my session (confirmed via process list
— my leftover pytest was alive at that time and
could have been holding state on the default
registry port).
## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER
**New diagnostic run** instrumented
`handle_stream_from_peer` at ENTER / `except
trio.Cancelled:` / finally, plus `Actor.cancel()`
just before `self._parent_chan_cs.cancel()`. Result:
- **40 `handle_stream_from_peer` ENTERs**.
- **0 `except trio.Cancelled:` hits** — cancel
never fires on any peer-handler.
- **35 finally hits** — those handlers exit via
peer-initiated EOF (normal return), NOT cancel.
- **5 handlers never reach finally** — stuck forever.
- **`Actor.cancel()` fired in 12 PIDs** — but the
PIDs with peer handlers that DIDN'T fire
Actor.cancel are exactly **root + 2 direct
spawners**. These 3 actors have peer handlers
(for their own subactors) that stay stuck because
**`Actor.cancel()` at these levels never runs**.
### The actual deadlock shape
`Actor.cancel()` lives in
`open_root_actor.__aexit__` / `async_main` teardown.
That only runs when the enclosing `async with
tractor.open_nursery()` exits. The nursery's
`__aexit__` calls the backend `*_proc` spawn target's
teardown, which does `soft_kill() →
_ForkedProc.wait()` on its child PID. That wait is
trio-cancellable via pidfd now (good) — but nothing
CANCELS it because the outer scope only cancels when
`Actor.cancel()` runs, which only runs when the
nursery completes, which waits on the child.
It's a **multi-level mutual wait**:
```
root blocks on spawner.wait()
spawner blocks on grandchild.wait()
grandchild blocks on errorer.wait()
errorer Actor.cancel() ran, but process
may not have fully exited yet
(something in root_tn holding on?)
```
Each level waits for the level below. The bottom
level (errorer) reaches Actor.cancel(), but its
process may not fully exit — meaning its pidfd
doesn't go readable, meaning the grandchild's
waitpid doesn't return, meaning the grandchild's
nursery doesn't unwind, etc. all the way up.
### Refined question
**Why does an errorer process not exit after its
`Actor.cancel()` completes?**
Possibilities:
1. `_parent_chan_cs.cancel()` fires (shielded
parent-chan loop unshielded), but the task is
stuck INSIDE the shielded loop's recv in a way
that cancel still can't break.
2. After `Actor.cancel()` returns, `async_main`
still has other tasks in `root_tn` waiting for
something that never arrives (e.g. outbound
IPC reply delivery).
3. The `os._exit(rc)` in `_worker` (at
`_subint_forkserver.py`) doesn't run because
`_child_target` never returns.
Next-session candidate probes (in priority order):
1. **Instrument `_worker`'s fork-child branch** to
confirm whether `child_target()` returns (and
thus `os._exit(rc)` is reached) for errorer
PIDs. If yes → process should die; if no →
trace back into `_actor_child_main` /
`_trio_main` / `async_main` to find the stuck
spot.
2. **Instrument `async_main`'s final unwind** to
see which await in the teardown doesn't
complete.
3. **Compare under `trio_proc` backend** at the
same `_worker`-equivalent level to see where
the flows diverge.
### Rule-out: NOT a stuck peer-chan recv
Earlier hypothesis was that the 5 stuck peer-chan
loops were blocked on a socket recv that cancel
couldn't interrupt. This pass revealed the real
cause: cancel **never reaches those tasks** because
their owning actor's `Actor.cancel()` never runs.
The recvs are fine — they're just parked because
nothing is telling them to stop.
## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`
Yet another instrumentation pass — this time
printing at:
- `_worker` child branch: `pre child_target()` /
`child_target RETURNED rc=N` / `about to
os._exit(rc)`
- `_trio_main`: `about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`
**Fresh-run results** (`test_nested_multierrors[
subint_forkserver]`, depth=1/breadth=2, 1 root + 14
forked = 15 actors total):
- **9 processes completed the full flow**
`trio.run RETURNED NORMALLY` → `child_target
RETURNED rc=0` → `about to os._exit(0)`. These
are the LEAVES of the tree (errorer actors) plus
their direct parents (depth-0 spawners). They
actually exit their processes.
- **5 processes are stuck INSIDE `trio.run(trio_main)`**
— they hit "about to trio.run" but NEVER see
"trio.run RETURNED NORMALLY". These are root +
top-level spawners + one intermediate.
**What this means:** `async_main` itself is the
deadlock holder, not the peer-channel loops.
Specifically, the outer `async with root_tn:` in
`async_main` never exits for the 5 stuck actors.
Their `trio.run` never returns → `_trio_main`
catch/finally never runs → `_worker` never reaches
`os._exit(rc)` → the PROCESS never dies → its
parent's `_ForkedProc.wait()` blocks → parent's
nursery hangs → parent's `async_main` hangs → ...
### The new precise question
**What task in the 5 stuck actors' `async_main`
never completes?** Candidates:
1. The shielded parent-chan `process_messages`
task in `root_tn` — but we explicitly cancel it
via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
However, `Actor.cancel()` only runs during
`open_root_actor.__aexit__`, which itself runs
only after `async_main`'s outer unwind — which
doesn't happen. So the shield isn't broken.
2. `await actor_nursery._join_procs.wait()` or
similar in the inline backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that
actually DID exit — but the pidfd_open watch
didn't fire for some reason (race between
pidfd_open and the child exiting?).
The most specific next probe: **add DIAG around
`_ForkedProc.wait()` enter/exit** to see whether
the pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
NEVER returns despite its child exiting, the
pidfd mechanism has a race bug under nested
forkserver.
Alternative probe: instrument `async_main`'s outer
nursery exits to find which nursery's `__aexit__`
is stuck, drilling down from `trio.run` to the
specific `async with` that never completes.
### Cascade summary (updated tree view)
```
ROOT (pytest) STUCK in trio.run
├── top_0 (spawner, d=1) STUCK in trio.run
│ ├── spawner_0_d1_0 (d=0) exited (os._exit 0)
│ │ ├── errorer_0_0 exited (os._exit 0)
│ │ └── errorer_0_1 exited (os._exit 0)
│ └── spawner_0_d1_1 (d=0) exited (os._exit 0)
│ ├── errorer_0_2 exited (os._exit 0)
│ └── errorer_0_3 exited (os._exit 0)
└── top_1 (spawner, d=1) STUCK in trio.run
├── spawner_1_d1_0 (d=0) STUCK in trio.run (sibling race?)
│ ├── errorer_1_0 exited
│ └── errorer_1_1 exited
└── spawner_1_d1_1 (d=0) STUCK in trio.run
├── errorer_1_2 exited
└── errorer_1_3 exited
```
Grandchildren (d=0 spawners) exit OR stick —
asymmetric. Not purely depth-determined. Some race
condition in nursery teardown when multiple
siblings error simultaneously.
## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
Further DIAGDEBUG at every milestone in `async_main`
(runtime UP / EXITED service_tn / EXITED root_tn /
FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
ENTER/RETURNED per-pidfd. Result:
**Every stuck actor reaches `async_main: FINALLY
ENTER` but NOT `async_main: RETURNING`.**
That isolates the hang to a specific await in
`async_main`'s finally block at
`tractor/runtime/_runtime.py:1837+`. The suspect:
```python
# Ensure all peers (actors connected to us as clients) are finished
if ipc_server := actor.ipc_server and ipc_server.has_peers(check_chans=True):
...
await ipc_server.wait_for_no_more_peers() # ← UNBOUNDED, blocks forever
```
`_no_more_peers` is an `Event` set only when
`server._peers` empties (see
`ipc/_server.py:526-530`). If ANY peer-handler is
stuck (the 5 unclosed loops from the earlier pass),
it keeps its channel in `server._peers`, so the
event never fires, so the wait hangs.
### Applied fix (partial, landed as defensive-in-depth)
`tractor/runtime/_runtime.py:1981`
`wait_for_no_more_peers()` call now wrapped in
`trio.move_on_after(3.0)` + a warning log when the
timeout fires. Commented with the full rationale.
**Verified:** with this fix, ALL 15 actors reach
`async_main: RETURNING` cleanly (up from 10/15
reaching end before).
**Unfortunately:** the test still hangs past 45s
total — meaning there's YET ANOTHER unbounded wait
downstream of `async_main`. The bounded
`wait_for_no_more_peers` unblocks one level, but
the cascade has another level above it.
### Candidates for the remaining hang
1. `open_root_actor`'s own finally / post-
`async_main` flow in `_root.py` — specifically
`await actor.cancel(None)` which has its own
internal waits.
2. The `trio.run()` itself doesn't return even
after the root task completes because trio's
nursery still has background tasks running.
3. Maybe `_serve_ipc_eps`'s finally has an await
that blocks when peers aren't clearing.
### Current stance
- Defensive `wait_for_no_more_peers` bound landed
(good hygiene regardless). Revealing a real
deadlock-avoidance gap in tractor's cleanup.
- Test still hangs → skip-mark restored on
`test_nested_multierrors[subint_forkserver]`.
- The full chain of unbounded waits needs another
session of drilling, probably at
`open_root_actor` / `actor.cancel` level.
### Summary of this investigation's wins
1. **FD hygiene fix** (`_close_inherited_fds`) —
correct, closed orphan-SIGINT sibling issue.
2. **pidfd-based `_ForkedProc.wait`** — cancellable,
matches trio_proc pattern.
3. **`_parent_chan_cs` wiring** —
`Actor.cancel()` now breaks the shielded parent-
chan `process_messages` loop.
4. **`wait_for_no_more_peers` bounded** —
prevents the actor-level finally hang.
5. **Ruled-out hypotheses:** tree-kill missing
(wrong), stuck socket recv (wrong).
6. **Pinpointed remaining unknown:** at least one
more unbounded wait in the teardown cascade
above `async_main`. Concrete candidates
enumerated above.
## Update — 2026-04-23 (VERY late): pytest capture pipe IS the final gate
After landing fixes 1-4 and instrumenting every
layer down to `tractor_test`'s `trio.run(_main)`:
**Empirical result: with `pytest -s` the test PASSES
in 6.20s.** Without `-s` (default `--capture=fd`) it
hangs forever.
DIAG timeline for the root pytest PID (with `-s`
implied from later verification):
```
tractor_test: about to trio.run(_main)
open_root_actor: async_main task started, yielding to test body
_main: about to await wrapped test fn
_main: wrapped RETURNED cleanly ← test body completed!
open_root_actor: about to actor.cancel(None)
Actor.cancel ENTER req_chan=False
Actor.cancel RETURN
open_root_actor: actor.cancel RETURNED
open_root_actor: outer FINALLY
open_root_actor: finally END (returning from ctxmgr)
tractor_test: trio.run FINALLY (returned or raised) ← trio.run fully returned!
```
`trio.run()` fully returns. The test body itself
completes successfully (pytest.raises absorbed the
expected `BaseExceptionGroup`). What blocks is
**pytest's own stdout/stderr capture** — under
`--capture=fd` default, pytest replaces the parent
process's fd 1,2 with pipe write-ends it's reading
from. Fork children inherit those pipe fds
(because `_close_inherited_fds` correctly preserves
stdio). High-volume subactor error-log tracebacks
(7+ actors each logging multiple
`RemoteActorError`/`ExceptionGroup` tracebacks on
the error-propagation cascade) fill the 64KB Linux
pipe buffer. Subactor writes block. Subactor can't
progress. Process doesn't exit. Parent's
`_ForkedProc.wait` (now pidfd-based and
cancellable, but nothing's cancelling here since
the test body already completed) keeps the pipe
reader alive... but pytest isn't draining its end
fast enough because test-teardown/fixture-cleanup
is in progress.
**Actually** the exact mechanism is slightly
different: pytest's capture fixture MIGHT be
actively reading, but faster-than-writer subactors
overflow its internal buffer. Or pytest might be
blocked itself on the finalization step.
Either way, `-s` conclusively fixes it.
### Why I ruled this out earlier (and shouldn't have)
Earlier in this investigation I tested
`test_nested_multierrors` with/without `-s` and
both hung. That's because AT THAT TIME, fixes 1-4
weren't all in place yet. The test was hanging at
multiple deeper levels long before reaching the
"generate lots of error-log output" phase. Once
the cascade actually tore down cleanly, enough
output was produced to hit the capture-pipe limit.
**Classic order-of-operations mistake in
debugging:** ruling something out too early based
on a test that was actually failing for a
different reason.
### Fix direction (next session)
Redirect subactor stdout/stderr to `/dev/null` (or
a session-scoped log file) in the fork-child
prelude, right after `_close_inherited_fds()`. This
severs the inherited pytest-capture pipes and lets
subactor output flow elsewhere. Under normal
production use (non-pytest), stdout/stderr would
be the TTY — we'd want to keep that. So the
redirect should be conditional or opt-in via the
`child_sigint`/proc_kwargs flag family.
Alternative: document as a gotcha and recommend
`pytest -s` for any tests using the
`subint_forkserver` backend with multi-level actor
trees. Simpler, user-visible, no code change.
### Current state
- Skip-mark on `test_nested_multierrors[subint_forkserver]`
restored with reason pointing here.
- Test confirmed passing with `-s` after all 4
cascade fixes applied.
- The 4 cascade fixes are NOT wasted — they're
correct hardening regardless of the capture-pipe
issue, AND without them we'd never reach the
"actually produces enough output to fill the
pipe" state.
## Stopgap (landed)
`test_nested_multierrors` skip-marked under

View File

@ -24,7 +24,7 @@ Part of this work should include,
is happening under the hood with how cpython implements subints.
* default configuration should encourage state isolation as with
subprocs, but explicit public escape hatches to enable rigorously
with subprocs, but explicit public escape hatches to enable rigorously
managed shm channels for high performance apps.
- all tests should be (able to be) parameterized to use the new

View File

@ -446,23 +446,15 @@ def _process_alive(pid: int) -> bool:
return False
# Regressed back to xfail: previously passed after the
# fork-child FD-hygiene fix in `_close_inherited_fds()`,
# but the recent `wait_for_no_more_peers(move_on_after=3.0)`
# bound in `async_main`'s teardown added up to 3s to the
# orphan subactor's exit timeline, pushing it past the
# test's 10s poll window. Real fix requires making the
# bounded wait faster when the actor is orphaned, or
# increasing the test's poll window. See tracker doc
# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`.
@pytest.mark.xfail(
strict=True,
reason=(
'Regressed to xfail after `wait_for_no_more_peers` '
'bound added ~3s teardown latency. Needs either '
'faster orphan-side teardown or 15s test poll window.'
),
)
# NOTE: was previously `@pytest.mark.xfail(strict=True, ...)`
# for the orphan-SIGINT hang documented in
# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
# — now passes after the fork-child FD-hygiene fix in
# `tractor.spawn._subint_forkserver._close_inherited_fds()`:
# closing all inherited FDs (including the parent's IPC
# listener + trio-epoll + wakeup-pipe FDs) lets the child's
# trio event loop respond cleanly to external SIGINT.
# Leaving the test in place as a regression guard.
@pytest.mark.timeout(
30,
method='thread',

View File

@ -455,16 +455,14 @@ async def spawn_and_error(
@pytest.mark.skipon_spawn_backend(
'subint_forkserver',
reason=(
'Passes cleanly with `pytest -s` (no stdout capture) '
'but hangs under default `--capture=fd` due to '
'pytest-capture-pipe buffer fill from high-volume '
'subactor error-log traceback output inherited via fds '
'1,2 in fork children. Fix direction: redirect subactor '
'stdout/stderr to `/dev/null` in `_child_target` / '
'`_actor_child_main` so forkserver children don\'t hold '
'pytest\'s capture pipe open. See `ai/conc-anal/'
'Multi-level fork-spawn cancel cascade hang — '
'peer-channel `process_messages` loops do not '
'exit on `service_tn.cancel_scope.cancel()`. '
'See `ai/conc-anal/'
'subint_forkserver_test_cancellation_leak_issue.md` '
'"Update — pytest capture pipe is the final gate".'
'for the full diagnosis + candidate fix directions. '
'Drop this mark once the peer-chan-loop exit issue '
'is closed.'
),
)
@pytest.mark.timeout(

View File

@ -1973,25 +1973,7 @@ async def async_main(
f' {pformat(ipc_server._peers)}'
)
log.runtime(teardown_report)
# NOTE: bound the peer-clear wait — otherwise if any
# peer-channel handler is stuck (e.g. never got its
# cancel propagated due to a runtime bug), this wait
# blocks forever and deadlocks the whole actor-tree
# teardown cascade. 3s is enough for any graceful
# cancel-ack round-trip; beyond that we're in bug
# territory and need to proceed with local teardown
# so the parent's `_ForkedProc.wait()` can unblock.
# See `ai/conc-anal/
# subint_forkserver_test_cancellation_leak_issue.md`
# for the full diagnosis.
with trio.move_on_after(3.0) as _peers_cs:
await ipc_server.wait_for_no_more_peers()
if _peers_cs.cancelled_caught:
teardown_report += (
f'-> TIMED OUT waiting for peers to clear '
f'({len(ipc_server._peers)} still connected)\n'
)
log.warning(teardown_report)
teardown_report += (
'-]> all peer channels are complete.\n'

View File

@ -252,13 +252,9 @@ def _close_inherited_fds(
os.close(fd)
closed += 1
except OSError:
# fd was already closed (race with listdir) or otherwise
# unclosable — either is fine.
log.exception(
f'Failed to close inherited fd in child ??\n'
f'{fd!r}\n'
)
# fd was already closed (race with listdir) or
# otherwise unclosable — either is fine.
pass
return closed
@ -405,17 +401,11 @@ def fork_from_worker_thread(
try:
os.close(rfd)
except OSError:
log.exception(
f'Failed to close PID-pipe read-fd in parent ??\n'
f'{rfd!r}\n'
)
pass
try:
os.close(wfd)
except OSError:
log.exception(
f'Failed to close PID-pipe write-fd in parent ??\n'
f'{wfd!r}\n'
)
pass
raise RuntimeError(
f'subint-forkserver worker thread '
f'{thread_name!r} did not return within '
@ -485,13 +475,6 @@ def run_subint_in_worker_thread(
_interpreters.exec(interp_id, bootstrap)
except BaseException as e:
err = e
log.exception(
f'Failed to .exec() in subint ??\n'
f'_interpreters.exec(\n'
f' interp_id={interp_id!r},\n'
f' bootstrap={bootstrap!r},\n'
f') => {err!r}\n'
)
worker: threading.Thread = threading.Thread(
target=_drive,

View File

@ -246,11 +246,7 @@ async def trio_proc(
await proc.wait()
await debug.maybe_wait_for_debugger(
# NOTE: use the child's `_runtime_vars`
# (the fn-arg dict shipped via `SpawnSpec`)
# — NOT `get_runtime_vars()` which returns
# the *parent's* live runtime state.
child_in_debug=_runtime_vars.get(
child_in_debug=get_runtime_vars().get(
'_debug_mode', False
),
header_msg=(