tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	79dda4cb4a	Mk `test_no_runtime()` not require `pytest-trio`	2026-05-13 20:43:22 -04:00
Gud Boi	bd07a95d80	Use trace CM helpers in `test_dynamic_pub_sub` Replace inline `trio.fail_after` + manual `signal.alarm` guard with the `_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy diag snapshot to disk on timeout. Deats, - inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM, captures on `TooSlowError`). - outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync CM, captures on `SIGALRM`), only armed under fork backends. Extracts `_run_and_match()` helper to keep branching clean. - bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes while diag harness accumulates evidence. - drop `_DIAG_CAP_S` var + manual signal import (now internal to `afk_alarm_w_trace`). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 20:39:37 -04:00
Gud Boi	32955db02e	Harden `test_cancellation` for fork-spawner backends Deats, - `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`, `track_orphaned_uds_per_test`, `detect_runaway_subactors_per_test` fixtures - `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add MTF `xfail(strict=False)` with detailed race-window comment explaining the BEG shape mismatch, wrap body in `fail_after_w_trace` with per-backend timeout budget, bump `@tractor_test(timeout=10)`, drop old multiprocessing depth special-casing - `test_multierror_fast_nursery`: wrap in `fail_after_w_trace(30.0)`, accept `TooSlowError` in `pytest.raises`, surface explicit `pytest.fail` on hang - `test_cancel_while_childs_child_in_sync_sleep`: swap `spawn_backend` param for `is_forking_spawner`, widen `fail_after` delay for fork-based spawners - `test_remote_error`, `test_multierror`, `test_cancel_infinite_streamer`, `test_some_cancels_all`: add `set_fork_aware_capture` fixture param - Drop commented-out per-test `skipon_spawn_backend` blocks (now covered by module-level `pytestmark`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 20:10:02 -04:00
Gud Boi	5372fd178a	Add snapshot evidence to cancel-cascade MTF issue doc Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 20:02:02 -04:00
Gud Boi	01ce2857ea	Add init-adopted orphan reap to `reap_subactors_per_test` Post-yield now also reaps init-adopted (`ppid==1`) tractor procs that appeared during the test — leaked subactors whose mid-tier parent died during cascade teardown, reparenting them to init. Pre-yield snapshot of existing orphans scopes reap to THIS test's leaks only, avoiding reap of unrelated tractor uses (piker, etc.) on the box. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 19:59:36 -04:00
Gud Boi	8de684f5de	Add subtree-walk to `reap()` for full actor-tree teardown `reap(include_descendants=True)` now expands each orphan-root pid into its full psutil subtree before delivering SIGINT, so a multi-level leaked actor-tree gets torn down in a single pass instead of requiring repeated calls (each pass kills the current `ppid==1` level, the level below becomes init-adopted, etc.). Falls back to the original flat `pids` list when `psutil` is unavailable. Emits a log line when expansion adds descendant pids. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 19:53:25 -04:00
Gud Boi	fb87c36263	Add hang-snapshot session index to pytest summary - `_testing/trace.py`: add `_SNAPSHOT_INDEX` session-scoped list populated by `_do_capture_snapshot()` on each successful dump; add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode - `_testing/pytest.py`: add `pytest_terminal_summary` hook that prints all captured snapshot dirs at end-of-session so paths don't get buried in scrollback (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 19:00:18 -04:00
Gud Boi	e329c3108c	Bump to latest `pytest` release!	2026-05-13 18:47:19 -04:00
Gud Boi	3a243a1fd4	Add stray-proc scan + refine `_testing.trace` capture Deats, - `_find_tractor_strays()`: scan `/proc/*/cmdline` for `tractor._child` procs NOT in the walk's `seen` set — surfaces ghost subactor trees from prior test runs (cross-test launchpad contamination). - `dump_proc_tree(include_strays=True)`: refactor classification into `_classify_walk()` closure, walk stray roots as additional trees, emit stray-root summary in header. Also: `tractor._child` procs reparented to init are now always classified as orphans regardless of cgroup-slice (leaked subactor ≠ desktop-launched app). - `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest `--capture=sys` redirection so snapshot paths always land on the real terminal - `fail_after_w_trace()`: capture diag snapshot on non-`TooSlowError` exceptions when the `fail_after` scope's cancel had already fired (e.g. nursery wraps `Cancelled` into a `BaseExceptionGroup` that escapes before `TooSlowError` can be raised). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 18:46:04 -04:00
Gud Boi	7509e313ff	Mv core impl `tractor_diag.xsh` to `_testing.trace` Extract all pure-Python diagnostic helpers (`dump_proc_tree`, `dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`, `ensure_sudo_cached`, etc.) from the xonsh xontrib into a new `tractor/_testing/trace.py` module so the same logic is callable from both the `acli.*` terminal aliases AND in-test capture-on-hang fixtures. Deats, - `_testing/trace.py`: new module (1171 lines) — proc-tree walker, hung-state dumper, bindspace scanner, `dump_all()` snapshot archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM (trio `fail_after` + auto-snapshot on `TooSlowError`), `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on `SIGALRM`), plus pytest fixture wrappers for both. - `_testing/pytest.py`: re-export the two fixtures via `from .trace import` so pytest plugin-discovery picks them up. - `tractor_diag.xsh`: thin terminal wrappers that import from `_testing.trace` — drops ~627 lines of inline impl. Add `acli.dump_all` alias for full snapshot-bundle CLI access. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 16:47:17 -04:00
Gud Boi	7ee0dc2e8f	Harden `test_infected_asyncio` for fork spawners Deats, - `test_echoserver_detailed_mechanics`: add `is_forking_spawner` param, wrap `main()` in `fa_main()` with per-backend `trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade teardown that compounds under forkserver. - `test_sigint_closes_lifetime_stack`: swap `start_method` param for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so KBI firing before `open_context` body doesn't `UnboundLocalError`, add `pytest.fail` guard for the spawn-time IPC race case, arm `signal.alarm` AFK-safety cap (10s) under fork backends Also, - `pytestmark`: add `track_orphaned_uds_per_test` + `detect_runaway_subactors_per_test` fixtures. - `delay()`: hardcode `return 1e3` at top (debug override still in place). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 15:56:35 -04:00
Gud Boi	b10011a36e	Adjust `test_streaming_to_actor_cluster` timeout For forking spawner backends that is. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 15:47:36 -04:00
Gud Boi	7d0a53d205	Enrich `pytestmark` in `test_inter_peer_cancellation` - `skipon_spawn_backend('subint')`: expand reason with specific analysis doc refs + GH issue #379 umbrella link. - add `track_orphaned_uds_per_test` fixture via `usefixtures` to blame-attribute UDS sock-file orphans left by SIGKILL cancel cascades. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 12:28:17 -04:00
Gud Boi	75d5b4cf7b	Adjust `test_simple_context` timeout for forking spawner (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 12:03:58 -04:00
Gud Boi	8aa07a7932	Add `set_fork_aware_capture`, timeout to msg tests - `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with `trio.fail_after(2)` + commented `capfd.disabled()` investigation (pytest#14444). - `test_basic_payload_spec`: add fixture param with note on fork-spawner hang prevention. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 11:59:37 -04:00
Gud Boi	10db117864	Add signal-alarm guard to `test_dynamic_pub_sub` Outer `signal.alarm` cap that fires even when trio's `fail_after` is blocked by a shielded-await deadlock (the bug-class-3 hang under MTF backends). Only armed for fork-based spawners where the bug lives. Deats, - `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the trio-native guard so it always loses when the in-band path works. - `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the last-fired breadcrumb names the swallow point on hang. - try/finally wrapping around each scope level for deterministic breadcrumb emission. - add `is_forking_spawner`, `set_fork_aware_capture` fixture params. - rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12). Also, - `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add TODO re `pytest.raises()`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 11:43:17 -04:00
Gud Boi	83b6a3373a	Fix `is_forking_spawner` fixture to call helper fn (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 11:20:17 -04:00
Gud Boi	9bbb6f796b	Add ppid-aware liveness buckets to `bindspace_scan` Split the old `live`/`orphans` sock classification into three ppid-aware buckets: `live-active` (PID alive, parent owns it), `orphaned-alive` (PID alive but `ppid==1`, init-adopted — `acli.reap` candidate), and `orphaned-dead` (PID gone, sock stale). Deats, - new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent PID, handles the tricky `(comm)` field (can contain spaces/parens) by splitting from last `)`. - live-active rows now show `(ppid=<N>)` for ctx. - orphaned-alive rows flagged `(adopted by init)`. - cleanup suggestion: `acli.reap --uds` for both alive-orphan graceful cancel + dead-sock cleanup in one shot; manual `rm` kept as fallback. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 10:14:04 -04:00
Gud Boi	a24600f1a7	Add `main_thread_forkserver` CI matrix rows Add `capture` dimension to CI matrix so fork-based backends run `--capture=sys` (fork-child × `--capture=fd` is a known deadlock). Non-fork backends keep `fd`. Deats, - two `include:` rows for `main_thread_forkserver` on linux py3.13: tcp + uds, both `capture: 'sys'` - job name updated to show `capture=` mode - timeout bumped 16 -> 20 min to accommodate the additional matrix cells - `--capture=${{ matrix.capture }}` replaces hardcoded `--capture=fd` (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 10:10:27 -04:00
Gud Boi	92443dc4ef	Add boot-race conc-anal, widen `xfail` to `n_dups=8` New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md` documenting the spawn-time rc=2 race under rapid same-name spawning against a forkserver + registrar — the `wait_for_peer_or_proc_death` helper now surfaces the death instead of parking forever on the handshake wait. Also, - extract inline `xfail` into module-level `_DOGGY_BOOT_RACE_XFAIL` marker. - apply it to `n_dups=8` too (previously bare) bc larger N widens the race window enough to fire occasionally. - link to tracking issue #456. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-13 09:45:45 -04:00
Gud Boi	d3cbc92751	Adjust legacy streaming test timeouts for fork+UDS Forking spawner + UDS transport has different timing vs `trio_proc` — streaming example completes faster in some cases, slower in others depending on fork overhead + sock setup. Deats, - add `expect_cancel` param to `cancel_after()`, raise `ActorTooSlowError` when cancel scope fires unexpectedly instead of silently returning `None`. - `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit `ActorTooSlowError` on `None` result instead of bare `assert results`. - `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast" (cancel doesn't fire bc streaming finishes before delay). - add `is_forking_spawner`, `tpt_proto` fixture params throughout. Also, - `_testing/pytest.py`: widen `start_method` parametrize and `is_forking_spawner` fixture to `scope='session'`. - `"""` -> `'''` docstring style throughout. - hoist `_non_linux` to module scope (was redefined locally in two places). - type hints, kwarg-style `partial()` calls. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-11 21:43:19 -04:00
Gud Boi	099104e0af	Add bare-name arg, `ss` hints to `bindspace_scan` `acli.bindspace_scan piker` now resolves `<name>` to `$XDG_RUNTIME_DIR/<name>` — useful for projects like `piker` that bind sibling sub-dirs alongside tractor's default. Full paths still work as-is. Also, - rename "unparseable" section to "non-tractor" with clearer desc (filename lacks `@<pid>` suffix) - print per-sock `ss -lpx 'src = <path>'` cmds for non-tractor socks so callers can manually resolve listener-PID liveness (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-11 20:34:07 -04:00
Gud Boi	abd3950ba6	Harden `test_registrar` with reap fixtures, timeouts Add module-level `pytestmark` applying per-test `reap_subactors_per_test`, `track_orphaned_uds_per_test`, and `detect_runaway_subactors_per_test` fixtures — registrar tests stress discovery roundtrips that historically left orphaned UDS sock-files. Deats, - drop unused `say_hello()` fn, keep only `say_hello_use_wait`; rename param `func` -> `ria_fn`. - use `@tractor_test(timeout=7)` instead of separate `@pytest.mark.timeout(7, method='thread')` decorator. - add `with_timeout()` helper, wire into `test_subactors_unregister_on_cancel_remote_daemon`. - uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use configurable `timeout` var + `debug_mode` guard for `tractor.pause()` on cancel. - `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`. - fix typo "oustanding" -> "outstanding". (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-11 20:24:41 -04:00
Gud Boi	7d1e4462d4	Adjust `subint_forkserver` docs to match stub impl Comment/docstring updates: `subint_forkserver` is a clean `NotImplementedError` stub — not an alias to variant-1 (`main_thread_forkserver`). Key reserved in-place (not aliased) so the subint-hosted-child impl can flip without API churn once jcrist/msgspec#1026 unblocks PEP 684 subints. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-08 02:51:21 -04:00
Gud Boi	522b57570b	Add `_is_tractor_subactor()`, cgroup-aware `ptree` Rework reap/diag tooling to identify tractor sub-actors via intrinsic proc signals — cmdline/comm markers from `setproctitle` — instead of env-var or cwd matching. Deats, - new `_is_tractor_subactor()` checks cmdline for `tractor[` / `tractor._child` markers, falls back to `/proc/<pid>/comm` for zombie-resilient detection (kernel preserves `comm` past exit until reap) - `_read_comm()` reads kernel per-task name set by `setproctitle()` — the zombie-safe ID signal - `_read_status_state()` reads single-letter proc state from `/proc/<pid>/status` (`Z` = zombie) - `find_orphans()` drops `repo_root` requirement, uses `_is_tractor_subactor()` for intrinsic sub-actor ID instead of cwd coincidence-matching - new `find_zombies()` with optional `parent_pid` filter for zombie-state sub-actors Also, - rename `pytree` -> `ptree` throughout xontrib - add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to distinguish `system.slice` services vs `user.slice` desktop apps from genuinely leaked orphans - `_ptree` classifies `ppid==1` procs into `system-slice`, `user-slice`, and `orphans` buckets with per-section output - `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes tractor importable from active venv (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-08 00:51:05 -04:00
Gud Boi	d60245777e	Add per-actor `setproctitle` via `devx._proctitle` New `tractor.devx._proctitle` mod sets each sub-actor's `argv[0]` (and kernel `comm`) to `tractor[<aid.reprol()>]` — e.g. `tractor[doggy@1027301b]` — so `ps`/`top`/`htop` and `acli.pytree`/reaper tooling can identify actors at a glance without parsing full cmdlines. Deats, - `set_actor_proctitle()` wraps the `setproctitle` pkg with `ImportError` guard; optional at runtime but listed in `pyproject.toml` so default installs benefit. - called early in `_child._actor_child_main()` after `Actor` construction, before `_trio_main()` entry. - tests in `tests/devx/test_proctitle.py`: format unit test, `/proc/{cmdline,comm}` integration test, negative detection test. Resolves #457 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-08 00:04:48 -04:00
Gud Boi	caebf60f4e	Add dup-name cancel-cascade escalation test Extend `test_register_duplicate_name` w/ cancel-level log breadcrumbs and `try/finally` for better diag on the cancel-cascade hang. Add `test_dup_name_cancel_cascade_escalates_to_hard_kill` as a regression test for the TCP+MTF duplicate-name cancel-cascade deadlock. Spawns N same-name actors, calls `an.cancel()`, and asserts teardown completes within a `trio.fail_after()` budget that scales w/ `n_dups`. Deats, - parametrize `n_dups` (2, 4, 8) to widen the race window for concurrent `register_actor` RPCs. - `n_dups=4` xfail'd — exposes a separate boot-race bug (doggy `rc=2` under rapid same-name spawn), tracked in #456. - post-teardown asserts all `Portal` chans disconnect, verifying hard-kill escalation worked. Relates to https://github.com/goodboy/tractor/issues/456 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 23:33:23 -04:00
Gud Boi	3b0724eba8	Add `wait_for_peer_or_proc_death()` to `_spawn` Race `IPCServer.wait_for_peer(uid)` against the sub-proc's `.wait()` inside a `trio` nursery; whichever completes first cancels the other. Prevents the spawning task from parking forever on an unsignalled `_peer_connected[uid]` event when a sub-actor dies during boot (e.g. crashed on import before reaching `_actor_child_main`). Instead of hanging, raises `ActorFailure` w/ the proc's exit code for clean supervisor error reporting. Also, - use the new racer in `main_thread_forkserver_proc()` spawn path. - keep `proc_wait` generic so each backend passes its own callable (`trio.Process.wait`, `_ForkedProc.wait`, etc.). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 22:18:29 -04:00
Gud Boi	cec6cc2a56	Add `acli.reap`, namespace `tractor_diag` cmds Group all xontrib aliases under an `acli.` prefix so xonsh prefix-completion treats them as a sub-cmd group — `acli.<TAB>` lists the full set. No parent `acli` cmd exists; the dot is purely naming. Renames (incl `-` -> `_` in suffixes for shell-identifier-friendliness): - `pytree` -> `acli.pytree` - `hung-dump` -> `acli.hung_dump` - `bindspace-scan` -> `acli.bindspace_scan` Add new `acli.reap` wrapping `scripts/tractor-reap`: Deats, - 3 opt-in phases via flags: 1. process reap — `find_orphans()` (default, PPid=1 + cwd=repo + cmdline `python`) or `find_descendants(--parent PID)`. SIGINT first, SIGKILL after `--grace` (def 3.0s). 2. `/dev/shm` sweep (`--shm`/`--shm-only`) — `find_orphaned_shm()` + `reap_shm()`. needed bc `tractor` disables `mp.resource_tracker`. 3. UDS sock-file sweep (`--uds`/`--uds-only`) — `find_orphaned_uds()` + `reap_uds()` for stale `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` entries. See #452. - `--dry-run` lists matches without signalling/unlinking; survivor pids or sweep errors flip the alias rc to `1`. - lazy-imports `tractor._testing._reap` after `git rev-parse --show-toplevel` (with `Path(__file__).parent.parent` fallback) so the contrib is loadable before the venv is on `sys.path`. - `argparse.SystemExit` on `-h`/bad-args is caught + returned as the alias rc instead of killing xonsh. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 18:07:34 -04:00
Gud Boi	34f333a026	Escalate cancel-ack timeouts to `proc.terminate()` Wires SC-discipline cancel-then-escalate into `ActorNursery.cancel()`: graceful cancel-req -> bounded wait -> hard-kill Deats, - add `raise_on_timeout: bool = False` kwarg to `Portal.cancel_actor()`. When `True`, bounded-wait expiry raises `ActorTooSlowError` instead of the legacy DEBUG-log + return-`False` path. Default stays `False` for callers that handle their own escalation (e.g. `_spawn.soft_kill()` polling `proc.poll()`). - add `_try_cancel_then_kill()` helper in `_supervise` used by per-child cancel tasks. On `ActorTooSlowError`, escalates via `proc.terminate()` (SIGTERM) so a non-acking sub doesn't park `soft_kill()` forever waiting on `proc.poll()`. - replace `tn.start_soon(portal.cancel_actor)` in `ActorNursery.cancel()` with the helper. Debug-mode bypass: skip escalation (fall back to legacy fire-and-forget cancel) when ANY of: - `Lock.ctx_in_debug is not None` (some actor is currently REPL-locked) - `_runtime_vars['_debug_mode']` (root opened with `debug_mode=True`). - `ActorNursery._at_least_one_child_in_debug` (per-child `debug_mode=` opt-in). ORing covers root-debug, child-debug, and active-REPL-lock cases without false-positively SIGTERM-ing a sub-tree proxying stdio for a REPL session. Motivated by the `subint_forkserver` dup-name hang where a same-named sibling subactor's cancel-RPC failed to ack within `Portal.cancel_timeout` (TCP+forkserver register-RPC contention) and the nursery `__aexit__` deadlocked. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 18:01:59 -04:00
Gud Boi	38ffb875bd	Add `ActorTooSlowError` for cancel-cascade timeouts Distinct from `trio.TooSlowError` so that existing `except trio.TooSlowError:` blocks don't silently mask actor-cancel timeouts — these must propagate to let a supervisor escalate to `proc.terminate()` per SC-discipline: graceful cancel-req -> bounded wait -> hard-kill Motivated by #subint_forkserver dup-name hang where `Portal.cancel_actor()` silently swallowed the timeout and the supervisor never escalated, leaving a same-named sibling subactor parked forever. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 16:39:10 -04:00
Gud Boi	4c00913b3b	Add `terminate()` to `_ForkedProc` Sends `SIGTERM` (graceful shutdown) instead of the existing `kill()` which sends `SIGKILL`. Mirrors the `trio.Process.terminate()` / `multiprocessing.Process.terminate()` interface. Used by `ActorNursery.cancel()`'s per-child escalation when `Portal.cancel_actor()` raises `ActorTooSlowError`, and by the legacy `hard_kill=True` branch. Swallows `ProcessLookupError` (child already dead) same as `kill()`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 16:35:18 -04:00
Gud Boi	5cd06810db	Tidy proto-guard `ValueError` fmt in `open_root_actor()` Pre-compute `mismatch_lines` str instead of `+`-concat inside the f-string raise site; slightly easier to read and avoids the `+ '\n\n'` continuation. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 16:24:23 -04:00
Gud Boi	255c9c3a7c	Mk `--capture` guard CI-aware w/ local warn Refactor `pytest_load_initial_conftests()` to split the fork-spawn × capture-mode check into two policies: - CI (`CI` env-var set): `pytest.exit(rc=2)` on mismatch — forces every matrix-row to declare `--capture=sys` explicitly. - local: `warnings.warn()` + continue — lets devs experiment with `--capture=fd` to validate fixes. Deats, - drop `_cap_fd_set` global; add `_CAPSYS_REQUIRED_SPAWNERS` frozenset for the spawner-name lookup - move inline comment wall → proper docstring w/ Background, Trade-off, Validation-policy sections - `maybe_xfail_for_spawner()` now takes `request: pytest.FixtureRequest` and reads `request.config.option.capture` instead of the `_cap_sys_passed_as_flag` global - recognize `tee-sys` as fork-safe (only `fd`-level capture deadlocks) - `set_fork_aware_capture()` returns the actual capture mode str from config, not a hardcoded `'sys'` - lift `import warnings` to module level (was duped inside `pytest_configure`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-07 16:17:13 -04:00
Gud Boi	0f4e671862	Add `--tree` flag and cross-bucket parent annos to `pytree` Extend `pytree` with two usability improvements: - `--tree`/`-t` opt-in flag emits a flat walk-order `## tree` section at the top preserving contiguous parent-child shape (no severity-grouping), so the full tree structure is visible without cross-ref'ing between severity buckets. - Cross-bucket parent annotation: when a row's parent (by ppid) lives in a different severity bucket, suffix with `[parent: <pid> (in `<bucket>`)]` so the `└─` marker resolves even when bucketing scatters parent/child into separate sections. Also, - split arg parsing into flag vs positional args. - add `pid_to_bucket` dict + `walk_order` list to back both features - rename inner `ppid` shadow to `ppid_str` to avoid collision with the outer `ppid` variable. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 19:04:55 -04:00
Gud Boi	d036ef7d7f	Add `enable_transports`/`registry_addrs` proto guard Raise `ValueError` from `open_root_actor()` when any `registry_addrs` entry uses a transport proto not in `enable_transports` — historically this caused a silent indefinite hang during the registrar handshake (the actor could never connect to register/discover). Also, - update `test_root_passes_tpt_to_sub` to detect a proto mismatch between parametrized `tpt_proto_key` and CLI `tpt_proto`, asserting the new guard raises `ValueError` with expected msg content. - replace old commented-out notes with a clearer explanation of the mismatch foot-gun. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 15:13:02 -04:00
Gud Boi	7882c37ce0	Add `RuntimeVars` env-var lift design plan Draft plan for consolidating pytest CLI flags, ad-hoc env vars, and hardcoded fixture defaults into the existing (but unused) `RuntimeVars` struct as the single source of truth. Deats, - `_rtvars.py` leaf mod w/ `dump`/`load`/`get`/`update` helpers using `str(dict)` + `ast.literal_eval` encoding - phased migration: test infra first, then runtime callers, then per-session bindspace - addresses concurrent pytest session collisions and subproc env propagation for `devx/` scripts (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 15:02:13 -04:00
Gud Boi	2ee44a6fdd	Fix shutdown deadlock on UDS unlink race Wrap `os.unlink()` in `close_listener()` with a `FileNotFoundError` guard — under concurrent pytest sessions the sock-file can already be reaped. Without this the raise aborts `_serve_ipc_eps`'s finally before `_shutdown.set()`, deadlocking `wait_for_shutdown()` on `actor.cancel()`. Also, - close each endpoint independently in the finally so one raise doesn't strand the rest. - always signal `_shutdown.set()` regardless of remaining ep count. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 14:11:51 -04:00
Gud Boi	7b14fdcd96	Add `tractor_diag`(nosis) xontrib with aliases Xonsh xontrib providing three diagnostic commands for tractor development / hang investigation: - `pytree <pid|pat>` — psutil-backed proc tree with severity-bucketed output (zombies > orphans > live), tree-depth markers, zombie-safe rendering. - `hung-dump <pid|pat>` — kernel `wchan`/`stack` + `py-spy dump --locals` per descendant, sudo-cred caching upfront, pgrep fallback when psutil absent. - `bindspace-scan [<dir>]` — scan UDS bindspace for orphaned `<name>@<pid>.sock` files whose binder pid is dead, emit `rm` one-liner for cleanup. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 14:07:24 -04:00
Gud Boi	e4953851de	Mk per-test reap fixtures opt-in Rename `_track_orphaned_uds_per_test` and `_detect_runaway_subactors_per_test` to public names (drop `_` prefix), drop `autouse=True`. Tests that need per-test reap blame now opt in via `pytestmark = pytest.mark.usefixtures(...)`. Also, - reduce `sample_interval` from 0.5 -> 0.05s so the CPU probe is cheaper per pid. - add empty-`only_pids` fast-path in `find_runaway_subactors` to skip psutil import when no descendants were spawned. - extract `new_pids` intermediate var for clarity. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 13:29:49 -04:00
Gud Boi	c4082be876	Mv `daemon` + `test_multi_program` to `discovery/` All `daemon` fixture consumers are discovery-protocol tests now living under `tests/discovery/`. Move the fixture, its `_wait_for_daemon_ready` helper, and `test_multi_program.py` into that subdir so scope matches usage. Also, - add `pytestmark` for `track_orphaned_uds_per_test` + `detect_runaway_subactors_per_test` to `test_multi_program` as regression net. - drop now-unused `_PROC_SPAWN_WAIT` + `socket` import from root conftest. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-06 13:23:42 -04:00
Gud Boi	ec8c4659c4	Replace sleep with active poll in `daemon` fixture First draft at resolving, https://github.com/goodboy/tractor/issues/424 `tests.conftest.py.daemon()` previously used a blind `time.sleep(_PROC_SPAWN_WAIT + uds_bonus + ci_bonus)` to "wait for the daemon to come up" before yielding the proc to the test. Two problems: 1. Racy under load — sleep is fixed at design time; loaded boxes / cold starts / fork-spawn cost spikes blow past it, leading to `ConnectionRefusedError` /`OSError: connect failed` flakes in `test_register_duplicate_name`. 2. Wasteful when daemon comes up fast — happy-path pays the FULL sleep regardless. ~3s of dead time per fixture invocation, ~10-20s per full suite run. Replace with `_wait_for_daemon_ready()` — active poll via stdlib `socket.create_connection` (TCP) or `socket.connect` (UDS) on the daemon's bind addr, with 50ms backoff and a 10s/15s deadline (CI gets extra headroom). Daemon-died-during-startup early-exit catches the case where `_PROC_SPAWN_WAIT` was silently masking daemon startup crashes. Why stdlib `socket` (Option 2 from the conc-anal doc) instead of `tractor`'s own `_root.ping_tpt_socket` closure or trio? - `tractor.run_daemon()` doesn't return from bootstrap until the runtime is fully ready to handle IPC, so probing listen-side acceptance is sufficient. - no need to do the full IPC handshake just to validate readiness. Sidesteps the `trio.run()` bootstrap cost (~50ms) per fixture too. `claude`'s verification: 10/10 runs of `tests/test_multi_program.py` pass on both `--tpt-proto=tcp` and `--tpt-proto=uds`. Per-test wall-time `test_register_duplicate_name`: 4.31s → 1.10s. Full file: ~12s → 3.27s per transport. Doc-tracked at: `ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md` Future work — session-scoped trio runtime in a bg thread to share fixture-side trio operations across many fixtures (currently overkill for the one fixture that needs it). 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 20:03:41 -04:00
Gud Boi	29f9928524	Add `test_register_duplicate_name` race analysis Document the intermittent connect-refused failure in the registrar daemon test — root cause is the `daemon` fixture's blind `time.sleep()` readiness gate racing against the subproc's `bind()`/ `listen()` completion. Distinct from the cancel- cascade `TooSlowError` flake class. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 20:01:08 -04:00
Gud Boi	086e9f2c07	Use single f-string per pid in runaway warning (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 19:58:11 -04:00
Gud Boi	9031605807	Harden `test_debugger` for forkserver spawners Use `is_forking_spawner` fixture + gate spawner- specific expect patterns in nested-error and daemon tests. Add `set_fork_aware_capture` to multi-sub tests that need capture-mode awareness. Deats, - replace `start_method` param with `is_forking_spawner` bool fixture. - bump inter-send delay to 0.1s for IPC stability under fork backends. - gate `bdb.BdbQuit` + relay-uid patterns behind `not is_forking_spawner` (not visible under capsys). - add `expect(child, EOF)` to confirm clean exit. - switch caught exc from `AssertionError` to `ValueError` in daemon test. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 19:21:49 -04:00
Gud Boi	c4885f9d99	Drop global mutation of `_PROC_SPAWN_WAIT` In top level `daemon`-fixture that is.. Use a local `bg_daemon_spawn_delay` instead of mutating the module-level `_PROC_SPAWN_WAIT` — previously each `daemon` fixture invocation would permanently add 1.6s (UDS) or 1s (CI) to the global, inflating delays across the session. Also, emit a `test_log.warning()` when verbose loglevel is silently reduced to `'info'`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 16:23:50 -04:00
Gud Boi	60ce713016	Add cancel-cascade `TooSlowError` flake analysis Document the ~0.3% rotating `trio.TooSlowError` flake under `--spawn-backend=main_thread_forkserver` full-suite runs. Root cause: `hard_kill`'s per-sub 1.6s graceful timeout compounding across N subactors in a cancel cascade, plus cumulative autouse-reaper teardown overhead. Covers symptom, observed flaking tests, root-cause family, ranked mitigations (cap bump -> CPU-count- aware cap -> `pytest-rerunfailures` -> `hard_kill` tuning -> targeted profiling), and a verification protocol. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 13:56:51 -04:00
Gud Boi	0ef549fadb	Add `tractor.trionics.patches` subpkg + first fix With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which can busy-loop due to lack of handling `EOF`. New `tractor.trionics.patches` subpkg housing defensive monkey-patches for upstream `trio` bugs we've encountered while running `tractor` — particularly as of recent, fork-survival edge cases that haven't been filed/fixed upstream yet. Each patch is idempotent, version-gated via `is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the upstream release whose adoption allows deletion. Subpkg layout + per-patch contract documented in `tractor/trionics/patches/README.md` — `apply()` / `is_needed()` / `repro()` API, registry pattern via `_PATCHES` in `__init__.py`, single-call entry point `apply_all()`. First patch, `_wakeup_socketpair`: - `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN). - under `fork()`-spawning backends the COW-inherited socketpair fds & `_close_inherited_fds()` teardown can leave a `WakeupSocketpair` instance whose write-end is closed, and `drain()` then spins forever in C with no Python checkpoints, - this obviously burns 100% CPU and no signal delivery. Standalone repro: from trio._core._wakeup_socketpair import WakeupSocketpair ws = WakeupSocketpair() ws.write_sock.close() ws.drain() # spins forever Patch is one-line — break the drain loop on b'' EOF. Manifested as two distinct test failures: - `tests/test_multi_program.py::test_register_duplicate_name` hung at 100% CPU on the busy-loop directly (fork child's worker thread) - `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`, both threads parked in `epoll_wait`, no TCP connect-back to parent ever happened. Same patch fixes both. 
Restored 99.7% pass rate on full suite under `--spawn-backend=main_thread_forkserver` (was hanging indefinitely before). Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE any trio runtime init. Harmless on non-fork backends. Conc-anal write-ups, including strace + py-spy evidence: - `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md` - `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md` Regression tests in `tests/trionics/test_patches.py`: each test asserts (a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b) the patch fixes it with a SIGALRM wall-clock cap so a regression hangs loud instead of silently. TODO: - [ ] file the upstream `python-trio/trio` issue + PR. - [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue body's evidence section. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 12:18:03 -04:00
Gud Boi	e9712dcaeb	Add `tractor.spawn._reap.unlink_uds_bind_addrs()` Inside a new `tractor.spawn._reap` submod which kicks off providing post-mortem subactor cleanup primitives, parent-side; consider it the "sibling" of `tractor._testing._reap` which is the test-harness-oriented brother mod. Today: `unlink_uds_bind_addrs()` provides a starter bug-fix for #454 where `hard_kill()`'s `SIGKILL` bypasses the subactor's `_serve_ipc_eps`-`finally:` `os.unlink(addr.sockpath)`, leaking `${XDG_RUNTIME_DIR}/tractor/<name>@<pid>.sock` files. This adds 2 cleanup paths: - explicit `bind_addrs` (when set at spawn time), OR - convention-based reconstruction from `subactor.aid.name + proc.pid` for the random-self-assign case. `.spawn.hard_kill()` now invokes the cleanup unconditionally post-`SIGKILL`; graceful-exit case is a no-op via `FileNotFoundError` skip. Future work — authoritative tracking via a per-process UDS bind-addr registry — documented in module docstring, deferred to a follow-up PR. Co-fix: `tractor/spawn/_trio.py::new_proc` already passes `bind_addrs` + `subactor` to `hard_kill` via prior work on this branch. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 11:13:59 -04:00
Gud Boi	5cf0312c78	Add per-test runaway-subactor CPU detector to `_reap` New `find_runaway_subactors()` helper + autouse `_detect_runaway_subactors_per_test` fixture that samples `psutil.cpu_percent()` on descendants to catch tight-loop bugs (e.g. #452-class `recvfrom` on a closed socket). Checks both at setup (leftovers from a prior hung test) and teardown (spawned by this test). Intentionally does NOT kill the runaway — emits a loud warning with diag commands (`strace`, `lsof`, `ss`, `kill`) so the pid stays alive for hands-on investigation. Session-end reaper still SIGINT/SIGKILL survivors on normal exit. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-05-04 10:15:55 -04:00
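
The busy-loop described in commit 0ef549fadb (a drain loop that exits only on `BlockingIOError` and never on an empty `recv()`) can be illustrated without `trio` at all. The following is a minimal sketch using a plain stdlib `socket.socketpair()`; the function names `drain_until_empty_or_eof` and `demo` are hypothetical and this is not the code of the actual `_wakeup_socketpair` patch, only the same idea of breaking on `b''`.

```python
import socket


def drain_until_empty_or_eof(sock: socket.socket) -> int:
    """Drain a non-blocking socket; return the number of bytes discarded."""
    drained = 0
    try:
        while True:
            data = sock.recv(2**16)
            if not data:
                # b'' means the peer closed its end (EOF). Without this
                # check the loop would keep reading b'' forever.
                break
            drained += len(data)
    except BlockingIOError:
        # Nothing left to read right now: the normal exit path.
        pass
    return drained


def demo() -> tuple[int, int]:
    # Case 1: peer still open with three bytes pending.
    rd, wr = socket.socketpair()
    rd.setblocking(False)
    wr.send(b"abc")
    open_peer = drain_until_empty_or_eof(rd)

    # Case 2: peer closed with nothing pending, so recv() returns b''.
    wr.close()
    closed_peer = drain_until_empty_or_eof(rd)
    rd.close()
    return open_peer, closed_peer
```

Calling `demo()` exercises both exit paths: the first drain ends on `BlockingIOError`, the second ends on EOF instead of spinning.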

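The active-poll readiness check described in commit ec8c4659c4 (replace a fixed sleep with repeated connect attempts, a 50ms backoff and a deadline) might look roughly like the sketch below. It covers only the TCP case via stdlib `socket.create_connection`; the name `wait_for_listener` and its parameters are assumptions for illustration, and the real `_wait_for_daemon_ready()` helper also handles UDS and a daemon-died-during-startup early exit, which are omitted here.

```python
import socket
import time


def wait_for_listener(
    host: str,
    port: int,
    deadline_s: float = 10.0,
    poll_s: float = 0.05,
) -> bool:
    """Return True once a TCP connect succeeds, False if the deadline passes."""
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        try:
            # A successful connect means the peer has reached listen().
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Refused or timed out: back off briefly, then retry.
            time.sleep(poll_s)
    return False
```

Compared with a blind sleep, the happy path returns as soon as the listener accepts, and a slow start is bounded by the deadline rather than by a guess made at design time.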
2705 Commits (subint_forkserver_backend)
