tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	6d94a67251	Fix `maybe_override_capture` to not get invalid capX fixture names.. (cherry picked from commit `32e89c67ee`)	2026-06-09 20:28:04 -04:00
Gud Boi	a4f8496498	Add fork-aware capture fixtures to `_testing.pytest` Extend the pytest plugin with helpers that detect and adapt to `--capture=sys` under fork-based spawners (`main_thread_forkserver`, `mp_forkserver`) where fd-capture causes hangs. Deats, - track `_cap_sys_passed_as_flag` + `_cap_fd_set` globals in `pytest_load_initial_conftests()`. - add `@pytest.hookimpl(tryfirst=True)` + re-parse args after appending `--capture=sys`. - `_is_forking_spawner()` predicate + fixture. - `maybe_xfail_for_spawner()` — enables skipping tests that need capsys but weren't passed `--capture=sys`. - `set_fork_aware_capture` fixture — returns the appropriate capture fixture per spawner backend based on `start_method: str` set via CLI. - wire `set_fork_aware_capture` into `tractor_test` wrapper's fixture injection. Also, - add `alert_on_finish` session fixture (terminal bell on completion; tho not sure it works fully..) - add `ids=` to `start_method` parametrize. - restore `default=False` on `--enable-stackscope`. - drop commented-out `--ll` option block; we will likely factor it to our plugin eventually however.. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `d549c72052`)	2026-06-09 20:28:04 -04:00
Gud Boi	f7c048e535	Adjust `test_shield_pause` for capsys backends Under `main_thread_forkserver` the bootstrapping hook switches to `--capture=sys`, so subactor fd-level output (tree dumps, zombie-reaper msgs) isn't captured per-test by pexpect. Gate those expects behind a `no_capfd` check so the test passes on both capture modes. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5a9926fc32`)	2026-06-09 20:28:04 -04:00
Gud Boi	4d8e67bd7f	Default `--ll` to `None` in test harness Only override `tractor.log._default_loglevel` when the flag is explicitly passed — lets per-spawn and per-example `loglevel` kwargs take effect instead of being clobbered by the hard-coded `'ERROR'` default. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `72a0465c52`)	2026-06-09 20:28:04 -04:00
Gud Boi	f18cb0e033	Update debug examples + harden `test_debugger` Pass explicit `loglevel` to `spawn()` calls in `test_debugger` tests — required for pexpect pattern matching now that examples no longer hard-code log levels. Also, - make `expect()` return the decoded `before` str. - add `start_method` param + fork-backend timeout slack (+4s) in nested-error test. - clean up debug examples: drop unused loglevels, rename `n` -> `an`, fix docstrings, add TODO comments for tpt parametrize via osenv. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `9431a81d37`)	2026-06-09 20:28:04 -04:00
Gud Boi	49a397d6d9	Update `sync_bp` + tighten `test_pause_from_sync` Add `disable_pdbp_color()` to the `sync_bp` example to suppress pygments prompt coloring when `PYTHON_COLORS=0` — makes pexpect pattern matching deterministic. Deats, - set `loglevel='pdb'` in both script + test spawn. - disable `enable_stack_on_sig` in example, assert no `stackscope` output in test. - update `attach_patts` keys/values with `\|_<Task` / `\|_<Thread` / `\|_('subactor'` prefixes to match actual tree-dump format. - add call-site patterns (`tractor.pause_from_sync()` `tractor.pause()`, `breakpoint(hide_tb=...)`). - trim trailing `\n` from `Lock.repr()` output. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `fc2e298a29`)	2026-06-09 20:28:04 -04:00
Gud Boi	81bfbcd095	Add `use_stackscope` runtime var for subactor init Track `stackscope` enablement in `RuntimeVars` so the flag propagates to subactors via the standard rtvar IPC path instead of relying solely on the `TRACTOR_ENABLE_STACKSCOPE` env var. Deats, - add `use_stackscope: bool` to `RuntimeVars` struct + defaults dict - `enable_stack_on_sig()` sets the rtvar on successful `stackscope` import, asserts unset on `ImportError` - nest stackscope init under `_debug_mode` gate in `Actor.async_main`, check rtvar alongside env var - defer `maybe_init_greenback` import to its own `use_greenback` branch (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `48523358cf`)	2026-06-09 20:28:04 -04:00
Gud Boi	0df90500fa	Fix `SIGUSR1` tree-dump ordering in `_stackscope` Factor the sub-actor relay loop out of `dump_tree_on_sig()` into `_relay_sig_to_subactors()` and chain both dump + relay in a single `run_sync_soon` callback (`_dump_then_relay`) so the parent's task-tree flushes BEFORE any sub receives the signal — fixes a hierarchical-ordering race where subs could dump ahead of the parent in the muxed pty stream. Also, - gate file/tty sink writes behind `write_file` + `write_tty` params on `dump_task_tree()`. - use `actor.aid.uid` instead of deprecated `.uid`. - update `test_shield_pause` expects to match the new sequential parent -> relay-log -> sub ordering. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e2b790a70d`)	2026-06-09 20:28:04 -04:00
Gud Boi	363d11b89c	Add `pytest_load_initial_conftests()` for `--capture=` Move `--capture=sys` enforcement from a static ini flag to a `pytest_load_initial_conftests()` bootstrap hook that dynamically flips capture mode only when a fork-based spawner (like `main_thread_forkserver`) is detected; non-fork backends keep `--capture=fd`. Also, - load `tractor._testing.pytest` via `-p` in ini (bc bootstrapping hooks must register before conftest `pytest_plugins` runs). - register `_reap` as sub-plugin via `pytest_plugins` tuple in `._testing.pytest`. - drop now-duplicate reap fixtures (already in `_reap` per `1cdc7fb3`). - rename `tractor_enable_stackscope` dest -> `enable_stackscope` and pop env var on disable. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `61d4525137`)	2026-06-09 20:28:04 -04:00
Gud Boi	ce6fada2b7	Add `--uds`/`--uds-only` flags to `tractor-reap` Wire up `find_orphaned_uds()` + `reap_uds()` from `_reap` as a new phase-3 UDS sweep in the CLI script. Opt-in via `--uds` (run after proc reap + shm) or `--uds-only` (skip other phases). Also, - consolidate skip-proc-reap logic into a single `skip_proc_reap` bool covering both `--shm-only` and `--uds-only` - extend header docstring + usage examples (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `0996a83655`)	2026-06-09 20:27:26 -04:00
Gud Boi	1bd0c3ab87	Add UDS orphan-sweep helpers + reap fixtures to `_reap` Extend the `_testing._reap` mod with UDS sock-file leak detection + cleanup, complementing the existing shm and subactor-process reaping: - `get_uds_dir()`, `_parse_uds_name()`, `find_orphaned_uds()`, `reap_uds()` — detect `<name>@<pid>.sock` files under `${XDG_RUNTIME_DIR}/tractor/` whose binder pid is dead (including the `1616` registry sentinel). - `_reap_orphaned_subactors` session-scoped autouse fixture: SIGINT lingering subactors, wait, SIGKILL survivors, then sweep orphaned UDS files. - `_track_orphaned_uds_per_test` fn-scoped autouse fixture: snapshot sock-file dir before/after each test, warn + reap new orphans to prevent cascade flakiness under `--tpt-proto=uds`. - `reap_subactors_per_test` opt-in fn-scoped fixture for modules with known-leaky teardown. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `1cdc7fb302`)	2026-06-09 20:27:26 -04:00
Gud Boi	d5b10b9e0c	Allow per-call `start_method`/`loglevel` overrides In `tests/devx/conftest.py::spawn`, refactor the fixture-internal closures so consumer tests can pass explicit `start_method`/`loglevel` to each `_spawn()` invocation rather than only inheriting the fixture-scoped parametrize values. Deats, - promote `set_spawn_method()` and `set_loglevel()` to take their respective values as fn params (vs closing over the fixture-scope vars). - give `_spawn()` `start_method=start_method` and `loglevel: str\|None = None` kwargs so callers override one-off without re-parametrizing the suite. NOTE: this drops the implicit fixture-scoped `loglevel` forward — `_spawn()` callers now must pass `loglevel=...` explicitly. - TODO: figure out how `--ll <level>` should map to the default (currently `None` → uses env-var or tractor default). - add a docstring to `_spawn()` so its role as the consumer-facing closure is obvious from `help()`. Also, - `assert_before()` now returns the `.before` output on success (was `None`); add a one-line docstring describing the new return contract. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `486249d74f`)	2026-06-09 20:27:26 -04:00
Gud Boi	6835391c22	Drop test-local timeouts, +`sync_pause` to dev In `pyproject.toml`, - include the `sync_pause` group from `dev`, so dev installs ship `greenback` for `pause_from_sync()`. Comment out per-test `@pytest.mark.timeout(...)` markers in, - `tests/devx/test_debugger.py` - `tests/discovery/test_registrar.py` - `tests/spawn/test_main_thread_forkserver.py` - `tests/spawn/test_subint_cancellation.py` - `tests/test_advanced_streaming.py` - `tests/test_cancellation.py` The global cap was already dropped (`3c366cac`); these were the leftover per-test caps which now block interactive `pdb` flows under the new spawn backends. In `uv.lock`, - pull `greenback` into the resolved `dev` deps (per the `sync_pause` include above). - catch up the prior `xonsh` editable→PyPI switch (from the `pyproject.toml` `tool.uv.sources` edit). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `b7115fc875`) (factored: dropped spawn-backend-only paths under tests/spawn/)	2026-06-09 20:27:26 -04:00
Gud Boi	fb409055bf	Honor `TRACTOR_LOGLEVEL`+`TRACTOR_SPAWN_METHOD` env-vars Add env-var overrides inside `._root.open_root_actor()` so devs/test-runs can swap the actor-spawn backend or crank console verbosity without touching application code. In `._root.open_root_actor()`, - read `TRACTOR_LOGLEVEL` early, overriding any caller-passed `loglevel` and stashing an `env_ll_report` to emit once the console log is set up. - pull the `loglevel` fallback (`or _default_loglevel`) and `log.get_console_log()` init up so the env-var report routes through tractor's own logger. - read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed `start_method` and warn-logging when the env-var clobbers an explicit caller value. Wire the same vars through `tests/devx/conftest.py::spawn`, - request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL` and `TRACTOR_SPAWN_METHOD` in `os.environ` before each `pexpect.spawn()` (inherited by the example subproc). - expand `supported_spawners` to include `main_thread_forkserver` and `subint_forkserver` bc example scripts no longer need per-script CLI plumbing. - pop both vars in fixture teardown so a leaked value can't re-route a later in-process tractor test's spawn-backend or loglevel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `208e7c0926`)	2026-06-09 20:27:26 -04:00
Gud Boi	781abf7558	Flip back to default `pytest` capture for CI (cherry picked from commit `22cdf15b73`)	2026-06-09 20:27:26 -04:00
Gud Boi	23b8a80e15	Add posix-multithreaded-`fork()` explainer doc (cherry picked from commit `532a9834f3`)	2026-06-09 20:27:26 -04:00
Gud Boi	d4e4062bbd	Add todo for running `test_debugger` suite on forkserver spawner (cherry picked from commit `2917b74ba4`)	2026-06-09 20:27:26 -04:00
Gud Boi	e3834f2d95	Route `stackscope` SIGUSR1 onto trio loop Signal handlers fire in a non-trio stack frame; calling `stackscope.extract(recurse_child_tasks=True)` from there only walks the `<init>` task and misses everything inside `async_main`'s nurseries — exactly the part you want to see during a hang. Fix: capture `trio.lowlevel.current_trio_token()` at `enable_stack_on_sig()` time and stash it as a module-level `_trio_token`. The SIGUSR1 handler then dispatches the dump onto the trio loop via `_trio_token.run_sync_soon(_safe_dump_task_tree)`, so `stackscope.extract` runs from a real trio-task context and walks the full nursery tree. Late-binding: pytest's `pytest_configure` calls `enable_stack_on_sig()` outside any `trio.run`, so token capture there is a `RuntimeError` — left at `None`. The runtime re-calls `enable_stack_on_sig()` from inside `async_main` (subactor side) where the token IS available, so subactors get the full-tree path. `dump_tree_on_sig` falls back to a direct call when `_trio_token is None` (parent process pre-trio.run, or signal delivered after `trio.run` returns). `_safe_dump_task_tree()` is a `run_sync_soon`-friendly wrapper that swallows any exception from `dump_task_tree()` — trio prints + crashes on uncaught exceptions in scheduled callbacks; better to log + keep the run alive so the user can re-trigger. Other, - emit `capture-bypass tee: <fpath>` line + `tail -f` hint in the rendered dump header so users know where to find the artifact even when stdio is captured. - swap the inline `f' \|_{actor}'` line for a `_pformat.nest_from_op` rendering of `actor_repr` (matches the rest of the runtime's nested-op style). - log lines on handler install + already-installed branches now note `(trio_token captured: <bool>)` so it's obvious from the log whether the full-tree path is wired. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `2d4995e08d`)	2026-06-09 20:27:26 -04:00
Gud Boi	b0bca7c81e	Add `--enable-stackscope` pytest plugin flag New `--enable-stackscope` CLI flag installs a SIGUSR1 → trio-task-tree-dump handler in pytest itself + every spawned subactor for live stack visibility during hang investigations. Lighter than `--tpdb` (no pdb machinery / tty-lock contention) — pure stack-only triage. Plumbing: - `_testing.pytest.pytest_addoption()` adds the flag. - `_testing.pytest.pytest_configure()` (when flag set): * exports `TRACTOR_ENABLE_STACKSCOPE=1` so fork-children inherit it via environ, * installs the handler in pytest itself via `enable_stack_on_sig()`. - `runtime._runtime.Actor.async_main()` extends the existing `_debug_mode` gate to ALSO fire when `TRACTOR_ENABLE_STACKSCOPE` is in env — so subactors install the same handler at runtime startup. Capture-bypass tee in `dump_task_tree()`: Pytest's default `--capture=fd` swallows `log.devx()` output, making SIGUSR1 dumps invisible right when you need them. Render the dump once to a `full_dump` str, then unconditionally tee to: - `/tmp/tractor-stackscope-<pid>.log` (append-mode, always written) — guaranteed-readable artifact even under CI / `nohup` / no-tty. `tail -f` to follow. - `/dev/tty` (best-effort) — pytest never captures the tty; ignored if device is missing. Other, - squelch the benign `RuntimeWarning` ("coroutine method 'asend'/'athrow' was never awaited") from `stackscope._glue`'s import-time async-gen type introspection so `--enable-stackscope` setup stays quiet. - log msg in the `_runtime` ImportError branch now mentions `--enable-stackscope` alongside debug-mode. Usage, pytest --enable-stackscope -k <hang-test> # in another shell, find the pid + signal: kill -USR1 <pytest-or-subactor-pid> # tail the artifact: tail -f /tmp/tractor-stackscope-<pid>.log (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `5418f2dc3c`)	2026-06-09 20:27:26 -04:00
Gud Boi	ff7acfcbd6	Backend-aware `fail_after` in pub/sub test Mirror `060f7d24`'s pattern (backend-aware timeout in `maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard `trio.fail_after` cap. Fork-based backends pay per-spawn fork+IPC-handshake cost which stacks over `cpus - 1` sequential `n.run_in_actor()` calls; empirically 12s flakes on `main_thread_forkserver` under UDS cross-pytest contention (#451 / #452). Defaults: - `main_thread_forkserver` → 30s - everything else → 12s (unchanged) Hoist the timeout-pick out of the `main()` closure so the dispatch happens once in the trio task rather than re-evaluating per spawn. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `383b0fdd75`)	2026-06-09 20:27:26 -04:00
Gud Boi	6f003d7efd	Backend-aware timeout in `maybe_expect_raises` Default `timeout` from `int = 3` → `int\|None = None`; when unset, pick a backend-aware value. Fork-based backends (`main_thread_forkserver`) need real headroom bc actor spawn + IPC ctx-exit + msg-validation error path is much heavier than under `trio` backend — especially under cross-pytest-stream contention (#451). Defaults: - `main_thread_forkserver` → 30s - everything else → 3s (unchanged) Empirical flake history that motivated 30s as the floor on fork backends (all from `test_basic_payload_spec`): - 3s → all-valid variant flaked w/ `TooSlowError` - 8s → `invalid-return` variant flaked w/ `Cancelled` (surfaced instead of `MsgTypeError` bc the outer `fail_after` fired mid-error-path) - 15s → flaked under cross-pytest-stream contention 30s gives plenty of headroom while still failing-loud on a genuine hang. Callers can opt out by passing an explicit `timeout=` kw. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `060f7d24c4`)	2026-06-09 20:27:26 -04:00
Gud Boi	bbf4fe66e3	Drop global `pytest-timeout` cap from `pyproject.toml` `timeout = 200` was firing via SIGALRM (the default `method='signal'`) which synchronously raises `Failed` in trio's main thread mid-`epoll.poll()`, abandoning trio's runner mid-flight and leaving `GLOBAL_RUN_CONTEXT` half-installed. EVERY subsequent `trio.run()` in the same pytest session then bails with `RuntimeError: Attempted to call run() from inside a run()`. Empirical impact: a session that hits a single 200s hang cascades into 30-40 false-positive failures across every downstream test file that uses `trio.run`. Recent UDS run saw 1 real timeout (`test_unregistered_err_still_relayed`) poison 38 sibling tests with cascade-fails — a debugging nightmare. Same architectural bug we already documented in `tests/test_advanced_streaming.py::test_dynamic_pub_sub` (see its module-level NOTE) — both `pytest-timeout` enforcement modes are incompatible with trio under fork-based spawn backends. Now scoped session-wide. For tests that legitimately need a wall-clock cap, the canonical pattern is `with trio.fail_after(N):` INSIDE the test — trio's own `Cancelled` machinery cleanly unwinds the actor nursery without disturbing global state. For CI: rely on job-level wall-clock timeouts (e.g. GitHub Actions `timeout-minutes`) to abort genuinely-stuck suites. `pyproject.toml` comment block spells this all out so a future contributor doesn't reach back for `timeout =` and re-introduce the bug. ALSO, bump `xonsh` to at least `0.23.0` release. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `3c366cac13`)	2026-06-09 20:27:26 -04:00
Gud Boi	28ad06be8c	Return parent `pid: int` from new `reap_subactors_per_test` fixture (cherry picked from commit `f8178df0fd`)	2026-06-09 20:27:26 -04:00
Gud Boi	2d9a95d13a	Use `trio.fail_after` cap in `test_dynamic_pub_sub` Drop `@pytest.mark.timeout(...)` for the per-test wall-clock cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)` inside `main()` instead. Both pytest-timeout enforcement modes are incompatible with trio under fork-based backends: - `method='signal'` (SIGALRM) synchronously raises `Failed` in trio's main thread mid-`epoll.poll()`, leaving `GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got abandoned") so EVERY subsequent `trio.run()` in the same pytest process bails with `RuntimeError: Attempted to call run() from inside a run()` — full-session poison. - `method='thread'` calls `_thread.interrupt_main()` which can let the KBI escape trio's `KIManager` under fork-cascade teardown races and bubble out of pytest entirely — kills the whole session. `trio.fail_after()` keeps cancellation inside the trio loop: - Raises `TooSlowError` cleanly through the open-nursery's cancel cascade. - Doesn't disturb any out-of-band signal/thread state. - Failure stays scoped to the single test — no cross-test global state corruption either way. Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub` go from 5/10 fail (with global-state poison) to 3/10 fail (no poison, all sibling tests still pass). The ~30% remaining flake rate is a genuine fork-cancel-cascade hang — separate from this fix but no longer contaminates. Module-level NOTE comment explains the rationale so future readers don't re-introduce the bug. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `530160fa69`)	2026-06-09 20:27:26 -04:00
Gud Boi	3315a8a292	Add opt-in `reap_subactors_per_test` fixture Function-scoped, NON-autouse zombie-subactor reaper for modules whose teardown is known-leaky enough to cascade-fail every following test in a session. Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The session-scoped one fires at session end — too late to save tests that follow a hung/leaky test in the suite. The new fixture, opted into via `pytestmark = pytest.mark.usefixtures(...)`, runs between tests in a problem-module so a leftover subactor from test N can't squat on registrar ports / UDS paths / shm segments needed by tests N+1, N+2, ... Intentionally NOT autouse — the fixture's presence on a module signals "this module's teardown leaks; please root-cause instead of relying forever on cleanup". A visibility-vs-convenience trade picked in favor of the former. Apply to `tests/test_infected_asyncio.py` since both recent full-suite runs (parallel-tpt-proto + TCP-only) showed the cascade originating in this file's KBI- and SIGINT-flavored tests under `main_thread_forkserver`. Module-comment names the specific offenders so future de-flake work has a starting point. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `b376eb0332`)	2026-06-09 20:27:26 -04:00
Gud Boi	b7056d8da9	Fix `_testing.addr.get_rando_addr` cross-process collisions Previously the random port was a default-arg expression (`_rando_port: str = random.randint(1000, 9999)`) — evaluated ONCE at module import time, making it a per-process singleton. Two parallel pytest sessions had a 1/9000 birthday-pair chance of picking the same port; when it hit, every `reg_addr`-using test in BOTH runs would cascade-fail with "Address already in use". Switch to per-call `random.randint()` salted with `os.getpid()` so: - within one session: two calls return distinct ports — e.g. `test_tpt_bind_addrs::bind-subset-reg` now actually gets two different reg addrs on the TCP backend (it was silently duplicating before), - across parallel sessions: pid salt biases each process's port choices apart, making cross-run collisions vanishingly rare. Drop the bogus `: str` annotation (was always `int`). UDS already gets per-process isolation via `UDSAddress.get_random()`'s `@<pid>` socket-path suffix, so no change needed there. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `7c5dd4d033`)	2026-06-09 20:27:26 -04:00
Gud Boi	5b08c6b034	Sweep `subint_forkserver` → `main_thread_forkserver` in code After the variant-1 / variant-2 backend split, update remaining string-match refs to the variant-1 backend so user-visible gates + skip-marks + comments name the working backend correctly: - `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include `main_thread_forkserver`, drop the stub-only `subint_forkserver` entry. - `tests/test_spawning.py::test_loglevel_propagated_to_subactor`: capfd-skip flips to `main_thread_forkserver`. - `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`: xfail-condition flips to `main_thread_forkserver`. - `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`" reason-text since the `mp.SharedMemory(track=False)` + resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass; the skip-mark only fires on plain `subint` now. - Comment / docstring sweep: `runtime._state`, `runtime._runtime`, `_testing.pytest`, `_subint.py`, `pyproject.toml`, `test_cancellation.py`, `test_registrar.py` — refs to variant-1 backend updated. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `205382a39b`) (factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)	2026-06-09 20:27:26 -04:00
Gud Boi	1f7403abc2	Wire `reg_addr` into `test_context_stream_semantics` Same wire-up pattern as the prior `test_dynamic_pub_sub` commit: each test that already pulled in `debug_mode` now also pulls in `reg_addr` and passes `registry_addrs=[reg_addr]` into `tractor.open_nursery()`, so the suite's standard registry-addr conventions apply. Tests touched: - `test_started_misuse` - `test_simple_context` - `test_parent_cancels` - `test_one_end_stream_not_opened` - `test_maybe_allow_overruns_stream` - `test_ctx_with_self_actor` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `66f1941f46`)	2026-06-09 20:27:26 -04:00
Gud Boi	ed00b75a7b	Wire `test_dynamic_pub_sub` to standard fixtures Pull in the `reg_addr`, `debug_mode`, and `test_log` fixtures so this test follows the same conventions as the rest of the suite: - pass `registry_addrs=[reg_addr]` + `debug_mode` into `tractor.open_nursery()` (so `--tpdb` etc work). - after the `pytest.raises` block, add `assert err` + `test_log.exception('Timed out AS EXPECTED')` so the expected timeout is logged explicitly instead of swallowed. Also, - drop whitespace-only blank lines around the `subs` param of `consumer()` and `ctx` param of `one_task_streams_and_one_handles_reqresp()`. - promote `test_sigint_both_stream_types`'s one-line docstring to multi-line form. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `9b05f659b3`)	2026-06-09 20:27:26 -04:00
Gud Boi	8daf8eeaca	Bump `test_stale_entry_is_deleted`'s timeout to 30 Seems that when run in-suite it takes longer than the "happy path" timing measured in isolation; better to avoid a suite-global interruption than to assert on a fast single-test run. (cherry picked from commit `65fcfbf224`)	2026-06-09 20:27:26 -04:00
Gud Boi	8598da2d3a	Add `--shm` orphan sweep to `tractor-reap` Since `tractor.ipc._mp_bs.disable_mantracker()` turns off `mp.resource_tracker` entirely (see the conc-anal doc `subint_forkserver_mp_shared_memory_issue.md`), a hard-crashing actor can leave `/dev/shm/<key>` segments that nothing else GCs. New `tractor-reap` phase 2 sweeps them. Deats, - `tractor/_testing/_reap.py`: add `find_orphaned_shm()` + `reap_shm()` helpers. Match criteria: regular file under `/dev/shm`, owned by current uid, AND no live proc has it open (mmap'd or fd-held). In-use enumeration via `psutil.Process.memory_maps()` + `.open_files()` — xplatform, kernel-canonical (same answer `lsof` would give), no reliance on tractor-specific shm-key naming. - `_ensure_shm_supported()` guard: helpers raise `NotImplementedError` outside Linux/FreeBSD bc macOS POSIX shm has no fs-visible path (`shm_open` only) and Windows is a different story. - `scripts/tractor-reap`: new `--shm` (run after process reap) and `--shm-only` (skip process phase) flags. `-n` dry-runs both phases. Exit code is `1` if either phase had survivors/errors. - `pyproject.toml` + `uv.lock`: add `psutil>=7.0.0` to the `testing` dep group; lazy-imported in `_reap.py` so the process-reap path stays import-clean without it. Also, - doc `--shm` in `.claude/skills/run-tests/SKILL.md` (new section 10c) — covers match criteria + the preservation guarantee for unrelated apps. - flip mitigation status in `subint_forkserver_mp_shared_memory_issue.md` from "could extend `tractor-reap`" to "implemented", with a note that callers should still UUID-pin shm keys to avoid cross-session collisions. Verified locally vs 81 in-use segments held by `piker`, `lttng-ust-`, `aja-shm-` — all preserved; only the genuinely-orphaned tractor segments got unlinked. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `4f12d69b41`) (factored: dropped subint_forkserver conc-anal doc update)	2026-06-09 20:27:26 -04:00
Gud Boi	2e2977b74c	Fix `SharedMemory` under `subint_forkserver` Implements the resolution described in c99d475d's `subint_forkserver_mp_shared_memory_issue.md` (now updated with the resolution post-mortem). Two-part fix that side-steps `mp.resource_tracker` entirely rather than try to make it fork-safe — turns out that's both simpler AND more correct given tractor already SC-manages allocation lifetimes. Deats, - `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the `platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches now run unconditionally: * monkey-patch `mp.resource_tracker._resource_tracker` to a no-op `ManTracker` subclass (empty `register` / `unregister` / `ensure_running`). * return `partial(SharedMemory, track=False)` for the per-allocation opt-out. * belt + suspenders: even if something dodges the wrapper, the singleton can't talk to the inherited (broken) parent fd. - `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional skip of the unlink-callback; install a `try_unlink()` wrapper that swallows `FileNotFoundError` (sibling-already-cleaned race in shared-key setups). Without `mp.resource_tracker` doing it for us, we own the unlink — `actor.lifetime_stack` is the right place since tractor already controls actor lifecycle. - `tests/test_shm.py`: uncomment-out `subint_forkserver` from the module-level skip-list (tests pass now). Inline comment cross-refs the two `_mp_bs` / `_shm` workarounds. - `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`: heavy rewrite — flips status from "open / unresolvable in tractor" to "resolved, kept as decision record". Adds Resolution section, "Why this is the right call" rationale (mp tracker is widely criticized; tractor already owns lifecycle), trade-offs (crash-leaked segments, lost mp leak warning), verification (7 passed under both `subint_forkserver` and `trio` backends), and upstream issue links. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `aa3e230926`) (factored: dropped subint_forkserver conc-anal doc update)	2026-06-09 20:27:26 -04:00
Gud Boi	da0c457ff7	Document `SharedMemory` × `subint_forkserver` incompat New `ai/conc-anal/` doc: `mp.SharedMemory` is fork-without-exec unsafe — child inherits parent's `resource_tracker` fd → EBADF on first shm op; leaked `/shm_list` cascades `FileExistsError` across parametrize variants. Canonical CPython issue class, NOT a tractor bug. Includes two longer-term mitigation paths (reset inherited tracker fd vs migrate off `mp.shared_memory`). Also, update `tests/test_shm.py`: - comment out `subint_forkserver` from skip list - rewrite reason with precise failure-mode descriptions + link to the analysis doc (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `c99d475d03`) (factored: dropped spawn-backend-only paths: ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md)	2026-06-09 20:27:26 -04:00
Gud Boi	352adc64a8	Add `tractor-reap` CLI + document auto-reap New `scripts/tractor-reap` CLI wraps the `_testing._reap` mod for manual zombie-subactor cleanup after crashed pytest sessions. Two modes: - orphan-mode (default): finds PPid==1 procs with cwd matching repo root + `python` in cmdline. - descendant-mode (`--parent <pid>`): scoped sweep under a still-live supervisor. SC-polite: SIGINT with bounded grace window (default 3s) before escalating to SIGKILL. Exit code signals whether escalation was needed (useful for CI health-checks). Also, document both the auto-reap fixture and the CLI in `/run-tests` SKILL.md (section 10). (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `6d76b60404`)	2026-06-09 20:27:26 -04:00
Gud Boi	2df05e8225	Add `_testing._reap` + auto-reap fixture Zombie-subactor cleanup for the test suite, SC-polite discipline (`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts: a shared reaper module + an autouse session-end fixture that runs it. Deats, - new `tractor/_testing/_reap.py` (+230 LOC) — Linux-only reaper using `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes: - `find_descendants(parent_pid)` for the in-session case (PPid-direct-match while pytest is still alive). - `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1` reparented to init + `cwd` filter to repo root + `python` cmdline filter). - `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT all, poll up to `grace` for exit, SIGKILL any survivors. Returns `(signalled, killed)` for caller-side reporting. - new `_reap_orphaned_subactors` session-scoped autouse fixture in `tractor/_testing/pytest.py` — after `yield`, runs `find_descendants(os.getpid())` + `reap(...)` so each pytest session leaves no surviving forks. - companion CLI scaffolding lives at `scripts/tractor-reap` (separate commit) for the pytest-died-mid-session case where the in-session fixture didn't get to run. Also, - promote `from tractor.spawn._spawn import SpawnMethodKey` to module-top in `pytest.py` (was inline-imported inside `pytest_generate_tests`), and reuse it in `pytest_collection_modifyitems` to assert each `skipon_spawn_backend` mark arg is a valid spawn-method literal — catches typos at collection time. - inline `# ?TODO` flags running these through the `try_set_backend` checker for stronger validation. Cross-refs `feedback_sc_graceful_cancel_first.md` for the SIGINT-before-SIGKILL discipline rationale. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `eae478f3d5`)	2026-06-09 20:27:26 -04:00
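The SIGINT-first signal ladder that `reap()` is described as performing could look roughly like the following. This is a hypothetical re-implementation, not the code in `tractor/_testing/_reap.py`: it mirrors only the described `reap(pids, *, grace=3.0, poll=0.25)` signature and `(signalled, killed)` return value, and it assumes Linux `/proc/<pid>/status` for liveness checks (the `_alive()` helper is invented for this sketch and treats zombies as already exited).

```python
import os
import signal
import time


def _alive(pid: int) -> bool:
    # Linux-only (hypothetical helper): a missing `/proc/<pid>`
    # means the process is gone; a zombie (`Z`) or dead (`X`)
    # state counts as exited since it can no longer run.
    try:
        with open(f'/proc/{pid}/status') as f:
            for line in f:
                if line.startswith('State:'):
                    return line.split()[1] not in ('Z', 'X')
    except (FileNotFoundError, ProcessLookupError):
        return False
    return False


def reap(
    pids: list[int],
    *,
    grace: float = 3.0,
    poll: float = 0.25,
) -> tuple[list[int], list[int]]:
    # 1. SC-polite first: SIGINT everything still around
    signalled: list[int] = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGINT)
        except ProcessLookupError:
            continue  # already exited
        signalled.append(pid)

    # 2. bounded grace window, polling for exit
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not any(_alive(pid) for pid in signalled):
            break
        time.sleep(poll)

    # 3. escalate: SIGKILL only the survivors
    killed: list[int] = []
    for pid in signalled:
        if not _alive(pid):
            continue
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue
        killed.append(pid)

    return signalled, killed
```

A non-empty `killed` list is the "graceful teardown didn't finish" signal, which is what the companion CLI's exit code reports.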
Gud Boi	13053f9cbe	Skip `test_loglevel_propagated_to_subactor` on subint forkserver too (cherry picked from commit `2ca0f41e61`)	2026-06-09 20:22:23 -04:00
Gud Boi	a199aa5096	Wire `reg_addr` through infected-asyncio tests Continues the hygiene pattern from `de601676` (cancel tests) into `tests/test_infected_asyncio.py`: many tests here were calling `tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing on the default `:1616` registry across sessions. Thread the session-unique `reg_addr` through so leaked or slow-to-teardown subactors from a prior test can't cross-pollute. Deats, - add `registry_addrs=[reg_addr]` to `open_nursery()` calls in suite where missing. - `test_sigint_closes_lifetime_stack`: - add `reg_addr`, `debug_mode`, `start_method` fixture params - `delay` now reads the `debug_mode` param directly instead of calling `tractor.debug_mode()` (fires slightly earlier in the test lifecycle) - sanity assert `if debug_mode: assert tractor.debug_mode()` after nursery open - new print showing SIGINT target (`send_sigint_to` + resolved pid) - catch `trio.TooSlowError` around `ctx.wait_for_result()` and conditionally `pytest.xfail` when `send_sigint_to == 'child' and start_method == 'subint_forkserver'` — the known orphan-SIGINT limitation tracked in `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` - parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `b350aa09ee`)	2026-06-09 20:22:23 -04:00
Gud Boi	ba2e474d9d	Import-or-skip `.devx.` tests requiring `greenback` Which is for sure true on py3.14+ rn since `greenlet` didn't want to build for us (yet). (cherry picked from commit `d6e70e9de4`)	2026-06-09 20:22:23 -04:00
Gud Boi	e4c7ac34db	Default `pytest` to use `--capture=sys` Lands the capture-pipe workaround from the prior cluster of diagnosis commits: switch pytest's `--capture` mode from the default `fd` (redirects fd 1,2 to temp files, which fork children inherit and can deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd 1,2 left alone). Trade-off documented inline in `pyproject.toml`: - LOST: per-test attribution of raw-fd output (C-ext writes, `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI capture, just not per-test-scoped in the failure report. - KEPT: `print()` + `logging` capture per-test (tractor's logger uses `sys.stderr`). - KEPT: `pytest -s` debugging behavior. This allows us to re-enable `test_nested_multierrors` without skip-marking + clears the class of pytest-capture-induced hangs for any future fork-based backend tests. Deats, - `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of rationale comment cross-ref'ing the post-mortem doc - `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')` from `test_nested_multierrors` — no longer needed. * file-level `pytestmark` covers any residual. - `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail mark loosened from `strict=True` to `strict=False` + reason rewritten. * it passes in isolation but is session-env-pollution sensitive (leftover subactor PIDs competing for ports / inheriting harness FDs). * tolerate both outcomes until suite isolation improves. - `test_shm`: extend the existing `skipon_spawn_backend('subint', ...)` to also skip `'subint_forkserver'`. * Different root cause from the cancel-cascade class: `multiprocessing.SharedMemory`'s `resource_tracker` + internals assume fresh-process state, don't survive fork-without-exec cleanly - `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test (unrelated to forkserver; just a flaky-under-load bump). 
- `tractor.spawn._subint_forkserver`: inline comment-only future-work marker right before `_actor_child_main()` describing the planned conditional stdout/stderr-to-`/dev/null` redirect for cases where `--capture=sys` isn't enough (no code change — the redirect logic itself is deferred). EXTRA NOTEs ----------- The `--capture=sys` approach is the minimum-invasive fix: just a pytest ini change, no runtime code change, works for all fork-based backends, trade-offs well-understood (terminal-level capture still happens, just not pytest's per-test attribution of raw-fd output). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `4c133ab541`) (factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies)	2026-06-09 20:22:23 -04:00
Gud Boi	45c442060b	Codify capture-pipe hang lesson in skills Encode the hard-won lesson from the forkserver cancel-cascade investigation into two skill docs so future sessions grep-find it before spelunking into trio internals. Deats, - `.claude/skills/conc-anal/SKILL.md`: - new "Unbounded waits in cleanup paths" section — rule: bound every `await X.wait()` in cleanup paths with `trio.move_on_after()` unless the setter is unconditionally reachable. Recent example: `ipc_server.wait_for_no_more_peers()` in `async_main`'s finally (was unbounded, deadlocked when any peer handler stuck) - new "The capture-pipe-fill hang pattern" section — mechanism, grep-pointers to the existing `conftest.py` guards (`tests/conftest.py:258`, `:316`), cross-ref to the full post-mortem doc, and the grep-note: "if a multi-subproc tractor test hangs, `pytest -s` first, conc-anal second" - `.claude/skills/run-tests/SKILL.md`: new "Section 9: The pytest-capture hang pattern (CHECK THIS FIRST)" with symptom / cause / pre-existing guards to grep / three-step debug recipe (try `-s`, lower loglevel, redirect stdout/stderr) / signature of this bug vs. a real code hang / historical reference Cost several investigation sessions before the capture-pipe issue surfaced — it was masked by deeper cascade deadlocks. Once the cascades were fixed, the tree tore down enough to generate pipe-filling log volume. Lesson: grep this pattern first when any multi-subproc tractor test hangs under default pytest but passes with `-s`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `4106ba73ea`)	2026-06-09 20:21:58 -04:00
Gud Boi	828df7df79	Update `subint_forkserver` skip reason: capture-pipe Refresh the `test_nested_multierrors` skip-mark reason to the final diagnosis: the hang is pytest's default `--capture=fd` pipe filling from high-volume subactor traceback output inherited via fds 1,2 in fork children — `pytest -s` passes cleanly. Records the fix direction (redirect child stdio to `/dev/null` in the fork-child prelude) for whoever lands the backend. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `eceed29d4a`) (factored: kept only the tests/test_cancellation.py skip-reason update of "Pin forkserver hang to pytest `--capture=fd`"; dropped the subint conc-anal doc + tests/spawn/test_subint_forkserver.py)	2026-06-09 20:21:58 -04:00
Gud Boi	41813ac9a0	Bound peer-clear wait in `async_main` finally Fifth diagnostic pass pinpointed the hang to `async_main`'s finally block — every stuck actor reaches `FINALLY ENTER` but never `RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()` never returns when a peer-channel handler is stuck: the `_no_more_peers` Event is set only when `server._peers` empties, and stuck handlers keep their channels registered. Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout that records the still-connected peer count. 3s is enough for any graceful cancel-ack round-trip; beyond that we're in bug territory and need to proceed with local teardown so the parent's `_ForkedProc.wait()` can unblock. Defensive-in-depth regardless of the underlying bug — a local finally shouldn't block on remote cooperation forever. Verified: with this fix, ALL 15 actors reach `async_main: RETURNING` (up from 10/15 before). Test still hangs past 45s though — there's at least one MORE unbounded wait downstream of `async_main`. Candidates enumerated in the doc update (`open_root_actor` finally / `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps` finally). Skip-mark stays on `test_nested_multierrors[subint_forkserver]`. Also updates `subint_forkserver_test_cancellation_leak_issue.md` with the new pinpoint + summary of the 6-item investigation win list: 1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed 2. pidfd-based `_ForkedProc.wait` — cancellable 3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable 4. `wait_for_no_more_peers` bound — THIS commit 5. Ruled-out hypotheses: tree-kill missing, stuck socket recv, capture-pipe fill (all wrong) 6. 
Remaining unknown: at least one more unbounded wait in the teardown cascade above `async_main` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `e312a68d8a`) (factored: dropped subint_forkserver conc-anal doc update)	2026-06-09 20:21:26 -04:00
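The bounded-wait pattern from this commit can be sketched with stdlib `asyncio` so it runs without trio. Note that this swaps `trio.move_on_after(3.0)` for `asyncio.wait_for()`, and that the function name and the `peers` mapping are hypothetical stand-ins for `ipc_server.wait_for_no_more_peers()` and `server._peers`.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def wait_for_no_more_peers_bounded(
    no_more_peers: asyncio.Event,
    peers: dict,
    timeout: float = 3.0,
) -> bool:
    '''
    Hypothetical sketch: wait for the "no more peers" event but
    give up after `timeout` seconds so local teardown can proceed.

    Returns `True` if the event fired in time, `False` on timeout.
    '''
    try:
        await asyncio.wait_for(no_more_peers.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        # record how many peers are still registered (i.e. stuck)
        log.warning(
            'Timed out after %ss waiting for peers to clear, '
            '%d peer(s) still connected',
            timeout,
            len(peers),
        )
        return False
```

Returning a bool instead of raising lets the caller's `finally` block continue local teardown either way, which is the point of the bound.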
Gud Boi	1d70d33d9a	Claude-perms: ensure /commit-msg files can be written! (cherry picked from commit `76d12060aa`)	2026-06-09 20:20:52 -04:00
Gud Boi	555f64fdf2	Skip-mark `subint_forkserver` nested-multierror hang Skip-mark the still-hanging `test_nested_multierrors[subint_forkserver]` via `@pytest.mark.skipon_spawn_backend('subint_forkserver', reason=...)` so it stops blocking the test matrix while the remaining bug is being chased. The mark is an inert no-op until that (in-dev) backend lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `506617c695`) (factored: kept only the tests/test_cancellation.py skip-mark; dropped the subint_forkserver conc-anal doc update)	2026-06-09 20:20:29 -04:00
Gud Boi	d9a99d9c48	Break parent-chan shield during teardown Completes the nested-cancel deadlock fix started in `0cd0b633` (fork-child FD scrub) and `fe540d02` (pidfd-cancellable wait). The remaining piece: the parent-channel `process_messages` loop runs under `shield=True` (so normal cancel cascades don't kill it prematurely), and relies on EOF arriving when the parent closes the socket to exit naturally. Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is reliable — parent's teardown closes the handler-task socket deterministically. But fork-based backends like `subint_forkserver` share enough process-image state that EOF delivery becomes racy: the loop parks waiting for an EOF that only arrives after the parent finishes its own teardown, but the parent is itself blocked on `os.waitpid()` for THIS actor's exit. Mutual wait → deadlock. Deats, - `async_main` stashes the cancel-scope returned by `root_tn.start(...)` for the parent-chan `process_messages` task onto the actor as `_parent_chan_cs` - `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` + `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to explicitly break the shield — no more waiting for EOF delivery, unwinding proceeds deterministically regardless of backend - inline comments on both sites explain the mutual-wait deadlock + why the explicit cancel is backend-agnostic rather than a forkserver-specific workaround With this + the prior two fixes, the `subint_forkserver` nested-cancel cascade unwinds cleanly end-to-end. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `8ac3dfeb85`)	2026-06-09 20:19:56 -04:00
Gud Boi	222784ccc8	Use SIGINT-first ladder in `run-tests` cleanup The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test-cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `70d58c4bd2`)	2026-06-09 20:19:56 -04:00
Gud Boi	9c3fc19f35	Wire `reg_addr` through leaky cancel tests Stopgap companion to `d0121960` (`subint_forkserver` test-cancellation leak doc): five tests in `tests/test_cancellation.py` were running against the default `:1616` registry, so any leaked `subint-forkserv` descendant from a prior test holds the port and blows up every subsequent run with `TooSlowError` / "address in use". Thread the session-unique `reg_addr` fixture through so each run picks its own port — zombies can no longer poison other tests (they'll only cross-contaminate whatever happens to share their port, which is now nothing). Deats, - add `reg_addr: tuple` fixture param to: - `test_cancel_infinite_streamer` - `test_some_cancels_all` - `test_nested_multierrors` - `test_cancel_via_SIGINT` - `test_cancel_via_SIGINT_other_task` - explicitly pass `registry_addrs=[reg_addr]` to the two `open_nursery()` calls that previously had no kwargs at all (in `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`) - add bounded `@pytest.mark.timeout(7, method='thread')` to `test_nested_multierrors` so a hung run doesn't wedge the whole session Still doesn't close the real leak — the `subint_forkserver` backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren survive teardown regardless of registry port. This commit is just blast-radius containment until that fix lands. See `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `1af2121057`)	2026-06-09 20:19:56 -04:00
Gud Boi	1ebe15db3b	Add zombie-actor check to `run-tests` skill Fork-based backends (esp. `subint_forkserver`) can leak child actor processes on cancelled / SIGINT'd test runs; the zombies keep the tractor default registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`) bound, so every subsequent session can't bind and 50+ unrelated tests fail with the same `TooSlowError` / "address in use" signature. Document the pre-flight + post-cancel check as a mandatory step 4. Deats, - primary signal: `ss -tlnp | grep ':1616'` for a bound TCP registry listener — the authoritative check since :1616 is unique to our runtime - `pgrep -af` scoped to `$(pwd)/py[0-9]/bin/python. _actor_child_main\|subint-forkserv` for leftover actor/forkserver procs — scoped deliberately so we don't false-flag legit long-running tractor-embedding apps like `piker` - `ls /tmp/registry@.sock` for stale UDS sockets - scoped cleanup recipe (SIGTERM + SIGKILL sweep using the same `$(pwd)/py` pattern, UDS `rm -f`, re-verify) plus a fallback for when a zombie holds :1616 but doesn't match the pattern: `ss -tlnp` → kill by PID - explicit false-positive warning calling out the `piker` case (`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a bare `pgrep` doesn't lead to nuking unrelated apps Goal: short-circuit the "spelunking into test code" rabbit-hole when the real cause is just a leaked PID from a prior session, without collateral damage to other tractor-embedding projects on the same box. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `d093c31979`)	2026-06-09 20:19:56 -04:00
Gud Boi	5ab2739b40	Enable `debug_mode` for `subint_forkserver` The `subint_forkserver` backend's child runtime is trio-native (uses `_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`), so `tractor.devx.debug._tty_lock` works in those subactors. Wire the runtime gates that historically hard-coded `_spawn_method == 'trio'` to recognize this third backend. Deats, - new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root` listing the spawn backends whose subactor runtime is trio-native (`'trio'`, `'subint_forkserver'`). Both the enable-site (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset key off the same tuple — keep them in lockstep when adding backends - `open_root_actor`'s `RuntimeError` for unsupported backends now reports the full compatible-set + the rejected method instead of the stale "only `trio`" msg. - `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds `'subint_forkserver'` to the existing `('trio', 'subint')` tuple — fork child-side runtime receives the same SpawnSpec IPC handshake as the others. - `subint_forkserver_proc` child-target now passes `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so `Actor.pformat()` / log lines reflect the actual parent-side spawn mechanism rather than masquerading as plain `trio`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit `8bcbe730bf`)	2026-06-09 20:19:56 -04:00
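A minimal sketch of keying both the enable-site and the reset-site off one shared tuple, as this commit describes. The constant and both function names below are hypothetical stand-ins; only the backend names, the `_debug_mode` key and the `RuntimeError` behaviour come from the commit message.

```python
# hypothetical stand-in for `tractor._root._DEBUG_COMPATIBLE_BACKENDS`
DEBUG_COMPATIBLE_BACKENDS: tuple[str, ...] = ('trio', 'subint_forkserver')


def enable_debug_mode(runtime_vars: dict, spawn_method: str) -> None:
    # reject unsupported backends, reporting the full compatible set
    # plus the rejected method
    if spawn_method not in DEBUG_COMPATIBLE_BACKENDS:
        raise RuntimeError(
            f'`debug_mode` requires one of {DEBUG_COMPATIBLE_BACKENDS!r}, '
            f'got {spawn_method!r}'
        )
    runtime_vars['_debug_mode'] = True


def reset_debug_mode(runtime_vars: dict, spawn_method: str) -> None:
    # keyed off the SAME tuple as the enable-site so adding a backend
    # updates both sites at once
    if spawn_method in DEBUG_COMPATIBLE_BACKENDS:
        runtime_vars['_debug_mode'] = False
```

Sharing one tuple means a new backend only needs to be added in one place for both gates to accept it.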
Gud Boi	fc049abe2a	Refactor `_runtime_vars` into pure get/set API Resetting `_runtime_vars` post-(forking-)spawn was previously only possible via direct mutation of `_state._runtime_vars` from an external module + an inline default dict duplicating the `_state.py`-internal defaults. Split the access surface into a pure getter + explicit setter so such a reset call site becomes a one-liner composition: `set_runtime_vars(get_runtime_vars(clear_values=True))`. Deats `tractor/runtime/_state.py`, - extract initial values into a module-level `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the live `_runtime_vars` is now initialised from `dict(_RUNTIME_VARS_DEFAULTS)` - `get_runtime_vars()` grows a `clear_values: bool = False` kwarg. When True, returns a fresh copy of `_RUNTIME_VARS_DEFAULTS` instead of the live dict — still a pure read, never mutates anything - new `set_runtime_vars(rtvars: dict | RuntimeVars)` — atomic replacement of the live dict's contents via `.clear()` + `.update()`, so existing references to the same dict object remain valid. Accepts either the historical dict form or the `RuntimeVars` struct (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code (cherry picked from commit 7804a9fe57693dd5e15bee6a08e7d2fa14b6a98a) (factored: kept only the tractor/runtime/_state.py part; dropped tractor/spawn/_subint_forkserver.py call-site rewire)	2026-06-09 20:19:56 -04:00
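A sketch of the described get/set split, assuming a hypothetical two-key subset of the real defaults (the actual keys live in `tractor/runtime/_state.py`). One detail any such setter should guard against: if the caller passes the live dict itself, calling `.clear()` first would empty the source too, so this version snapshots `rtvars` before clearing. The `RuntimeVars` struct form is omitted.

```python
from typing import Any

# hypothetical subset of keys, for illustration only
_RUNTIME_VARS_DEFAULTS: dict[str, Any] = {
    '_debug_mode': False,
    '_is_root': False,
}

# the live dict, initialised from (a copy of) the defaults
_runtime_vars: dict[str, Any] = dict(_RUNTIME_VARS_DEFAULTS)


def get_runtime_vars(clear_values: bool = False) -> dict[str, Any]:
    # pure read: a fresh copy of the defaults, or the live dict
    if clear_values:
        return dict(_RUNTIME_VARS_DEFAULTS)
    return _runtime_vars


def set_runtime_vars(rtvars: dict[str, Any]) -> None:
    # snapshot first so passing the live dict itself is safe
    new = dict(rtvars)
    # mutate in place so existing references to the live dict stay valid
    _runtime_vars.clear()
    _runtime_vars.update(new)
```

With this shape the post-spawn reset is the one-liner from the commit message, `set_runtime_vars(get_runtime_vars(clear_values=True))`.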

2613 Commits (6d94a67251de9b0a27705c5671800e01caf12951)
