All `daemon` fixture consumers are discovery-
protocol tests now living under `tests/discovery/`.
Move the fixture, its `_wait_for_daemon_ready`
helper, and `test_multi_program.py` into that subdir
so scope matches usage.
Also,
- add `pytestmark` for `track_orphaned_uds_per_test`
+ `detect_runaway_subactors_per_test` to `test_multi_program` as
regression net.
- drop now-unused `_PROC_SPAWN_WAIT` + `socket` import from root
conftest.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit c4082be876)
First draft at resolving,
https://github.com/goodboy/tractor/issues/424
`tests.conftest.py.daemon()` previously used a blind
`time.sleep(_PROC_SPAWN_WAIT + uds_bonus + ci_bonus)` to "wait for the
daemon to come up" before yielding the proc to the test.
Two problems:
1. **Racy under load** — sleep is fixed at design time; loaded boxes
/ cold starts / fork-spawn cost spikes blow past it, leading to
`ConnectionRefusedError` /`OSError: connect failed` flakes in
`test_register_duplicate_name`.
2. **Wasteful when daemon comes up fast** — happy-path pays the FULL
sleep regardless. ~3s of dead time per fixture invocation, ~10-20s
per full suite run.
Replace with `_wait_for_daemon_ready()` — active poll via stdlib
`socket.create_connection` (TCP) or `socket.connect` (UDS) on the
daemon's bind addr, with 50ms backoff and a 10s/15s deadline (CI gets
extra headroom). Daemon-died-during-startup early-exit catches the case
where `_PROC_SPAWN_WAIT` was silently masking daemon startup crashes.
Why stdlib `socket` (Option 2 from the conc-anal doc) instead of
`tractor`'s own `_root.ping_tpt_socket` closure or trio?
- `tractor.run_daemon()` doesn't return from bootstrap until the runtime
is fully ready to handle IPC, so probing listen-side acceptance is
sufficient.
- no need to do the full IPC handshake just to validate readiness.
Sidesteps the `trio.run()` bootstrap cost (~50ms) per fixture too.
`claude`'s verification: 10/10 runs of `tests/test_multi_program.py`
pass on both `--tpt-proto=tcp` and `--tpt-proto=uds`. Per-test wall-time
`test_register_duplicate_name`: 4.31s → 1.10s. Full file: ~12s → 3.27s
per transport.
Doc-tracked at:
`ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md`
Future work — session-scoped trio runtime in a bg thread to share
fixture-side trio operations across many fixtures (currently overkill
for the one fixture that needs it).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit ec8c4659c4)
Use `is_forking_spawner` fixture + gate spawner-
specific expect patterns in nested-error and daemon
tests. Add `set_fork_aware_capture` to multi-sub
tests that need capture-mode awareness.
Deats,
- replace `start_method` param with `is_forking_spawner` bool fixture.
- bump inter-send delay to 0.1s for IPC stability under fork backends.
- gate `bdb.BdbQuit` + relay-uid patterns behind `not
is_forking_spawner` (not visible under capsys).
- add `expect(child, EOF)` to confirm clean exit.
- switch caught exc from `AssertionError` to `ValueError` in daemon
test.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9031605807)
In top level `daemon`-fixture that is..
Use a local `bg_daemon_spawn_delay` instead of
mutating the module-level `_PROC_SPAWN_WAIT` —
previously each `daemon` fixture invocation would
permanently add 1.6s (UDS) or 1s (CI) to the
global, inflating delays across the session.
Also, emit a `test_log.warning()` when verbose
loglevel is silently reduced to `'info'`.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit c4885f9d99)
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.
Standalone repro:
from trio._core._wakeup_socketpair import WakeupSocketpair
ws = WakeupSocketpair()
ws.write_sock.close()
ws.drain() # spins forever
Patch is one-line — break the drain loop on b'' EOF.
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
Under `main_thread_forkserver` the bootstrapping
hook switches to `--capture=sys`, so subactor
fd-level output (tree dumps, zombie-reaper msgs)
isn't captured per-test by pexpect. Gate those
expects behind a `no_capfd` check so the test
passes on both capture modes.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5a9926fc32)
Only override `tractor.log._default_loglevel` when
the flag is explicitly passed — lets per-spawn and
per-example `loglevel` kwargs take effect instead
of being clobbered by the hard-coded `'ERROR'`
default.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 72a0465c52)
Pass explicit `loglevel` to `spawn()` calls in
`test_debugger` tests — required for pexpect
pattern matching now that examples no longer
hard-code log levels.
Also,
- make `expect()` return the decoded `before` str.
- add `start_method` param + fork-backend timeout
slack (+4s) in nested-error test.
- clean up debug examples: drop unused loglevels,
rename `n` -> `an`, fix docstrings, add TODO
comments for tpt parametrize via osenv.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9431a81d37)
Add `disable_pdbp_color()` to the `sync_bp` example
to suppress pygments prompt coloring when
`PYTHON_COLORS=0` — makes pexpect pattern matching
deterministic.
Deats,
- set `loglevel='pdb'` in both script + test spawn.
- disable `enable_stack_on_sig` in example, assert
no `stackscope` output in test.
- update `attach_patts` keys/values with `|_<Task`
/ `|_<Thread` / `|_('subactor'` prefixes to match
actual tree-dump format.
- add call-site patterns (`tractor.pause_from_sync()`
`tractor.pause()`, `breakpoint(hide_tb=...)`).
- trim trailing `\n` from `Lock.repr()` output.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fc2e298a29)
Factor the sub-actor relay loop out of
`dump_tree_on_sig()` into `_relay_sig_to_subactors()`
and chain both dump + relay in a single
`run_sync_soon` callback (`_dump_then_relay`) so the
parent's task-tree flushes BEFORE any sub receives
the signal — fixes a hierarchical-ordering race
where subs could dump ahead of the parent in the
muxed pty stream.
Also,
- gate file/tty sink writes behind `write_file` +
`write_tty` params on `dump_task_tree()`.
- use `actor.aid.uid` instead of deprecated `.uid`.
- update `test_shield_pause` expects to match the
new sequential parent -> relay-log -> sub ordering.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e2b790a70d)
Move `--capture=sys` enforcement from a static ini
flag to a `pytest_load_initial_conftests()` bootstrap
hook that dynamically flips capture mode only when a
fork-based spawner (like `main_thread_forkserver`) is
detected; non-fork backends keep `--capture=fd`.
Also,
- load `tractor._testing.pytest` via `-p` in ini
(bc bootstrapping hooks must register before
conftest `pytest_plugins` runs).
- register `_reap` as sub-plugin via `pytest_plugins`
tuple in `._testing.pytest`.
- drop now-duplicate reap fixtures (already in `_reap`
per 1cdc7fb3).
- rename `tractor_enable_stackscope` dest -> `enable_stackscope`
and pop env var on disable.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 61d4525137)
In `tests/devx/conftest.py::spawn`, refactor the
fixture-internal closures so consumer tests can pass
explicit `start_method`/`loglevel` to each `_spawn()`
invocation rather than only inheriting the fixture-
scoped parametrize values.
Deats,
- promote `set_spawn_method()` and `set_loglevel()`
to take their respective values as fn params (vs
closing over the fixture-scope vars).
- give `_spawn()` `start_method=start_method` and
`loglevel: str|None = None` kwargs so callers
override one-off without re-parametrizing the
suite. NOTE: this drops the implicit fixture-
scoped `loglevel` forward — `_spawn()` callers
now must pass `loglevel=...` explicitly.
- TODO: figure out how `--ll <level>` should map to
the default (currently `None` → uses env-var or
tractor default).
- add a docstring to `_spawn()` so its role as the
consumer-facing closure is obvious from `help()`.
Also,
- `assert_before()` now returns the `.before` output
on success (was `None`); add a one-line docstring
describing the new return contract.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 486249d74f)
In `pyproject.toml`,
- include the `sync_pause` group from `dev`, so dev
installs ship `greenback` for `pause_from_sync()`.
Comment out per-test `@pytest.mark.timeout(...)`
markers in,
- `tests/devx/test_debugger.py`
- `tests/discovery/test_registrar.py`
- `tests/spawn/test_main_thread_forkserver.py`
- `tests/spawn/test_subint_cancellation.py`
- `tests/test_advanced_streaming.py`
- `tests/test_cancellation.py`
The global cap was already dropped (3c366cac); these
were the leftover per-test caps which now block
interactive `pdb` flows under the new spawn backends.
In `uv.lock`,
- pull `greenback` into the resolved `dev` deps
(per the `sync_pause` include above).
- catch up the prior `xonsh` editable→PyPI switch
(from the `pyproject.toml` `tool.uv.sources` edit).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b7115fc875)
(factored: dropped spawn-backend-only paths under tests/spawn/)
Add env-var overrides inside `._root.open_root_actor()` so
devs/test-runs can swap the actor-spawn backend or crank
console verbosity *without* touching application code.
In `._root.open_root_actor()`,
- read `TRACTOR_LOGLEVEL` early, overriding any caller-passed
`loglevel` and stashing an `env_ll_report` to emit once the
console log is set up.
- pull the `loglevel` fallback (`or _default_loglevel`) and
`log.get_console_log()` init *up* so the env-var report
routes through tractor's own logger.
- read `TRACTOR_SPAWN_METHOD`, overriding any caller-passed
`start_method` and warn-logging when the env-var clobbers
an explicit caller value.
Wire the same vars through `tests/devx/conftest.py::spawn`,
- request the `loglevel` fixture, set both `TRACTOR_LOGLEVEL`
and `TRACTOR_SPAWN_METHOD` in `os.environ` before each
`pexpect.spawn()` (inherited by the example subproc).
- expand `supported_spawners` to include
`main_thread_forkserver` and `subint_forkserver` bc
example scripts no longer need per-script CLI plumbing.
- pop both vars in fixture teardown so a leaked value can't
re-route a later in-process tractor test's spawn-backend
or loglevel.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 208e7c0926)
Mirror `060f7d24`'s pattern (backend-aware timeout in
`maybe_expect_raises`) for `test_dynamic_pub_sub`'s hard
`trio.fail_after` cap. Fork-based backends pay per-spawn
fork+IPC-handshake cost which stacks over `cpus - 1`
sequential `n.run_in_actor()` calls; empirically 12s
flakes on `main_thread_forkserver` under UDS
cross-pytest contention (#451 / #452).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 12s (unchanged)
Hoist the timeout-pick out of the `main()` closure so the
dispatch happens once in the trio task rather than
re-evaluating per spawn.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 383b0fdd75)
Default `timeout` from `int = 3` → `int|None = None`;
when unset, pick a backend-aware value. Fork-based
backends (`main_thread_forkserver`) need real headroom
bc actor spawn + IPC ctx-exit + msg-validation error
path is much heavier than under `trio` backend —
especially under cross-pytest-stream contention (#451).
Defaults:
- `main_thread_forkserver` → 30s
- everything else → 3s (unchanged)
Empirical flake history that motivated 30s as the floor
on fork backends (all from `test_basic_payload_spec`):
- 3s → all-valid variant flaked w/ `TooSlowError`
- 8s → `invalid-return` variant flaked w/ `Cancelled`
(surfaced instead of `MsgTypeError` bc the
outer `fail_after` fired mid-error-path)
- 15s → flaked under cross-pytest-stream contention
30s gives plenty of headroom while still failing-loud
on a genuine hang. Callers can opt out by passing an
explicit `timeout=` kw.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 060f7d24c4)
Drop `@pytest.mark.timeout(...)` for the per-test wall-clock
cap on `test_dynamic_pub_sub`; rely on `trio.fail_after(12)`
inside `main()` instead.
Both pytest-timeout enforcement modes are incompatible with
trio under fork-based backends:
- `method='signal'` (SIGALRM) synchronously raises `Failed`
in trio's main thread mid-`epoll.poll()`, leaving
`GLOBAL_RUN_CONTEXT` half-installed ("Trio guest run got
abandoned") so EVERY subsequent `trio.run()` in the same
pytest process bails with
`RuntimeError: Attempted to call run() from inside a run()`
— full-session poison.
- `method='thread'` calls `_thread.interrupt_main()` which
can let the KBI escape trio's `KIManager` under fork-
cascade teardown races and bubble out of pytest entirely
— kills the whole session.
`trio.fail_after()` keeps cancellation inside the trio loop:
- Raises `TooSlowError` cleanly through the open-nursery's
cancel cascade.
- Doesn't disturb any out-of-band signal/thread state.
- Failure stays scoped to the single test — no cross-test
global state corruption either way.
Verified empirically: 10 hammer-runs of `test_dynamic_pub_sub`
go from 5/10 fail (with global-state poison) to 3/10 fail
(no poison, all sibling tests still pass). The ~30%
remaining flake rate is a genuine fork-cancel-cascade
hang — separate from this fix but no longer contaminates.
Module-level NOTE comment explains the rationale so future
readers don't re-introduce the bug.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 530160fa69)
Function-scoped, NON-autouse zombie-subactor reaper for
modules whose teardown is known-leaky enough to cascade-
fail every following test in a session.
Sibling to the autouse session-scoped `_reap_orphaned_subactors`. The
session-scoped one fires at session end — too late to save tests that
follow a hung/leaky test in the suite. The new fixture, opted into via
`pytestmark = pytest.mark.usefixtures(...)`, runs between tests in
a problem-module so a leftover subactor from test N can't squat on
registrar ports / UDS paths / shm segments needed by tests N+1,
N+2, ...
Intentionally NOT autouse — the fixture's presence on a module signals
"this module's teardown leaks; please root-cause instead of relying
forever on cleanup". A visibility-vs-convenience trade picked in favor
of the former.
Apply to `tests/test_infected_asyncio.py` since both recent full-suite
runs (parallel-tpt-proto + TCP-only) showed the cascade originating in
this file's KBI- and SIGINT-flavored tests under
`main_thread_forkserver`. Module-comment names the specific offenders so
future de-flake work has a starting point.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b376eb0332)
After the variant-1 / variant-2 backend split, update remaining
string-match refs to the variant-1 backend so user-visible gates
+ skip-marks + comments name the working backend correctly:
- `tractor._root._DEBUG_COMPATIBLE_BACKENDS`: include
`main_thread_forkserver`, drop the stub-only `subint_forkserver`
entry.
- `tests/test_spawning.py::test_loglevel_propagated_to_subactor`:
capfd-skip flips to `main_thread_forkserver`.
- `tests/test_infected_asyncio.py::test_sigint_closes_lifetime_stack`:
xfail-condition flips to `main_thread_forkserver`.
- `tests/test_shm.py`: drop stale "broken on `main_thread_forkserver`"
reason-text since the `mp.SharedMemory(track=False)`
+ resource-tracker monkey-patch in `.ipc._mp_bs` makes the tests pass;
the skip-mark only fires on plain `subint` now.
- Comment / docstring sweep: `runtime._state`, `runtime._runtime`,
`_testing.pytest`, `_subint.py`, `pyproject.toml`,
`test_cancellation.py`, `test_registrar.py` — refs to variant-1
backend updated.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 205382a39b)
(factored: dropped spawn-backend-only path: tractor/spawn/_subint.py)
Same wire-up pattern as the prior `test_dynamic_pub_sub`
commit: each test that already pulled in `debug_mode`
now also pulls in `reg_addr` and passes
`registry_addrs=[reg_addr]` into `tractor.open_nursery()`,
so the suite's standard registry-addr conventions apply.
Tests touched:
- `test_started_misuse`
- `test_simple_context`
- `test_parent_cancels`
- `test_one_end_stream_not_opened`
- `test_maybe_allow_overruns_stream`
- `test_ctx_with_self_actor`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 66f1941f46)
Pull in the `reg_addr`, `debug_mode`, and `test_log`
fixtures so this test follows the same conventions as
the rest of the suite:
- pass `registry_addrs=[reg_addr]` + `debug_mode` into
`tractor.open_nursery()` (so `--tpdb` etc work).
- after the `pytest.raises` block, add `assert err` +
`test_log.exception('Timed out AS EXPECTED')` so the
expected timeout is logged explicitly instead of
swallowed.
Also,
- drop whitespace-only blank lines around the
`subs` param of `consumer()` and `ctx` param of
`one_task_streams_and_one_handles_reqresp()`.
- promote `test_sigint_both_stream_types`'s one-line
docstring to multi-line form.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9b05f659b3)
Seems that when run in-suite it delays more then the so-measured "happy
path" timing; better to have no suite-global interruption then asserting
a fast single test's run.
(cherry picked from commit 65fcfbf224)
Implements the resolution described in c99d475d's
`subint_forkserver_mp_shared_memory_issue.md` (now
updated with the resolution post-mortem). Two-part
fix that side-steps `mp.resource_tracker` entirely
rather than try to make it fork-safe — turns out
that's both simpler AND more correct given tractor
already SC-manages allocation lifetimes.
Deats,
- `tractor/ipc/_mp_bs.py::disable_mantracker()`: drop the
`platform.python_version_tuple()[:-1] >= ('3', '13')` branch — patches
now run unconditionally:
* monkey-patch `mp.resource_tracker. _resource_tracker` to a no-op
`ManTracker` subclass (empty `register` / `unregister`
/ `ensure_running`).
* return `partial(SharedMemory, track=False)` for the per-allocation
opt-out.
* belt + suspenders: even if something dodges the wrapper, the
singleton can't talk to the inherited (broken) parent fd.
- `tractor/ipc/_shm.py::open_shm_list()`: drop the 3.13+ conditional
skip of the unlink-callback; install a `try_unlink()` wrapper that
swallows `FileNotFoundError` (sibling-already-cleaned race in
shared-key setups). Without `mp.resource_tracker` doing it for us, we
own the unlink — `actor. lifetime_stack` is the right place since
tractor already controls actor lifecycle.
- `tests/test_shm.py`: uncomment-out `subint_forkserver` from the
module-level skip- list (tests pass now). Inline comment cross-refs
the two `_mp_bs` / `_shm` workarounds.
- `ai/conc-anal/subint_forkserver_mp_shared_memory_ issue.md`: heavy
rewrite — flips status from "open / unresolvable in tractor" to
"resolved, kept as decision record". Adds Resolution section, "Why
this is the right call" rationale (mp tracker is widely criticized;
tractor already owns lifecycle), trade-offs (crash-leaked segments,
lost mp leak warning), verification (7 passed under both
`subint_forkserver` and `trio` backends), and upstream issue links
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit aa3e230926)
(factored: dropped subint_forkserver conc-anal doc update)
New `ai/conc-anal/` doc: `mp.SharedMemory` is
fork-without-exec unsafe — child inherits parent's
`resource_tracker` fd → EBADF on first shm op;
leaked `/shm_list` cascades `FileExistsError`
across parametrize variants. Canonical CPython
issue class, NOT a tractor bug. Includes two
longer-term mitigation paths (reset inherited
tracker fd vs migrate off `mp.shared_memory`).
Also, update `tests/test_shm.py`:
- comment out `subint_forkserver` from skip list
- rewrite reason with precise failure-mode
descriptions + link to the analysis doc
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit c99d475d03)
(factored: dropped spawn-backend-only paths: ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md)
Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
- add `reg_addr`, `debug_mode`, `start_method`
fixture params
- `delay` now reads the `debug_mode` param directly
instead of calling `tractor.debug_mode()` (fires
slightly earlier in the test lifecycle)
- sanity assert `if debug_mode: assert
tractor.debug_mode()` after nursery open
- new print showing SIGINT target
(`send_sigint_to` + resolved pid)
- catch `trio.TooSlowError` around
`ctx.wait_for_result()` and conditionally
`pytest.xfail` when `send_sigint_to == 'child'
and start_method == 'subint_forkserver'` — the
known orphan-SIGINT limitation tracked in
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b350aa09ee)
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).
Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
`os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
`sys.stderr`).
- KEPT: `pytest -s` debugging behavior.
This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.
Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
rationale comment cross-ref'ing the post-mortem doc
- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
from `test_nested_ multierrors` — no longer needed.
* file-level `pytestmark` covers any residual.
- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
mark loosened from `strict=True` to `strict=False` + reason rewritten.
* it passes in isolation but is session-env-pollution sensitive
(leftover subactor PIDs competing for ports / inheriting harness
FDs).
* tolerate both outcomes until suite isolation improves.
- `test_shm`: extend the existing
`skipon_spawn_backend('subint', ...)` to also skip
`'subint_forkserver'`.
* Different root cause from the cancel-cascade class:
`multiprocessing.SharedMemory`'s `resource_tracker` + internals
assume fresh- process state, don't survive fork-without-exec cleanly
- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
(unrelated to forkserver; just a flaky-under-load bump).
- `tractor.spawn._subint_forkserver`: inline comment-only future-work
marker right before `_actor_child_main()` describing the planned
conditional stdout/stderr-to-`/dev/null` redirect for cases where
`--capture=sys` isn't enough (no code change — the redirect logic
itself is deferred).
EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4c133ab541)
(factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies)
Refresh the `test_nested_multierrors` skip-mark
reason to the final diagnosis: the hang is pytest's
default `--capture=fd` pipe filling from high-volume
subactor traceback output inherited via fds 1,2 in
fork children — `pytest -s` passes cleanly. Records
the fix direction (redirect child stdio to
`/dev/null` in the fork-child prelude) for whoever
lands the backend.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit eceed29d4a)
(factored: kept only the tests/test_cancellation.py skip-reason update of
"Pin forkserver hang to pytest `--capture=fd`"; dropped the subint
conc-anal doc + tests/spawn/test_subint_forkserver.py)
Skip-mark the still-hanging
`test_nested_multierrors[subint_forkserver]` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test matrix
while the remaining bug is being chased. The mark is
an inert no-op until that (in-dev) backend lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 506617c695)
(factored: kept only the tests/test_cancellation.py skip-mark; dropped
the subint_forkserver conc-anal doc update)
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
two `open_nursery()` calls that previously had no
kwargs at all (in `test_cancel_via_SIGINT` and
`test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
to `test_nested_multierrors` so a hung run doesn't
wedge the whole session
Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1af2121057)
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.
Deats,
- Module-level `pytestmark` on full-file-hanging suites:
- `tests/test_cancellation.py`
- `tests/test_inter_peer_cancellation.py`
- `tests/test_pubsub.py`
- `tests/test_shm.py`
- Per-test decorator where only one test in the file
hangs:
- `tests/discovery/test_registrar.py
::test_stale_entry_is_deleted` — replaces the
inline `if start_method == 'subint': pytest.skip`
branch with a declarative skip.
- `tests/test_subint_cancellation.py
::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
place as breadcrumbs for later finer-grained unskips.
Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
(`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
`tests/conftest.py`, `tests/devx/conftest.py`, and
`tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
`@tractor_test(...)` on `test_cancel_infinite_streamer`
and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
`test_cancel_via_SIGINT`; use `start_method` in the
`'mp' in ...` check instead.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4b2a0886c3)
(factored: dropped spawn-backend-only path: tests/test_subint_cancellation.py)
Add a hard process-level wall-clock bound on a test
known to wedge un-Ctrl-C-ably under an in-dev spawn
backend, so an unattended suite run can't hang
indefinitely.
Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
`@pytest.mark.timeout(3, method='thread')`. The
`method='thread'` choice is deliberate —
`method='signal'` routes via `SIGALRM` which can be
starved by the same GIL-hostage path that drops
`SIGINT`, so it'd never actually fire in the
starvation case.
At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 189f4e3f72e9f1eda5d24bcbab5743f7e35bd913)
(factored: kept pyproject + tests/discovery/test_registrar.py parts of
"Wall-cap `subint` audit tests via `pytest-timeout`"; dropped
tests/test_subint_cancellation.py)
Wrap the test's `trio.run(main)` in
`dump_on_hang(seconds=20)` so any future hang
regression captures a stack dump for triage instead
of wedging CI silently; under the default backends
it's a no-op safety net.
Includes a "KNOWN ISSUE" comment block documenting
the (future) `subint` backend hang classes observed
against this test during Phase B bringup (#379).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4a3254583b)
(factored: kept only the tests/discovery/test_registrar.py part of
"Doc `subint` backend hang classes + arm `dump_on_hang`"; dropped
subint conc-anal docs + tests/test_subint_cancellation.py)
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
wheeled yet
* on nixos the sdist build was failing due to lack of `g++` which
i don't care to figure out rn since we don't need `.devx` stuff
immediately for this subints prototype.
* [ ] we still need to adjust any dependent suites to skip.
Adjust `test_ringbuf` to skip on import failure.
Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
(cherry picked from commit d2ea8aa2de)
Import and apply `cpu_scaling_factor()` from
`conftest`; bump base from 3.6 -> 4 and multiply
through so CI boxes with slow CPUs don't flake.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- use `proc.poll() is None` in `sig_prog()` to
distinguish "still running" from exit code 0;
drop stale `breakpoint()` from fallback kill
path (would hang CI).
- add missing `raise` on the `RuntimeError` in
`async_main()` when no tpt bind addrs given.
- clean up stale uid entries from the registrar
`_registry` when addr eviction empties the
addr list.
- update `discovery.__init__` docstring to match
the new eager `._multiaddr` import.
- fix `registar` -> `registrar` typo in teardown
report log msg.
Review: PR #429 (Copilot)
https://github.com/goodboy/tractor/pull/429
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`get_random()` can produce the same UDS filename for a given
pid+actor-state, so the "disjoint addrs" premise doesn't always hold.
Gate the `len(bound) >= 2` assertion on whether the registry and bind
addrs actually differ via `expect_disjoint`.
Also,
- drop unused `partial` import
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Retry signal delivery in `sig_prog()` up to `tries`
times (default 3) w/ `canc_timeout` sleep between
attempts; only fall back to `_KILL_SIGNAL` after all
retries exhaust. Bump default timeout 0.1 -> 0.2.
Also,
- `test_multi_nested_subactors_error_through_nurseries`
gives the first prompt iteration a 5s timeout even
on linux bc the initial crash sequence can be slow
to arrive at a `pdb` prompt
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adjust all imports to match.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Provide a service-table parsing API for downstream projects (like
`piker`) to declare per-actor transport bind addresses as a config map
of actor-name -> multiaddr strings (e.g. from a TOML `[network]`
section).
Deats,
- `EndpointsTable` type alias: input `dict[str, list[str|tuple]]`.
- `ParsedEndpoints` type alias: output `dict[str, list[Address]]`.
- `parse_endpoints()` iterates the table and delegates each entry to the
existing `tractor.discovery._discovery.wrap_address()` helper, which
handles maddr strings, raw `(host, port)` tuples, and pre-wrapped
`Address` objs.
- UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's
internal `/uds/` proto_key)
Also add new tests,
- 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts,
unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`),
empty table, empty actor list
- all 22 multiaddr tests pass rn.
Prompt-IO:
ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add 9 test variants (6 fns) covering all three
`tpt_bind_addrs` code paths in `open_root_actor()`:
- registrar w/ explicit bind (eq, subset, disjoint)
- non-registrar w/ explicit bind (same/diff
bindspace) using `daemon` fixture
- non-registrar default random bind (baseline)
- maddr string input parsing
- registrar merge produces union
- `open_nursery()` forwards `tpt_bind_addrs`
Fix type-mixing bug at `_root.py:446` where the
registrar merge path did `set(Address + tuple)`,
preventing dedup and causing double-bind `OSError`.
Wrap `uw_reg_addrs` before the set union so both
sides are `Address` objs.
Also,
- add prompt-io output log for this session
- stage original prompt input for tracking
Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use `cpu_scaling_factor()` headroom in
`test_peer_spawns_and_cancels_service_subactor`'s `fail_after` to avoid
flaky timeouts on throttled CI runners. Rename `arbiter_addr=` ->
`registry_addrs=[..]` throughout `test_spawning` and
`test_task_broadcasting` suites to match the current `open_root_actor()`
/ `open_nursery()` API.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop the `.lstrip('/')` on the unix protocol value
so the lib-prepended `/` restores the absolute-path
semantics that `mk_maddr()` strips when encoding.
Pass `Path` components (not `str`) to `UDSAddress`.
Also, update all UDS test params to use absolute
paths (`/tmp/tractor_test/...`, `/tmp/tractor_rt/...`)
matching real runtime sockpath behavior; tighten
`test_parse_maddr_uds` to assert exact `filedir`.
Review: PR #429 (copilot-pull-request-reviewer[bot])
https://github.com/goodboy/tractor/pull/429#pullrequestreview-4018448152
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Cover `parse_maddr()` with unit tests for tcp/ipv4,
tcp/ipv6, uds, and unsupported-protocol error paths,
plus full `addr -> mk_maddr -> str -> parse_maddr`
roundtrip verification.
Adds,
- a `_maddr_to_tpt_proto` inverse-mapping assertion.
- an `wrap_address()` maddr-string acceptance test.
- a `test_reg_then_unreg_maddr` end-to-end suite which audits passing
the registry addr as multiaddr str through the entire runtime.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
All tests are registrar-actor integration scenarios
sharing intertwined helpers + `enable_modules=[__name__]`
task fns, so keep as one mod but rename to reflect
content. Now lives alongside `test_multiaddr.py` in
the new `tests/discovery/` subpkg.
Also,
- update 5 refs in `/run-tests` SKILL.md to match
the new path
- add `discovery/` subdir to the test directory
layout tree
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Cover `_tpt_proto_to_maddr` mapping, TCP (ipv4/ipv6),
UDS, unsupported `proto_key` error, and round-trip
re-parse for both transport types.
Deats,
- new `tests/discovery/` subpkg w/ empty `__init__.py`
- `test_tpt_proto_to_maddr_mapping`: verify `tcp` and
`uds` entries
- `test_mk_maddr_tcp_ipv4`: full assertion on
`/ip4/127.0.0.1/tcp/1234` incl protocol iteration
- `test_mk_maddr_tcp_ipv6`: verify `/ip6/::1/tcp/5678`
- `test_mk_maddr_uds`: relative `filedir` bc the
multiaddr parser rejects double-slash from abs paths
- `test_mk_maddr_unsupported_proto_key`: `ValueError`
on `proto_key='quic'` via `SimpleNamespace` mock
- `test_mk_maddr_roundtrip`: parametrized over tcp +
uds, re-parse `str(maddr)` back through `Multiaddr`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code