Continues the hygiene pattern from de601676 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
- add `reg_addr`, `debug_mode`, `start_method`
fixture params
- `delay` now reads the `debug_mode` param directly
instead of calling `tractor.debug_mode()` (fires
slightly earlier in the test lifecycle)
- sanity assert `if debug_mode: assert
tractor.debug_mode()` after nursery open
- new print showing SIGINT target
(`send_sigint_to` + resolved pid)
- catch `trio.TooSlowError` around
`ctx.wait_for_result()` and conditionally
`pytest.xfail` when `send_sigint_to == 'child'
and start_method == 'subint_forkserver'` — the
known orphan-SIGINT limitation tracked in
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).
Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
`os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
`sys.stderr`).
- KEPT: `pytest -s` debugging behavior.
This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.
Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
rationale comment cross-ref'ing the post-mortem doc
- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
from `test_nested_ multierrors` — no longer needed.
* file-level `pytestmark` covers any residual.
- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
mark loosened from `strict=True` to `strict=False` + reason rewritten.
* it passes in isolation but is session-env-pollution sensitive
(leftover subactor PIDs competing for ports / inheriting harness
FDs).
* tolerate both outcomes until suite isolation improves.
- `test_shm`: extend the existing
`skipon_spawn_backend('subint', ...)` to also skip
`'subint_forkserver'`.
* Different root cause from the cancel-cascade class:
`multiprocessing.SharedMemory`'s `resource_tracker` + internals
assume fresh- process state, don't survive fork-without-exec cleanly
- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
(unrelated to forkserver; just a flaky-under-load bump).
- `tractor.spawn._subint_forkserver`: inline comment-only future-work
marker right before `_actor_child_main()` describing the planned
conditional stdout/stderr-to-`/dev/null` redirect for cases where
`--capture=sys` isn't enough (no code change — the redirect logic
itself is deferred).
EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.
Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.
Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.
Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.
Deats,
- `subint_forkserver_test_cancellation_leak_issue
.md`: new section "Update — VERY late: pytest
capture pipe IS the final gate" w/ DIAG timeline
showing `trio.run` fully returns, diagnosis of
pipe-fill mechanism, retrospective on the
earlier wrong ruling-out, and fix direction
(redirect subactor stdout/stderr to `/dev/null`
in fork-child prelude, conditional on
pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
rewritten to describe the capture-pipe gate
specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
orphan-SIGINT test regresses back to xfail.
Previously passed after the FD-hygiene fix,
but the new `wait_for_no_more_peers(
move_on_after=3.0)` bound in `async_main`'s
teardown added up to 3s latency, pushing
orphan-subactor exit past the test's 10s poll
window. Real fix: faster orphan-side teardown
OR extend poll window to 15s
No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested- cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two coordinated improvements to the `subint_forkserver` backend:
1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
abandon_on_cancel=False)` in `_ForkedProc.wait()`
with `trio.lowlevel.wait_readable(pidfd)`. The
prior version blocked a trio cache thread on a
sync syscall — outer cancel scopes couldn't
unwedge it when something downstream got stuck.
Same pattern `trio.Process.wait()` and
`proc_waiter` (the mp backend) already use.
2. Drop the `@pytest.mark.xfail(strict=True)` from
`test_orphaned_subactor_sigint_cleanup_DRAFT` —
the test now PASSES after 0cd0b633 (fork-child
FD scrub). Same root cause as the nested-cancel
hang: inherited IPC/trio FDs were poisoning the
child's event loop. Closing them lets SIGINT
propagation work as designed.
Deats,
- `_ForkedProc.__init__` opens a pidfd via
`os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
then non-blocking `waitpid(WNOHANG)` to collect
the exit status (correct since the pidfd signal
IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
semantics) + `__del__` belt-and-braces for
unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
`# NOTE` comment explaining the historical
context + cross-ref to the conc-anal doc; test
remains in place as a regression guard
The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
two `open_nursery()` calls that previously had no
kwargs at all (in `test_cancel_via_SIGINT` and
`test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
to `test_nested_multierrors` so a hung run doesn't
wedge the whole session
Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:
- `threading.current_thread() is threading.main_thread()`
IS True post-fork — CPython re-designates the
fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread
But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.
Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
(+385 LOC): full writeup — TL;DR, symptom reproducer,
the "intentional cancel path" the bug defeats,
diagnostic evidence (`faulthandler` output +
`getsignal` probe), ruled-out hypotheses
(non-main-thread issue, wakeup-fd inheritance,
KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
`reason` + test docstring rewritten to match the
refined understanding — old wording blamed the
non-main-thread path, new wording points at the
`epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
`child_sigint='trio'` bullet updated: now notes trio's
handler is already correctly installed, so the flag may
end up a no-op / doc-only mode once the real root cause
is fixed
Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.
Deats,
- harness runs the failure-mode sequence out-of-process:
1. harness subprocess runs a fresh Python script
that calls `try_set_start_method('subint_forkserver')`
then opens a root actor + one `sleep_forever` subactor
2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
off harness `stdout` to confirm IPC handshake
completed
3. `SIGKILL` the parent, `proc.wait()` to reap the
zombie (otherwise `os.kill(pid, 0)` keeps reporting
it alive)
4. assert the child survived the parent-reap (i.e. was
actually orphaned, not reaped too) before moving on
5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
bytes-buffer to carry partial lines across calls,
`_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
orphan-reparenting semantics don't generalize to
other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
pids + `proc.kill()` + bounded `proc.wait()` to avoid
leaking orphans across the session
Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.
Deats,
- new `subint_forkserver_proc()` spawn target in
`_subint_forkserver`:
- mirrors `trio_proc()`'s supervision model — real OS
subprocess so `Portal.cancel_actor()` + `soft_kill()`
on graceful teardown, `os.kill(SIGKILL)` on hard-reap
(no `_interpreters.destroy()` race to fuss over bc the
child lives in its own process)
- only real diff from `trio_proc` is the spawn mechanism:
fork from a main-interp worker thread via
`fork_from_worker_thread()` (off-loaded to trio's
thread pool) instead of `trio.lowlevel.open_process()`
- child-side `_child_target` closure runs
`tractor._child._actor_child_main()` with
`spawn_method='trio'` — the child is a regular trio
actor, "subint_forkserver" names how the parent
spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
shim around a raw OS pid: `.poll()` via
`waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
cache thread, `.kill()` via `SIGKILL`, `.returncode`
cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
are `None` (fork-w/o-exec inherits parent FDs; we don't
marshal them) which matches `soft_kill()`'s `is not None`
guards
Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.
Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.
Deats,
- Module-level `pytestmark` on full-file-hanging suites:
- `tests/test_cancellation.py`
- `tests/test_inter_peer_cancellation.py`
- `tests/test_pubsub.py`
- `tests/test_shm.py`
- Per-test decorator where only one test in the file
hangs:
- `tests/discovery/test_registrar.py
::test_stale_entry_is_deleted` — replaces the
inline `if start_method == 'subint': pytest.skip`
branch with a declarative skip.
- `tests/test_subint_cancellation.py
::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
place as breadcrumbs for later finer-grained unskips.
Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
(`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
`tests/conftest.py`, `tests/devx/conftest.py`, and
`tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
`@tractor_test(...)` on `test_cancel_infinite_streamer`
and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
`test_cancel_via_SIGINT`; use `start_method` in the
`'mp' in ...` check instead.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add a hard process-level wall-clock bound on the two
known-hanging subint-backend tests so an unattended
suite run can't wedge indefinitely in either of the
hang classes doc'd in `ai/conc-anal/`.
Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
`@pytest.mark.timeout(3, method='thread')`. The
`method='thread'` choice is deliberate —
`method='signal'` routes via `SIGALRM` which is
starved by the same GIL-hostage path that drops
`SIGINT` (see `subint_sigint_starvation_issue.md`),
so it'd never actually fire in the starvation case.
- `test_subint_non_checkpointing_child`: same
decorator, same reasoning (defense-in-depth over
the inner `trio.fail_after(15)`).
At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.
Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
+ shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
→ `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
`msgspec` PEP 684 (jcrist/msgspec#563)
- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
an orphaned IPC channel after subint teardown — no clean EOF delivered
to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
to fix. Candidate fix: explicit parent-side channel abort in
`subint_proc`'s hard-kill teardown
Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
`trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
captures a stack dump. Kept un- skipped so the dump file is
inspectable
- `test_subint_non_checkpointing_child` (→ delivery class): extend
docstring with a "KNOWN ISSUE" block pointing at the analysis
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lock in the escape-hatch machinery added to `tractor.spawn._subint`
during the Phase B.2/B.3 bringup (issue #379) so future stdlib
regressions or our own refactors don't silently re-introduce the
mid-suite hangs.
Deats,
- `test_subint_happy_teardown`: baseline — spawn a subactor, one portal
RPC, clean teardown. If this breaks, something's wrong unrelated to
the hard-kill shields.
- `test_subint_non_checkpointing_child`: cancel a subactor stuck in
a non-checkpointing Python loop (`threading.Event.wait()` releases the
GIL but never inserts a trio checkpoint). Validates the bounded-shield
+ daemon-driver-thread combo abandons the thread after
`_HARD_KILL_TIMEOUT`.
Every test is wrapped in `trio.fail_after()` for a deterministic
per-test wall-clock ceiling (an unbounded audit would defeat itself) and
arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump
— pytest's stderr capture swallows `faulthandler` output by default.
Gated via `pytest.importorskip('concurrent.interpreters')` and
a module-level skip when `--spawn-backend` isn't `'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
wheeled yet
* on nixos the sdist build was failing due to lack of `g++` which
i don't care to figure out rn since we don't need `.devx` stuff
immediately for this subints prototype.
* [ ] we still need to adjust any dependent suites to skip.
Adjust `test_ringbuf` to skip on import failure.
Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
Import and apply `cpu_scaling_factor()` from
`conftest`; bump base from 3.6 -> 4 and multiply
through so CI boxes with slow CPUs don't flake.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- use `proc.poll() is None` in `sig_prog()` to
distinguish "still running" from exit code 0;
drop stale `breakpoint()` from fallback kill
path (would hang CI).
- add missing `raise` on the `RuntimeError` in
`async_main()` when no tpt bind addrs given.
- clean up stale uid entries from the registrar
`_registry` when addr eviction empties the
addr list.
- update `discovery.__init__` docstring to match
the new eager `._multiaddr` import.
- fix `registar` -> `registrar` typo in teardown
report log msg.
Review: PR #429 (Copilot)
https://github.com/goodboy/tractor/pull/429
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`get_random()` can produce the same UDS filename for a given
pid+actor-state, so the "disjoint addrs" premise doesn't always hold.
Gate the `len(bound) >= 2` assertion on whether the registry and bind
addrs actually differ via `expect_disjoint`.
Also,
- drop unused `partial` import
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Retry signal delivery in `sig_prog()` up to `tries`
times (default 3) w/ `canc_timeout` sleep between
attempts; only fall back to `_KILL_SIGNAL` after all
retries exhaust. Bump default timeout 0.1 -> 0.2.
Also,
- `test_multi_nested_subactors_error_through_nurseries`
gives the first prompt iteration a 5s timeout even
on linux bc the initial crash sequence can be slow
to arrive at a `pdb` prompt
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adjust all imports to match.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Provide a service-table parsing API for downstream projects (like
`piker`) to declare per-actor transport bind addresses as a config map
of actor-name -> multiaddr strings (e.g. from a TOML `[network]`
section).
Deats,
- `EndpointsTable` type alias: input `dict[str, list[str|tuple]]`.
- `ParsedEndpoints` type alias: output `dict[str, list[Address]]`.
- `parse_endpoints()` iterates the table and delegates each entry to the
existing `tractor.discovery._discovery.wrap_address()` helper, which
handles maddr strings, raw `(host, port)` tuples, and pre-wrapped
`Address` objs.
- UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's
internal `/uds/` proto_key)
Also add new tests,
- 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts,
unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`),
empty table, empty actor list
- all 22 multiaddr tests pass rn.
Prompt-IO:
ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add 9 test variants (6 fns) covering all three
`tpt_bind_addrs` code paths in `open_root_actor()`:
- registrar w/ explicit bind (eq, subset, disjoint)
- non-registrar w/ explicit bind (same/diff
bindspace) using `daemon` fixture
- non-registrar default random bind (baseline)
- maddr string input parsing
- registrar merge produces union
- `open_nursery()` forwards `tpt_bind_addrs`
Fix type-mixing bug at `_root.py:446` where the
registrar merge path did `set(Address + tuple)`,
preventing dedup and causing double-bind `OSError`.
Wrap `uw_reg_addrs` before the set union so both
sides are `Address` objs.
Also,
- add prompt-io output log for this session
- stage original prompt input for tracking
Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Use `cpu_scaling_factor()` headroom in
`test_peer_spawns_and_cancels_service_subactor`'s `fail_after` to avoid
flaky timeouts on throttled CI runners. Rename `arbiter_addr=` ->
`registry_addrs=[..]` throughout `test_spawning` and
`test_task_broadcasting` suites to match the current `open_root_actor()`
/ `open_nursery()` API.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop the `.lstrip('/')` on the unix protocol value
so the lib-prepended `/` restores the absolute-path
semantics that `mk_maddr()` strips when encoding.
Pass `Path` components (not `str`) to `UDSAddress`.
Also, update all UDS test params to use absolute
paths (`/tmp/tractor_test/...`, `/tmp/tractor_rt/...`)
matching real runtime sockpath behavior; tighten
`test_parse_maddr_uds` to assert exact `filedir`.
Review: PR #429 (copilot-pull-request-reviewer[bot])
https://github.com/goodboy/tractor/pull/429#pullrequestreview-4018448152
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Cover `parse_maddr()` with unit tests for tcp/ipv4,
tcp/ipv6, uds, and unsupported-protocol error paths,
plus full `addr -> mk_maddr -> str -> parse_maddr`
roundtrip verification.
Adds,
- a `_maddr_to_tpt_proto` inverse-mapping assertion.
- an `wrap_address()` maddr-string acceptance test.
- a `test_reg_then_unreg_maddr` end-to-end suite which audits passing
the registry addr as multiaddr str through the entire runtime.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
All tests are registrar-actor integration scenarios
sharing intertwined helpers + `enable_modules=[__name__]`
task fns, so keep as one mod but rename to reflect
content. Now lives alongside `test_multiaddr.py` in
the new `tests/discovery/` subpkg.
Also,
- update 5 refs in `/run-tests` SKILL.md to match
the new path
- add `discovery/` subdir to the test directory
layout tree
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Cover `_tpt_proto_to_maddr` mapping, TCP (ipv4/ipv6),
UDS, unsupported `proto_key` error, and round-trip
re-parse for both transport types.
Deats,
- new `tests/discovery/` subpkg w/ empty `__init__.py`
- `test_tpt_proto_to_maddr_mapping`: verify `tcp` and
`uds` entries
- `test_mk_maddr_tcp_ipv4`: full assertion on
`/ip4/127.0.0.1/tcp/1234` incl protocol iteration
- `test_mk_maddr_tcp_ipv6`: verify `/ip6/::1/tcp/5678`
- `test_mk_maddr_uds`: relative `filedir` bc the
multiaddr parser rejects double-slash from abs paths
- `test_mk_maddr_unsupported_proto_key`: `ValueError`
on `proto_key='quic'` via `SimpleNamespace` mock
- `test_mk_maddr_roundtrip`: parametrized over tcp +
uds, re-parse `str(maddr)` back through `Multiaddr`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Parametrize `test_loglevel_propagated_to_subactor`
across `'debug'`, `'cancel'`, `'critical'` levels
(was hardcoded to just `'critical'`) and move it
above the parent-main tests for logical grouping.
Also,
- add `start_method: str` annotations throughout
- use `portal.wait_for_result()` in
`test_most_beautiful_word` (replaces `.result()`)
- expand mod docstring to describe test coverage
- reformat `check_parent_main_inheritance` docstr
Review: PR #438 (Copilot)
https://github.com/goodboy/tractor/pull/438
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace the subproc-based test harness with inline
`tractor.open_nursery()` calls that directly check
`actor._parent_main_data` instead of comparing
`__main__.__name__` across a process boundary
(which is a no-op under pytest bc the parent
`__main__` is `pytest.__main__`).
Deats,
- delete `tests/spawn_test_support/` pkg (3 files)
- add `check_parent_main_inheritance()` helper fn
that asserts on `_parent_main_data` emptiness
- rewrite both `run_in_actor` and `start_actor`
parent-main tests as inline async fns
- drop `tmp_path` fixture and unused imports
Review: PR #434 (goodboy, Copilot)
https://github.com/goodboy/tractor/pull/434
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Move the subprocess probe into dedicated spawn test support files so the inheritance tests cover the real __main__ replay path without monkeypatching or inline script strings.
Clean up mutable defaults, give parent-main bootstrap data a named type, and add direct start_actor coverage so the opt-out change is clearer to review.
Use `inherit_parent_main` across the actor APIs and helper to better describe the behavior, and restore the reviewer note at child bootstrap where the inherited `__main__` data is copied from `SpawnSpec`.
Keep actor-owned parent-main capture and let `_mp_figure_out_main()` decide whether to return `__main__` bootstrap data, avoiding the extra SpawnSpec plumbing while preserving the per-actor flag.
Keep trio child bootstrap data in the spawn handshake instead of stashing it on Actor state so the replay opt-out stays explicit and avoids stale-looking runtime fields.
Let actor callers skip replaying the parent __main__ during child startup so downstream integrations can avoid inheriting incompatible bootstrap state without changing the default spawn behavior.
The per-`ctx_key` locking fix in f086222d intended to resolve the
teardown race reproduced by the new test suite, so the test SHOULD now
pass. TLDR, it doesn't Bp
Also add `collapse_eg()` to the test's ctx-manager stack so that when
run with `pytest <...> --tpdb` we'll actually `pdb`-REPL the RTE when it
hits (previously an assert-error).
(this commit-msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
patch branch).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.
Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md
Namely with multiple pre-sleep `delay`-parametrizations before either,
- parent-scope cancel-calling (as originally) or,
- depending on the new `cancel_by_cs: bool` suite parameter, optionally
just immediately exiting from (the newly named)
`maybe_cancel_outer_cs()` a checkpoint.
In the latter case we ensure we **don't** inf sleep to avoid leaking
those tasks into the `Actor._service_tn` (though we should really have
a better soln for this)..
Deats,
- make `cs` args optional and adjust internal logic to match.
- add some notes around various edge cases and issues with using the
actor-service-tn as the scope by default.
- Remove leftover `await an.cancel()` in
`test_registered_custom_err_relayed`; the
nursery already cancels on scope exit.
- Fix `This document` -> `This documents` typo in
`test_unregistered_err_still_relayed` docstring.
Review: PR #426 (Copilot)
https://github.com/goodboy/tractor/pull/426
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add a teensie unit test to match.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Drop the `xfail` test and instead add a new one that ensures the
`tractor._exceptions` fixes enable graceful relay of
remote-but-unregistered error types via the unboxing of just the
`rae.src_type_str/boxed_type_str` content. The test also ensures
a warning is included with remote error content indicating the user
should register their error type for effective cross-actor re-raising.
Deats,
- add `test_unregistered_err_still_relayed`: verify the
`RemoteActorError` IS raised with `.boxed_type`
as `None` but `.src_type_str`, `.boxed_type_str`,
and `.tb_str` all preserved from the IPC msg.
- drop `test_unregistered_boxed_type_resolution_xfail`
since the new above case covers it and we don't need to have
an effectively entirely repeated test just with an inverse assert
as it's last line..
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Verify registered custom error types round-trip correctly over IPC via
`reg_err_types()` + `get_err_type()`.
Deats,
- `TestRegErrTypesPlumbing`: 5 unit tests for the type-registry plumbing
(register, lookup, builtins, tractor-native types, unregistered
returns `None`)
- `test_registered_custom_err_relayed`: IPC end-to-end for a registered
`CustomAppError` checking `.boxed_type`, `.src_type`, and `.tb_str`
- `test_registered_another_err_relayed`: same for `AnotherAppError`
(multi-type coverage)
- `test_unregistered_custom_err_fails_lookup`: `xfail` documenting that
`.boxed_type` can't resolve without `reg_err_types()` registration
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `open_actor_cluster()` teardown hangs
intermittently on UDS when `gather_contexts(mngrs=())`
raises `ValueError` mid-setup; likely a race in the
actor-nursery cleanup vs UDS socket shutdown. TCP
passes reliably (5/5 runs).
- Add `tpt_proto` fixture param to the test
- `pytest.skip()` on UDS with a TODO for deeper
investigation of `._clustering`/`._supervise`
teardown paths
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code