trio 0.29 → 0.33 lock bump (c7741bba) slowed the
depth=3 cancel-cascade in `test_nested_multierrors`
from <6s to ~7-8s; the 6s deadline was firing and its
`Cancelled(source='deadline')` (trio 0.33's new
cancel-reason metadata) collapsed a BEG branch,
breaking the `RemoteActorError` assertion downstream.
- Split the `('trio', _)` case-match into per-depth
arms: `('trio', 1)` keeps 6s (still finishes in
~3s); `('trio', 3)` → 12s.
- Updated inline NOTE explains the version pivot +
links the tracking issue
`ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
- Existing MTF/`subint_forkserver` budgets unchanged.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit ea67f1b67b)
A `trio.Nursery.start()`-style wrapper around
`trio.run_process()` that surfaces rc!=0 errors
deterministically, ALWAYS isolates the parent
controlling-tty, and optionally live-relays the child's
std-streams to `log.<level>` per-line. Suits both
short-lived test-runners + long-lived daemons.
`supervise_run_process()`,
- Deterministic rc!=0: pass `check=False` to `trio`
and do our OWN post-drain rc-check from the
supervisor coro body AFTER `own_tn.__aexit__` — NOT
inside the internal nursery, since that would
race-cancel the still-draining relay reader and lose
stderr lines. (Re)build + raise a BARE
`subprocess.CalledProcessError`: `.stderr=` for
programmatic callers + an `add_note()`'d
`|_.stderr:` block for human teardown logs. No
nursery-eg-wrapped CPE to `collapse_eg` around.
- Parent controlling-tty isolation: `stdin=DEVNULL`
always, `stdout=DEVNULL` unless relayed/overridden
(via `stdout=` kwarg w/ `_UNSET` sentinel so explicit
`None` = inherit still works). Prevents a spawned
program from clobbering the launching tty's scrollback
w/ control-seqs.
- Live per-line relay: `relay_stdout=True`/
`relay_stderr=True` → relayed to `log.<relay_level>`
(default `'io'`, our custom level 21). Picked to sort
just above stdlib `INFO`=20 so it shows at usual
`info`/`devx` levels yet stays separately filterable;
`runtime`=15 was REJECTED as a default since it'd be
silently filtered at usual verbosity — footgun for
daemon supervisors whose whole point is visibility.
STREAMED, not buffered-until-exit.
- Non-blocking `tn.start()` semantics: live
`trio.Process` handed up via
`task_status.started()` immediately (else
`tn.start()` would block till child exit, losing
the long-lived-daemon use case). Supervise/relay bg
tasks run to completion in this coro.
- `**run_process_kwargs` forwarded verbatim (env, shell,
cwd, start_new_session, executable, ...); MANAGED keys
(`stdin`/`stdout`/`stderr`/`check`) win on conflict.
- Crash-handling layer intentionally NOT baked in —
compose `maybe_open_crash_handler()` ON TOP at the
call-site.
`_relay_stream_lines()` helper,
- Concurrent pipe-drain reader. MANDATORY whenever piping
w/o `capture_*` since nothing else drains the OS pipe —
child blocks on `write()` once kernel buf (~64KiB) fills
→ deadlock.
- Modes (combine freely): `emit`-only live relay,
`accum`-only silent drain+capture (for the CPE note),
or both. Per-line splitting handles cross-chunk
residuals + flushes any trailing un-newline-term'd line
at EOF.
`_add_stderr_note()` helper,
- Attaches an indented `|_.stderr:` note to a CPE via
`add_note()` for legible rc!=0 reporting at teardown.
Tests (`tests/trionics/test_subproc.py`),
- Hermetic `trio`-only (no actor-runtime).
- `test_stdout_relayed_per_line`: per-line stdout relay.
- `test_parent_tty_isolated`: child fd1 is OUR pipe (no
`/dev/pts/*`), fd0 pinned to `/dev/null`.
- `test_no_deadlock_on_big_unnewlined_output`: 200KiB
no-newline output completes under `fail_after(2)` —
exercises the concurrent drain (without it, the child
blocks at ~64KiB).
- `test_stderr_relay_and_cpe_rebuild`: rc!=0 w/
`relay_stderr=True` → bare `CalledProcessError` w/ the
`.stderr` note + per-line live relay.
- `test_nonrelay_cpe_note`: rc!=0 w/o relay → same
deterministic post-drain CPE w/ `.stderr` note (silent
drain+capture path).
Re-export `supervise_run_process` from `tractor.trionics`.
Prompt-IO: ai/prompt-io/claude/20260601T231429Z_0e3e008b_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit f595acc76c)
Follow-up note documenting why the deeper "Route B" fix
for `LogSpec`/`apply_logspec()` true per-leaf-MODULE level
control was NOT taken — in favor of the smaller
sub-PACKAGE fix that shipped in 9c36363b.
Doc covers,
- Status: what 9c36363b already gives (per-sub-pkg
control at any nesting depth, `devx.debug` ≠ `devx`)
vs. what remains unaddressed (per-leaf-mod levels,
top-level lib mods like `tractor.to_asyncio` on the
root logger).
- "Route B" sketch: make logger *identity* the full
dotted module path; mv the cosmetic leaf-trim out of
logger-naming into the *formatter's* `{name}`
rendering.
- 6 breaking-change costs: every logger name changes,
formatter rewrite, propagation/double-emit surface
grows, level-inheritance semantics shift,
`modden`/`piker` contract churn, `get_logger()`
refactor risk.
- Migration plan if pursued: extract a pure
`_mk_logger_name()` helper w/ an exhaustive name-shape
test matrix, swap `get_logger()` to use it for
identity, swap formatter to use the display string,
golden-diff rendered headers, coordinate w/
downstreams.
- "Route A" alternative: a `logging.Filter` keyed on
`record.module`/`pathname` for per-leaf control w/o
name churn — lower risk, narrower power.
- Recommendation: defer Route B; prefer Route A if
per-leaf is needed soon; the shipped sub-PKG fix
covers the common ask.
Lives under `ai/tooling-todos/` since it's a deferred-
work decision record, not a triage/conc-anal doc.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 5b3c2e3762)
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.
Also,
- extract inline `xfail` into module-level
`_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
larger N widens the race window enough to fire
occasionally.
- link to tracking issue #456.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 92443dc4ef)
Document the intermittent connect-refused failure in the registrar
daemon test — root cause is the `daemon` fixture's blind `time.sleep()`
readiness gate racing against the subproc's `bind()`/ `listen()`
completion. Distinct from the cancel- cascade `TooSlowError` flake
class.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 29f9928524)
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`
— particularly as of recent, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**,
- this obviously burns 100% CPU and no signal delivery.
Standalone repro:
from trio._core._wakeup_socketpair import WakeupSocketpair
ws = WakeupSocketpair()
ws.write_sock.close()
ws.drain() # spins forever
Patch is one-line — break the drain loop on b'' EOF.
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it with a SIGALRM wall-clock cap so a regression hangs
loud instead of silently.
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` IS the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
During the Phase A extraction of `trio_proc()` out of
`spawn._spawn` into its own submod, the
`debug.maybe_wait_for_debugger(child_in_debug=...)` call site in
the hard-reap `finally` got refactored from the original
`_runtime_vars.get('_debug_mode', ...)` (the fn parameter — the
dict that was constructed by the *parent* for the *child*'s
`SpawnSpec`) to `get_runtime_vars().get(...)` (a global getter that
returns the *parent's* live `_state`). Those are semantically
different — the first asks "is the child we just spawned in debug
mode?", the second asks "are *we* in debug mode?". Under
mixed-debug-mode trees the swap can incorrectly skip (or
unnecessarily delay) the debugger-lock wait during teardown.
Revert to the fn-parameter lookup and add an inline `NOTE` comment
calling out the distinction so it's harder to regress again.
Deats,
- `spawn/_trio.py`: `child_in_debug=get_runtime_vars().get(...)` →
`child_in_debug=_runtime_vars.get(...)` at the
`debug.maybe_wait_for_debugger(...)` call in the hard-reap block;
add 4-line `NOTE` explaining the parent-vs-child distinction.
- `spawn/__init__.py`: drop trailing whitespace after the
`'mp_forkserver'` docstring bullet.
- `ai/prompt-io/prompts/subints_spawner.md`: drop duplicated `with`
in `"as with with subprocs"` prose (copilot grammar catch).
Review: PR #444 (Copilot)
https://github.com/goodboy/tractor/pull/444#pullrequestreview-4165928469
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` design session that produced the phased plan
(A: modularize `_spawn`, B: `_subint` backend, C: harness) and concrete
Phase A file-split for #379. Substantive bc the plan directly drives
upcoming impl.
Prompt-IO: ai/prompt-io/claude/20260417T034918Z_9703210_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace verbose inline code dumps in `.raw.md`
entries with terse summaries and `git diff`
cmd references. Add `diff_cmd` metadata to each
entry's YAML frontmatter so readers can reproduce
the actual output diff.
Also,
- rename `multiaddr_declare_eps.md_` -> `.md`
(drop trailing `_` suffix)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New locality-aware addr preference for multihomed
actors: UDS > local TCP > remote TCP. Uses
`ipaddress` + `socket.getaddrinfo()` to detect
whether a `TCPAddress` is on the local host.
Deats,
- `_is_local_addr()` checks loopback or
same-host IPs via interface enumeration
- `prefer_addr()` classifies an addr list into
three tiers and picks the latest entry from
the highest-priority non-empty tier
- `query_actor()` and `wait_for_actor()` now
call `prefer_addr()` instead of grabbing
`addrs[-1]` or a single pre-selected addr
Also,
- `Registrar.find_actor()` returns full
`list[UnwrappedAddress]|None` so callers can
apply transport preference
Prompt-IO: ai/prompt-io/claude/20260414T163300Z_befedc49_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Provide a service-table parsing API for downstream projects (like
`piker`) to declare per-actor transport bind addresses as a config map
of actor-name -> multiaddr strings (e.g. from a TOML `[network]`
section).
Deats,
- `EndpointsTable` type alias: input `dict[str, list[str|tuple]]`.
- `ParsedEndpoints` type alias: output `dict[str, list[Address]]`.
- `parse_endpoints()` iterates the table and delegates each entry to the
existing `tractor.discovery._discovery.wrap_address()` helper, which
handles maddr strings, raw `(host, port)` tuples, and pre-wrapped
`Address` objs.
- UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's
internal `/uds/` proto_key)
Also add new tests,
- 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts,
unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`),
empty table, empty actor list
- all 22 multiaddr tests pass rn.
Prompt-IO:
ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add 9 test variants (6 fns) covering all three
`tpt_bind_addrs` code paths in `open_root_actor()`:
- registrar w/ explicit bind (eq, subset, disjoint)
- non-registrar w/ explicit bind (same/diff
bindspace) using `daemon` fixture
- non-registrar default random bind (baseline)
- maddr string input parsing
- registrar merge produces union
- `open_nursery()` forwards `tpt_bind_addrs`
Fix type-mixing bug at `_root.py:446` where the
registrar merge path did `set(Address + tuple)`,
preventing dedup and causing double-bind `OSError`.
Wrap `uw_reg_addrs` before the set union so both
sides are `Address` objs.
Also,
- add prompt-io output log for this session
- stage original prompt input for tracking
Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock pops
in the exiting caller's task but resource cleanup
runs in the `run_ctx` task inside `service_tn`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd` or wtv lands on the `maybe_open_ctx_locking` or thereafter
patch branch).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.
Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md