New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.
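A minimal sketch of the "race the handshake against proc death" idea, using stdlib `asyncio` as a stand-in for the real `trio`-based helper. The names `wait_for_peer_or_death`, `ProcDiedBeforeHandshake` and `demo` are hypothetical and not taken from the codebase:

```python
import asyncio


class ProcDiedBeforeHandshake(RuntimeError):
    """Hypothetical error: the child exited before connecting back."""


async def wait_for_peer_or_death(
    peer_connected: asyncio.Event,
    proc_exit: asyncio.Future,
) -> None:
    # Wait on BOTH outcomes; whichever completes first wins.
    peer_task = asyncio.ensure_future(peer_connected.wait())
    try:
        await asyncio.wait(
            {peer_task, proc_exit},
            return_when=asyncio.FIRST_COMPLETED,
        )
        if peer_task.done():
            return  # handshake arrived, carry on as usual
        # Otherwise the process exited first: surface its return code
        # instead of parking forever on the handshake wait.
        raise ProcDiedBeforeHandshake(
            f'child exited rc={proc_exit.result()}'
        )
    finally:
        peer_task.cancel()  # no-op if the task already finished


async def demo(rc: int = 2) -> str:
    loop = asyncio.get_running_loop()
    proc_exit = loop.create_future()
    # Simulate a child that dies at boot without ever connecting.
    loop.call_later(0.01, proc_exit.set_result, rc)
    try:
        await wait_for_peer_or_death(asyncio.Event(), proc_exit)
    except ProcDiedBeforeHandshake as err:
        return str(err)
    return 'connected'
```

Running `asyncio.run(demo())` reports the simulated rc=2 death rather than hanging.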
Also,
- extract inline `xfail` into module-level
`_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
larger N widens the race window enough to fire
occasionally.
- link to tracking issue #456.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 92443dc4ef)
Document the intermittent connect-refused failure in the registrar
daemon test — root cause is the `daemon` fixture's blind `time.sleep()`
readiness gate racing against the subproc's `bind()`/`listen()`
completion. Distinct from the cancel-cascade `TooSlowError` flake
class.
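For illustration only, one common alternative to a blind sleep is to poll the listen address until a connect succeeds. This is a sketch, not the fixture's actual fix; `wait_until_listening` and its parameters are hypothetical:

```python
import socket
import time


def wait_until_listening(
    host: str,
    port: int,
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    '''
    Poll `(host, port)` until a TCP connect succeeds or `timeout`
    seconds elapse; return whether the peer became reachable.

    '''
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect implies bind()/listen() completed.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused (or similar): not ready yet, retry.
            time.sleep(interval)
    return False
```

A fixture could call this after launching the subproc and fail fast when it returns `False`.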
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 29f9928524)
With a seminal patch fixing `trio`'s `WakeupSocketpair.drain()` which
can busy-loop due to lack of handling `EOF`.
New `tractor.trionics.patches` subpkg housing defensive monkey-patches
for upstream `trio` bugs we've encountered while running `tractor`,
particularly, as of late, fork-survival edge cases that haven't been
filed/fixed upstream yet. Each patch is idempotent, version-gated via
`is_needed()`, and carries a `# REMOVE WHEN:` marker pointing at the
upstream release whose adoption allows deletion.
Subpkg layout + per-patch contract documented in
`tractor/trionics/patches/README.md` — `apply()` / `is_needed()`
/ `repro()` API, registry pattern via `_PATCHES` in `__init__.py`,
single-call entry point `apply_all()`.
First patch, `_wakeup_socketpair`:
- `trio`'s `WakeupSocketpair.drain()` loops on `recv(64KB)` and exits
ONLY on `BlockingIOError`, NEVER on `recv() == b''` (peer-closed FIN).
- under `fork()`-spawning backends the COW-inherited socketpair fds
& `_close_inherited_fds()` teardown can leave a `WakeupSocketpair`
instance whose write-end is closed, and `drain()` then **spins forever
in C with no Python checkpoints**.
- this burns 100% CPU and prevents signal delivery.
Standalone repro:
    from trio._core._wakeup_socketpair import WakeupSocketpair
    ws = WakeupSocketpair()
    ws.write_sock.close()
    ws.drain()  # spins forever
Patch is one-line — break the drain loop on b'' EOF.
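The EOF check can be illustrated on a plain stdlib socketpair. This is a standalone sketch of the technique, not the patched `trio` method itself; the free function `drain()` and its byte-count return value are hypothetical:

```python
import socket


def drain(sock: socket.socket) -> int:
    '''
    Drain a NON-blocking socket and return the number of bytes read.

    Stops on `BlockingIOError` (nothing left to read) OR on `b''`
    (peer closed), the latter being the case the unpatched loop misses.

    '''
    total = 0
    try:
        while True:
            data = sock.recv(2**16)
            if not data:
                # b'' == EOF: the write end is closed so no more data
                # can ever arrive; looping again would spin forever.
                break
            total += len(data)
    except BlockingIOError:
        pass
    return total
```

With the write end closed, `drain()` returns immediately instead of busy-looping.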
Manifested as two distinct test failures:
- `tests/test_multi_program.py::test_register_duplicate_name` hung at
100% CPU on the busy-loop directly (fork child's worker thread)
- `tests/test_infected_asyncio.py::test_aio_simple_error` Mode-A
deadlock — busy-loop wedged trio's scheduler inside `start_guest_run`,
both threads parked in `epoll_wait`, no TCP connect-back to parent
ever happened.
Same patch fixes both. Restored 99.7% pass rate on full
suite under `--spawn-backend=main_thread_forkserver`
(was hanging indefinitely before).
Wired into `tractor._child._actor_child_main` via `apply_all()` BEFORE
any trio runtime init. Harmless on non-fork backends.
Conc-anal write-ups, including strace + py-spy evidence:
- `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md`
- `ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md`
Regression tests in `tests/trionics/test_patches.py`: each test asserts
(a) the bug exists pre-patch (or is fixed upstream — skip cleanly), (b)
the patch fixes it, with a SIGALRM wall-clock cap so a regression fails
loudly instead of hanging silently.
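One way such a wall-clock cap can be built (Unix only, main thread only) is sketched below. The `wall_clock_cap` name is hypothetical and the real tests may arm the alarm differently; note that a Python-level handler like this one only fires once the interpreter gets a chance to run signal handlers:

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def wall_clock_cap(seconds: float):
    '''
    Raise `TimeoutError` in the main thread if the body runs longer
    than `seconds` of wall-clock time.

    '''
    def _on_alarm(signum, frame):
        raise TimeoutError(f'exceeded {seconds}s wall-clock cap')

    prev_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, prev_handler)
```

A test body wrapped in `with wall_clock_cap(5):` then errors out instead of hanging the whole run.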
TODO:
- [ ] file the upstream `python-trio/trio` issue + PR.
- [ ] use the `repro()` callable in `_wakeup_socketpair.py` as the issue
body's evidence section.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 0ef549fadb)
(factored: dropped spawn-backend-only paths: ai/conc-anal/infected_asyncio_under_main_thread_forkserver_hang_issue.md)
During the Phase A extraction of `trio_proc()` out of
`spawn._spawn` into its own submod, the
`debug.maybe_wait_for_debugger(child_in_debug=...)` call site in
the hard-reap `finally` got refactored from the original
`_runtime_vars.get('_debug_mode', ...)` (the fn parameter — the
dict that was constructed by the *parent* for the *child*'s
`SpawnSpec`) to `get_runtime_vars().get(...)` (a global getter that
returns the *parent's* live `_state`). Those are semantically
different — the first asks "is the child we just spawned in debug
mode?", the second asks "are *we* in debug mode?". Under
mixed-debug-mode trees the swap can incorrectly skip (or
unnecessarily delay) the debugger-lock wait during teardown.
Revert to the fn-parameter lookup and add an inline `NOTE` comment
calling out the distinction so it's harder to regress again.
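The difference between the two lookups can be shown with plain dicts. Everything here is a toy stand-in (module-level `_parent_state`, the two `child_in_debug_*` helpers), assuming a non-debug parent that spawns a debug-mode child:

```python
# Toy stand-in for the PARENT's live runtime state.
_parent_state = {'_debug_mode': False}


def get_runtime_vars() -> dict:
    # Global getter: always answers for the calling (parent) process.
    return _parent_state


def child_in_debug_regressed(_runtime_vars: dict) -> bool:
    # Regressed form: ignores the param, asks "are WE in debug mode?"
    return get_runtime_vars().get('_debug_mode', False)


def child_in_debug_fixed(_runtime_vars: dict) -> bool:
    # Fixed form: asks "is the CHILD we spawned in debug mode?"
    return _runtime_vars.get('_debug_mode', False)


# Vars the parent built for the child's spawn spec (mixed-mode tree).
child_vars = {'_debug_mode': True}
```

Here the regressed lookup answers `False` for a child that is in fact in debug mode, which is the case where the debugger-lock wait would be skipped.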
Deats,
- `spawn/_trio.py`: `child_in_debug=get_runtime_vars().get(...)` →
`child_in_debug=_runtime_vars.get(...)` at the
`debug.maybe_wait_for_debugger(...)` call in the hard-reap block;
add 4-line `NOTE` explaining the parent-vs-child distinction.
- `spawn/__init__.py`: drop trailing whitespace after the
`'mp_forkserver'` docstring bullet.
- `ai/prompt-io/prompts/subints_spawner.md`: drop duplicated `with`
in `"as with with subprocs"` prose (copilot grammar catch).
Review: PR #444 (Copilot)
https://github.com/goodboy/tractor/pull/444#pullrequestreview-4165928469
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` design session that produced the phased plan
(A: modularize `_spawn`, B: `_subint` backend, C: harness) and concrete
Phase A file-split for #379. Substantive bc the plan directly drives
upcoming impl.
Prompt-IO: ai/prompt-io/claude/20260417T034918Z_9703210_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace verbose inline code dumps in `.raw.md`
entries with terse summaries and `git diff`
cmd references. Add `diff_cmd` metadata to each
entry's YAML frontmatter so readers can reproduce
the actual output diff.
Also,
- rename `multiaddr_declare_eps.md_` -> `.md`
(drop trailing `_` suffix)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New locality-aware addr preference for multihomed
actors: UDS > local TCP > remote TCP. Uses
`ipaddress` + `socket.getaddrinfo()` to detect
whether a `TCPAddress` is on the local host.
Deats,
- `_is_local_addr()` checks loopback or
same-host IPs via interface enumeration
- `prefer_addr()` classifies an addr list into
three tiers and picks the latest entry from
the highest-priority non-empty tier
- `query_actor()` and `wait_for_actor()` now
call `prefer_addr()` instead of grabbing
`addrs[-1]` or a single pre-selected addr
Also,
- `Registrar.find_actor()` returns full
`list[UnwrappedAddress]|None` so callers can
apply transport preference
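A simplified sketch of the three-tier selection, assuming addrs are plain tuples like `('uds', path)` or `('tcp', host, port)`. That tuple shape, `is_local_host()` and `_local_ips()` are illustrative assumptions and not the real `Address` types or helper names:

```python
import ipaddress
import socket


def _local_ips() -> set[str]:
    # Best-effort: IPs that this host's own name resolves to.
    ips: set[str] = set()
    try:
        for info in socket.getaddrinfo(socket.gethostname(), None):
            ips.add(info[4][0])
    except OSError:
        pass
    return ips


def is_local_host(host: str) -> bool:
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; treated as remote here
    return ip.is_loopback or host in _local_ips()


def prefer_addr(addrs: list[tuple]) -> tuple:
    '''
    Pick the last entry of the best non-empty tier:
    UDS > local TCP > remote TCP.

    '''
    uds, local_tcp, remote_tcp = [], [], []
    for addr in addrs:
        if addr[0] == 'uds':
            uds.append(addr)
        elif is_local_host(addr[1]):
            local_tcp.append(addr)
        else:
            remote_tcp.append(addr)
    for tier in (uds, local_tcp, remote_tcp):
        if tier:
            return tier[-1]
    raise ValueError('no addrs to choose from')
```

Given a UDS, a loopback TCP and a remote TCP addr, the UDS entry wins; dropping it falls back to the loopback one.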
Prompt-IO: ai/prompt-io/claude/20260414T163300Z_befedc49_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Provide a service-table parsing API for downstream projects (like
`piker`) to declare per-actor transport bind addresses as a config map
of actor-name -> multiaddr strings (e.g. from a TOML `[network]`
section).
Deats,
- `EndpointsTable` type alias: input `dict[str, list[str|tuple]]`.
- `ParsedEndpoints` type alias: output `dict[str, list[Address]]`.
- `parse_endpoints()` iterates the table and delegates each entry to the
existing `tractor.discovery._discovery.wrap_address()` helper, which
handles maddr strings, raw `(host, port)` tuples, and pre-wrapped
`Address` objs.
- UDS maddrs use the multiaddr spec name `/unix/...` (not tractor's
internal `/uds/` proto_key)
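To make the table shape concrete, here is a toy parser that handles only `/ip4|ip6/<host>/tcp/<port>` and `/unix/<path>` strings plus raw `(host, port)` tuples. It returns plain tuples rather than `Address` objs and does not call `wrap_address()`; `parse_maddr()` and the actor names in the usage note are made up for the example:

```python
def parse_maddr(maddr: str) -> tuple:
    # Toy multiaddr parser covering just two shapes.
    parts = maddr.strip('/').split('/')
    if parts[0] == 'unix':
        # Everything after `/unix` is the filesystem path.
        return ('unix', '/' + '/'.join(parts[1:]))
    if (
        len(parts) == 4
        and parts[0] in ('ip4', 'ip6')
        and parts[2] == 'tcp'
    ):
        return ('tcp', parts[1], int(parts[3]))
    raise ValueError(f'unsupported multiaddr: {maddr!r}')


def parse_endpoints(
    table: dict[str, list],
) -> dict[str, list[tuple]]:
    '''
    Map actor-name -> list of parsed addrs; entries may be maddr
    strings or raw `(host, port)` tuples.

    '''
    return {
        name: [
            parse_maddr(entry) if isinstance(entry, str)
            else ('tcp', *entry)
            for entry in entries
        ]
        for name, entries in table.items()
    }
```

For example, `parse_endpoints({'chart': ['/ip4/127.0.0.1/tcp/1616']})` yields one TCP entry for the (hypothetical) `chart` actor, while a `/udp/` maddr raises `ValueError`.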
Also add new tests,
- 7 new pure unit tests (no trio runtime): TCP-only, mixed tpts,
unwrapped tuples, mixed str+tuple, unsupported proto (`/udp/`),
empty table, empty actor list
- all 22 multiaddr tests pass rn.
Prompt-IO:
ai/prompt-io/claude/20260413T205048Z_269d939c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add 9 test variants (6 fns) covering all three
`tpt_bind_addrs` code paths in `open_root_actor()`:
- registrar w/ explicit bind (eq, subset, disjoint)
- non-registrar w/ explicit bind (same/diff
bindspace) using `daemon` fixture
- non-registrar default random bind (baseline)
- maddr string input parsing
- registrar merge produces union
- `open_nursery()` forwards `tpt_bind_addrs`
Fix type-mixing bug at `_root.py:446` where the
registrar merge path did `set(Address + tuple)`,
preventing dedup and causing double-bind `OSError`.
Wrap `uw_reg_addrs` before the set union so both
sides are `Address` objs.
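The dedup failure is easy to reproduce with a stand-in type. The frozen `Address` dataclass below is a simplification for illustration, not tractor's real class:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    # Simplified stand-in for a wrapped transport address.
    host: str
    port: int


bind_addrs = [Address('127.0.0.1', 1616)]
uw_reg_addrs = [('127.0.0.1', 1616)]  # same endpoint, still unwrapped

# Buggy merge: an `Address` never compares equal to a tuple, so the
# set keeps both entries and the same endpoint gets bound twice.
mixed = set(bind_addrs + uw_reg_addrs)

# Fixed merge: wrap first so both sides are `Address` objs and the
# set union dedups as intended.
wrapped = set(bind_addrs + [Address(*uw) for uw in uw_reg_addrs])
```

In this sketch `mixed` holds two entries for one endpoint, while `wrapped` holds one.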
Also,
- add prompt-io output log for this session
- stage original prompt input for tracking
Prompt-IO: ai/prompt-io/claude/20260413T192116Z_f851f28_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Documents the diagnostic session tracing why
per-`ctx_key` locking alone doesn't close the
`_Cache.run_ctx` teardown race — the lock pops
in the exiting caller's task but resource cleanup
runs in the `run_ctx` task inside `service_tn`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Reproduce the piker `open_cached_client('kraken')` scenario: identical
`ctx_key` callers share one cached resource, and a new task re-enters
during `__aexit__` — hitting `assert not resources.get()` bc `values`
was popped but `resources` wasn't yet.
Deats,
- `test_moc_reentry_during_teardown` uses an `in_aexit` event to
deterministically land in the teardown window.
- marked `xfail(raises=AssertionError)` against unpatched code (fix in
`9e49eddd`, or whatever lands on the `maybe_open_ctx_locking` patch
branch or thereafter).
Also, add prompt-io log for the session.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T193125Z_85f9c5d_prompt_io.md
Add `test_per_ctx_key_resource_lifecycle` to verify that per-key user
tracking correctly tears down resources independently - exercises the
fix from 02b2ef18 where a global `_Cache.users` counter caused stale
cache hits when the same `acm_func` was called with different kwargs.
Also, add a paired `acm_with_resource()` helper `@acm` that yields its
`resource_id` for per-key testing in the above suite.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Prompt-IO: ai/prompt-io/claude/20260406T172848Z_02b2ef1_prompt_io.md