From c041518bdb4aa25fb182ace4bb8cb8c7d49379ab Mon Sep 17 00:00:00 2001 From: goodboy Date: Sat, 18 Apr 2026 16:04:57 -0400 Subject: [PATCH] Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code --- .../20260418T042526Z_26fb820_prompt_io.md | 117 ++++++++++ .../20260418T042526Z_26fb820_prompt_io.raw.md | 220 ++++++++++++++++++ 2 files changed, 337 insertions(+) create mode 100644 ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md create mode 100644 ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md diff --git a/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md new file mode 100644 index 00000000..0cf0e41d --- /dev/null +++ b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md @@ -0,0 +1,117 @@ +--- +model: claude-opus-4-7[1m] +service: claude +session: subints-phase-b2-destroy-race-fix +timestamp: 2026-04-18T04:25:26Z +git_ref: 26fb820 +scope: code +substantive: true +raw_file: 20260418T042526Z_26fb820_prompt_io.raw.md +--- + +## Prompt + +Follow-up to Phase B.2 (`5cd6df58`) after the user +observed intermittent mid-suite hangs when running +the tractor test suite under `--spawn-backend=subint` +on py3.14. The specific sequence of prompts over +several turns: + +1. User pointed at the `test_context_stream_semantics.py` + suite as the first thing to make run clean under + `--spawn-backend=subint`. +2. 
After a series of `timeout`-terminated runs that + gave no diagnostic info, user nudged me to stop + relying on `timeout` and get actual runtime + diagnostics ("the suite hangs indefinitely, so i + don't think this `timeout 30` is helping you at + all.."). Switched to + `faulthandler.dump_traceback_later(...)` and a + resource-tracker fixture to rule out leaks. +3. Captured a stack pinning the hang on + `_interpreters.destroy(interp_id)` in the subint + teardown finally block. +4. Proposed dedicated-OS-thread fix. User greenlit. +5. Implemented + verified on-worktree; user needed + to be pointed at the *worktree*'s `./py313` venv + because bare `pytest` was picking up the main + repo's venv (running un-patched `_subint.py`) and + still hanging. + +Running theme over the whole exchange: this patch +only closes the *destroy race*. The user and I also +traced through the deeper cancellation story — SIGINT +can't reach subints, legacy-mode shares the GIL, +portal-cancel dies when the IPC channel is already +broken — and agreed the next step is a bounded +hard-kill in `subint_proc`'s teardown plus a +dedicated cancellation test suite. Those land as +separate commits. + +## Response summary + +Produced the `tractor/spawn/_subint.py` patch that +landed as commit `26fb8206` ("Fix subint destroy race +via dedicated OS thread"). One file, +110/-84 LOC. + +Mechanism: swap `trio.to_thread.run_sync(_interpreters.exec, ...)` +for a plain `threading.Thread(target=..., daemon=False)`. +The trio thread cache recycles workers — so the OS +thread that ran `_interpreters.exec()` remained alive +in the cache holding a stale subint tstate, blocking +`_interpreters.destroy()` in the finally indefinitely. +A dedicated one-shot thread exits naturally after +the sync target returns, releasing its tstate and +unblocking destroy. 
+ +Coordination across the trio↔thread boundary: +- `trio.lowlevel.current_trio_token()` captured at + `subint_proc` entry +- driver thread signals `subint_exited.set()` back + to parent trio via `trio.from_thread.run_sync(..., + trio_token=token)` (synchronous from the thread's + POV; the call returns after trio has run `.set()`) +- `trio.RunFinishedError` swallowed in that path for + the process-teardown case where parent trio already + exited +- teardown `finally` off-loads the sync + `driver_thread.join()` via `to_thread.run_sync` (a + cache thread carries no subint tstate — safe) + +## Files changed + +See `git diff 26fb820~1..26fb820 --stat`: + +``` + tractor/spawn/_subint.py | 194 +++++++++++++++++++------ + 1 file changed, 110 insertions(+), 84 deletions(-) +``` + +Validation: +- `test_parent_cancels[chk_ctx_result_before_exit=True-cancel_method=ctx-child_returns_early=False]` + (the specific test that was hanging for the user) + — passed in 1.06s. +- Full `tests/test_context_stream_semantics.py` under + subint — 61 passed in 100.35s (clean-cache re-run: + 100.82s). +- Trio backend regression subset — 69 passed / 1 + skipped / 89.19s — no regressions from this change. + +## Non-code output + +Beyond the `_subint.py` patch, the raw log also +records the cancellation-semantics research that +spanned this conversation but did not ship as code +in *this* commit. Preserving it inline under +"Non-code output" because it directly informs the +Phase B.3 hard-kill impl that will follow (and any +upstream CPython bug reports we end up filing). + +## Human edits + +None — committed as generated. The commit message +itself was also AI-drafted via `/commit-msg` and +rewrapped via the project's `rewrap.py --width 67` +tooling; user landed it without edits. 
diff --git a/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md new file mode 100644 index 00000000..6b5da293 --- /dev/null +++ b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md @@ -0,0 +1,220 @@ +--- +model: claude-opus-4-7[1m] +service: claude +timestamp: 2026-04-18T04:25:26Z +git_ref: 26fb820 +diff_cmd: git diff 26fb820~1..26fb820 +--- + +Code generated in this turn was committed as `26fb8206` +("Fix subint destroy race via dedicated OS thread"). +Per diff-ref mode, per-file code is captured via the +pointer below; the rest is verbatim non-code output +— test-session diagnostics, CPython-internals +reasoning, and the design synthesis that's worth +keeping alongside the code since it informs the +still-pending Phase B.3 hard-kill work. + +## Per-file generated content + +### `tractor/spawn/_subint.py` (modified, +110/-84 LOC) + +> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py` + +Rewrites the subint driver-thread strategy: +- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)` + with a plain `threading.Thread(target=_subint_target, + daemon=False)` so the OS thread truly exits after + `_interpreters.exec()` returns +- captures a `trio.lowlevel.current_trio_token()` at + `subint_proc` entry; the driver thread signals + completion back via `trio.from_thread.run_sync( + subint_exited.set, trio_token=...)` +- swallows `trio.RunFinishedError` in the signal path + for the case where the parent trio loop has already + exited (process teardown) +- in the teardown `finally` off-loads the sync + `driver_thread.join()` call to `trio.to_thread.run_sync` + (a cache thread w/ no subint tstate — so no cache + conflict) to wait for the driver thread to fully + exit before calling `_interpreters.destroy()` + +## Non-code output (verbatim) — the CPython-internals research + +### What went wrong before this commit + +Under `--spawn-backend=subint` on py3.14, most single 
+tests passed but longer runs hung intermittently. The +position of the hang moved between runs (test #22 on +one run, test #53 on another), suggesting a +timing-dependent race rather than a deterministic bug. + +`faulthandler.dump_traceback_later()` eventually +caught a stack with the main thread blocked in +`_interpreters.destroy(interp_id)` at `_subint.py:293`. +Only 2 threads were alive: +- main thread waiting in `_interpreters.destroy()` +- one idle trio thread-cache worker in + `trio._core._thread_cache._work` + +No subint was still running (`_interpreters.list_all()` +showed only the main interp). A resource-tracker +pytest fixture confirmed threads/subints did NOT +accumulate across tests — this was not a leak but a +specific "destroy blocks on cached thread w/ stale +tstate" race. + +### Why the race exists + +`trio.to_thread.run_sync` uses a thread *cache* to +avoid OS-thread creation overhead. When the sync +callable returns, the OS thread is NOT terminated — +it's parked in `_thread_cache._work` waiting for the +next job. CPython's subinterpreter implementation +attaches a **tstate** (thread-state object) to each +OS thread that ever entered a subint via +`_interpreters.exec()`. That tstate is released +lazily — either when the thread picks up a new job +(which re-attaches a new tstate, evicting the old +one) or when the thread truly exits. + +`_interpreters.destroy(interp_id)` waits for *all* +tstates associated w/ that subint to be released +before it can proceed. If the cached worker is idle +holding the stale tstate, destroy blocks indefinitely. +Whether the race manifests depends on timing — if +the cached thread happens to pick up another job +quickly, destroy unblocks; if it sits idle, we hang. + +### Why a dedicated `threading.Thread` fixes it + +A plain `threading.Thread(target=_subint_target, +daemon=False)` runs its target once and exits. 
When +the target returns, OS-thread teardown (`_bootstrap_inner` +→ `_bootstrap`) fires and CPython releases the +tstate for that thread. `_interpreters.destroy()` +then has no blocker. + +### Diagnostic tactics that actually helped + +1. `faulthandler.dump_traceback_later(n, repeat=False, + file=open(path, 'w'))` to capture stack dumps on + hang. Critically, write the dump to a `file=` + handle, not stderr — pytest captures stderr weirdly + and the dump is easy to miss. +2. A resource-tracker autouse fixture printing + per-test `threading.active_count()` + + `len(_interpreters.list_all())` deltas → ruled out + leak-accumulation theories quickly. +3. Running the hanging test *solo* vs in-suite — + when solo passes but in-suite hangs, you know + it's a cross-test state-transfer bug rather than + a test-internal bug. + +### Design synthesis — SIGINT + subints + SC + +The user and I walked through the cancellation +semantics of PEP 684/734 subinterpreters in detail. +Key findings we want to preserve: + +**Signal delivery in subints (stdlib limitation).** +CPython's signal machinery only delivers signals +(SIGINT included) to the *main thread of the main +interpreter*. Subints cannot install signal handlers +that will ever fire. This is an intentional design +choice in PEP 684 and not expected to change. For +tractor's subint actors, this means: + +- Ctrl-C never reaches a subint directly. +- `trio.run()` running on a worker thread (as we do + for subints) already skips SIGINT handler install + because `signal.signal()` raises on non-main + threads. +- The only cancellation surface into a subint is + our IPC `Portal.cancel_actor()`. + +**Legacy-mode subints share the main GIL** (the mode +our impl uses, since `msgspec` lacks PEP 684 support +per `jcrist/msgspec#563`). This means a stuck subint +thread can starve the parent's trio loop during +cancellation — the parent can't even *start* its +teardown handling until the subint yields the GIL. 
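The `signal.signal()` restriction noted in the bullets above is easy to demonstrate with the stdlib alone (a standalone demo, not tractor's actual handler-install path):

```python
# Demonstrates why a worker-thread-hosted `trio.run()` must skip its
# SIGINT handler install: CPython refuses handler registration from
# any thread other than the main thread, raising ValueError.
import signal
import threading

results: list[str] = []

def try_install() -> None:
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        results.append('installed')
    except ValueError as exc:
        # e.g. "signal only works in main thread of the main interpreter"
        results.append(f'refused: {exc}')

t = threading.Thread(target=try_install)
t.start()
t.join()
print(results[0])
```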
+ +**Failure modes identified for Phase B.3 audit:** + +1. Portal cancel lands cleanly → subint unwinds → + thread exits → destroy succeeds. (Happy path.) +2. IPC channel is already broken when we try to + send cancel (e.g., `test_ipc_channel_break_*`) + → cancel raises `BrokenResourceError` → subint + keeps running unaware → parent hangs waiting for + `subint_exited`. This is what breaks + `test_advanced_faults.py` under subint. +3. Subint is stuck in non-checkpointing Python code + → portal-cancel msg queued but never processed. +4. Subint is in a shielded cancel scope when cancel + arrives → delay until shield exits. + +**Current teardown has a shield-bug too:** +`trio.CancelScope(shield=True)` wrapping the `finally` +block absorbs Ctrl-C, so even when the user tries +to break out they can't. This is the reason +`test_ipc_channel_break_during_stream[break_parent-...no_msgstream_aclose]` +locks up, unkillable. + +**B.3 hard-kill fix plan (next commit):** + +1. Bound `driver_thread.join()` with + `trio.move_on_after(HARD_KILL_TIMEOUT)`. +2. If it times out, log a warning naming the + `interp_id`. Flipping the thread to `daemon=True` + after `start()` raises `RuntimeError`, so instead + create it as `daemon=True` upfront and accept the + tradeoff that proc-exit won't wait for a stuck + subint. +3. Best-effort `_interpreters.destroy()`; catch the + `InterpreterError` if the subint is still running. +4. Document that the leak is real and the only + escape hatch we have without upstream cooperation. 
+ +**Test plan for Phase B.3:** + +New `tests/test_subint_cancellation.py` covering: +- SIGINT at spawn +- SIGINT mid-portal-RPC +- SIGINT during shielded section in subint +- Dead-channel cancel (a minimized mirror of + `test_ipc_channel_break_during_stream`) +- Non-checkpointing subint (tight `while True` in + user code) +- Per-test `pytest-timeout`-style bounds so the + tests visibly fail instead of wedging the runner + +### Sanity-check output (verbatim terminal excerpts) + +Post-fix single-test validation: +``` +1 passed, 1 warning in 1.06s +``` +(same test that was hanging pre-fix: +`test_parent_cancels[...cancel_method=ctx-...False]`) + +Full `tests/test_context_stream_semantics.py` +under subint: +``` +61 passed, 1 warning in 100.35s (0:01:40) +``` +and a clean-cache re-run: +``` +61 passed, 1 warning in 100.82s (0:01:40) +``` + +No regressions on trio backend (same subset): +``` +69 passed, 1 skipped, 3 warnings in 89.19s +``` + +### Commit msg + +Also AI-drafted via `/commit-msg` + `rewrap.py +--width 67`. See `git log -1 26fb820`.