Add prompt-IO log for subint destroy-race fix

Log the `claude-opus-4-7` session that produced the `_subint.py`
dedicated-thread fix (`26fb8206`). Substantive because the patch
was entirely AI-generated; the raw log also preserves the
CPython-internals research informing the Phase B.3 hard-kill work.

Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md

(this commit msg was generated in part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
branch: subint_spawner_backend
parent: 3869a9b468
commit: b48503085f

ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md
@@ -0,0 +1,117 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subints-phase-b2-destroy-race-fix
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
scope: code
substantive: true
raw_file: 20260418T042526Z_26fb820_prompt_io.raw.md
---

## Prompt

Follow-up to Phase B.2 (`5cd6df58`) after the user
observed intermittent mid-suite hangs when running
the tractor test suite under `--spawn-backend=subint`
on py3.14. The specific sequence of prompts over
several turns:

1. User pointed at the `test_context_stream_semantics.py`
   suite as the first thing to make run clean under
   `--spawn-backend=subint`.
2. After a series of `timeout`-terminated runs that
   gave no diagnostic info, user nudged me to stop
   relying on `timeout` and get actual runtime
   diagnostics ("the suite hangs indefinitely, so i
   don't think this `timeout 30` is helping you at
   all.."). Switched to
   `faulthandler.dump_traceback_later(...)` and a
   resource-tracker fixture to rule out leaks.
3. Captured a stack pinning the hang on
   `_interpreters.destroy(interp_id)` in the subint
   teardown finally block.
4. Proposed dedicated-OS-thread fix. User greenlit.
5. Implemented + verified on-worktree; user needed
   to be pointed at the *worktree*'s `./py313` venv
   because bare `pytest` was picking up the main
   repo's venv (running un-patched `_subint.py`) and
   still hanging.

Running theme over the whole exchange: this patch
only closes the *destroy race*. The user and I also
traced through the deeper cancellation story — SIGINT
can't reach subints, legacy-mode shares the GIL,
portal-cancel dies when the IPC channel is already
broken — and agreed the next step is a bounded
hard-kill in `subint_proc`'s teardown plus a
dedicated cancellation test suite. Those land as
separate commits.
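The resource-tracker fixture from step 2 isn't reproduced in the
log; a minimal stdlib sketch of the same idea (the `track_threads`
name and label are illustrative, not the project's actual fixture):

```python
import contextlib
import threading


@contextlib.contextmanager
def track_threads(label: str):
    # Snapshot the live-thread count around a block of test code; a
    # steadily growing delta across tests would indicate leak
    # accumulation, while a stable count (as seen in the log) points
    # at a one-off race in teardown instead.
    before = threading.active_count()
    try:
        yield
    finally:
        delta = threading.active_count() - before
        print(f"{label}: thread delta {delta:+d}")


with track_threads("demo-test"):
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
# prints "demo-test: thread delta +0" since the joined thread is gone
```

The real fixture additionally printed `len(_interpreters.list_all())`
deltas, which needs py3.13+ and is omitted here.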
## Response summary

Produced the `tractor/spawn/_subint.py` patch landed
as commit `26fb8206` ("Fix subint destroy race via
dedicated OS thread"). One file, +110/-84 LOC.

Mechanism: swap `trio.to_thread.run_sync(_interpreters
.exec, ...)` for a plain `threading.Thread(target=...
, daemon=False)`. The trio thread cache recycles
workers — so the OS thread that ran `_interpreters
.exec()` remained alive in the cache holding a
stale subint tstate, blocking
`_interpreters.destroy()` in the finally indefinitely.
A dedicated one-shot thread exits naturally after
the sync target returns, releasing the tstate and
unblocking destroy.

Coordination across the trio↔thread boundary:
- `trio.lowlevel.current_trio_token()` captured at
  `subint_proc` entry
- driver thread signals `subint_exited.set()` back
  to parent trio via `trio.from_thread.run_sync(...,
  trio_token=token)` (synchronous from the thread's
  POV; the call returns after trio has run `.set()`)
- `trio.RunFinishedError` swallowed in that path for
  the process-teardown case where parent trio already
  exited
- teardown `finally` off-loads the sync
  `driver_thread.join()` via `to_thread.run_sync` (a
  cache thread carries no subint tstate — safe)
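The cache-retention behavior at the heart of the race can be seen
with stdlib pieces alone; here `ThreadPoolExecutor` stands in for
trio's thread cache and the no-op `target` for the blocking
`_interpreters.exec()` call (a sketch, not tractor's actual code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def target():
    pass  # stand-in for the blocking _interpreters.exec() call


# Pooled worker (the shape trio.to_thread.run_sync has): after the
# job returns, the OS thread is parked for reuse — it does NOT exit,
# so any per-thread state it holds (like a stale subint tstate)
# lingers until the thread is reused or truly exits.
pool = ThreadPoolExecutor(max_workers=1)
pool.submit(target).result()
pooled_alive = any(
    t.name.startswith("ThreadPoolExecutor") and t.is_alive()
    for t in threading.enumerate()
)

# Dedicated one-shot thread: exits as soon as the target returns, so
# its thread-local state is torn down promptly and a destroy() that
# waits on it can proceed.
t = threading.Thread(target=target, daemon=False)
t.start()
t.join()

print(pooled_alive, t.is_alive())  # → True False
pool.shutdown()
```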
## Files changed

See `git diff 26fb820~1..26fb820 --stat`:

```
tractor/spawn/_subint.py | 194 +++++++++++++++++++------------
1 file changed, 110 insertions(+), 84 deletions(-)
```

Validation:
- `test_parent_cancels[chk_ctx_result_before_exit=True-
  cancel_method=ctx-child_returns_early=False]`
  (the specific test that was hanging for the user)
  — passed in 1.06s.
- Full `tests/test_context_stream_semantics.py` under
  subint — 61 passed in 100.35s (clean-cache re-run:
  100.82s).
- Trio backend regression subset — 69 passed / 1
  skipped / 89.19s — no regressions from this change.
## Non-code output

Beyond the `_subint.py` patch, the raw log also
records the cancellation-semantics research that
spanned this conversation but did not ship as code
in *this* commit. It is preserved inline under the
raw file's "Non-code output" section because it
directly informs the Phase B.3 hard-kill impl that
will follow (and any upstream CPython bug reports
we end up filing).
## Human edits

None — committed as generated. The commit message
itself was also AI-drafted via `/commit-msg` and
rewrapped via the project's `rewrap.py --width 67`
tooling; user landed it without edits.
20260418T042526Z_26fb820_prompt_io.raw.md
@@ -0,0 +1,220 @@

---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
diff_cmd: git diff 26fb820~1..26fb820
---
Code generated in this turn was committed as `26fb8206`
("Fix subint destroy race via dedicated OS thread").
Per diff-ref mode, per-file code is captured via the
pointer below; the rest is verbatim non-code output
— test-session diagnostics, CPython-internals
reasoning, and the design synthesis that's worth
keeping alongside the code since it informs the
still-pending Phase B.3 hard-kill work.
## Per-file generated content

### `tractor/spawn/_subint.py` (modified, +110/-84 LOC)

> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py`

Rewrites the subint driver-thread strategy:
- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)`
  with a plain `threading.Thread(target=_subint_target,
  daemon=False)` so the OS thread truly exits after
  `_interpreters.exec()` returns
- captures a `trio.lowlevel.current_trio_token()` at
  `subint_proc` entry; the driver thread signals
  completion back via `trio.from_thread.run_sync(
  subint_exited.set, trio_token=...)`
- swallows `trio.RunFinishedError` in the signal path
  for the case where the parent trio loop has already
  exited (process teardown)
- in the teardown `finally`, off-loads the sync
  `driver_thread.join()` call to `trio.to_thread.run_sync`
  (a cache thread w/ no subint tstate — so no cache
  conflict) to wait for the driver thread to fully
  exit before calling `_interpreters.destroy()`
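Stripped of trio and the `_interpreters` module, the ordering the
bullets above describe — run, signal completion, join the thread,
only then destroy — reduces to this stdlib sketch (the `driver`
helper and `run_code` stand-in are illustrative names, not
tractor's):

```python
import threading

subint_exited = threading.Event()


def driver(run_code):
    try:
        run_code()           # stand-in for _interpreters.exec(...)
    finally:
        subint_exited.set()  # in the real patch this signal crosses
                             # back into trio via from_thread.run_sync


t = threading.Thread(target=driver, args=(lambda: None,), daemon=False)
t.start()
subint_exited.wait(timeout=5)  # parent: wait for subint code to finish
t.join(timeout=5)              # then wait for the OS thread to exit
# only now would it be safe to call _interpreters.destroy(interp_id)
print(subint_exited.is_set(), t.is_alive())  # → True False
```

The join matters: the event fires while the thread (and its tstate)
is still alive, so destroying on the event alone re-opens the race.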
## Non-code output (verbatim) — the CPython-internals research

### What went wrong before this commit

Under `--spawn-backend=subint` on py3.14, most single
tests passed but longer runs hung intermittently. The
position of the hang moved between runs (test #22 on
one run, test #53 on another), suggesting a timing-
dependent race rather than a deterministic bug.

`faulthandler.dump_traceback_later()` eventually
caught a stack with the main thread blocked in
`_interpreters.destroy(interp_id)` at `_subint.py:293`.
Only 2 threads were alive:
- main thread waiting in `_interpreters.destroy()`
- one idle trio thread-cache worker in
  `trio._core._thread_cache._work`

No subint was still running (`_interpreters.list_all()`
showed only the main interp). A resource-tracker
pytest fixture confirmed threads/subints did NOT
accumulate across tests — this was not a leak but a
specific "destroy blocks on cached thread w/ stale
tstate" race.
### Why the race exists

`trio.to_thread.run_sync` uses a thread *cache* to
avoid OS-thread creation overhead. When the sync
callable returns, the OS thread is NOT terminated —
it's parked in `_thread_cache._work` waiting for the
next job. CPython's subinterpreter implementation
attaches a **tstate** (thread-state object) to each
OS thread that ever entered a subint via
`_interpreters.exec()`. That tstate is released
lazily — either when the thread picks up a new job
(which re-attaches a new tstate, evicting the old
one) or when the thread truly exits.

`_interpreters.destroy(interp_id)` waits for *all*
tstates associated w/ that subint to be released
before it can proceed. If the cached worker is idle
holding the stale tstate, destroy blocks indefinitely.
Whether the race manifests depends on timing — if
the cached thread happens to pick up another job
quickly, destroy unblocks; if it sits idle, we hang.
### Why a dedicated `threading.Thread` fixes it

A plain `threading.Thread(target=_subint_target,
daemon=False)` runs its target once and exits. When
the target returns, OS-thread teardown (`_bootstrap_inner`
→ `_bootstrap`) fires and CPython releases the
tstate for that thread. `_interpreters.destroy()`
then has no blocker.
### Diagnostic tactics that actually helped

1. `faulthandler.dump_traceback_later(n, repeat=False,
   file=open(path, 'w'))` for captured stack dumps on
   hang. Critically, pipe to a `file=` not stderr —
   pytest captures stderr weirdly and the dump is
   easy to miss.
2. A resource-tracker autouse fixture printing
   per-test `threading.active_count()` +
   `len(_interpreters.list_all())` deltas → ruled out
   leak-accumulation theories quickly.
3. Running the hanging test *solo* vs in-suite —
   when solo passes but in-suite hangs, you know
   it's a cross-test state-transfer bug rather than
   a test-internal bug.
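Tactic 1 can be exercised standalone; a minimal sketch (the 0.2s
timeout, the 0.5s "hang", and the temp-file path are all
illustrative values):

```python
import faulthandler
import os
import tempfile
import time

# Arm a delayed dump: if the process is still running `timeout`
# seconds from now, every thread's stack is written to `file` —
# exactly what pinned the hang on _interpreters.destroy() in the log.
path = os.path.join(tempfile.mkdtemp(), "hang-dump.txt")
with open(path, "w") as f:
    faulthandler.dump_traceback_later(0.2, file=f)
    time.sleep(0.5)  # stand-in for the hung test run
    faulthandler.cancel_dump_traceback_later()

with open(path) as f:
    report = f.read()
print(report.startswith("Timeout"))  # → True
```

The dump bypasses Python-level I/O buffering (it writes to the raw
fd), which is also why pytest's stderr capture can swallow it when
no `file=` is given.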
### Design synthesis — SIGINT + subints + SC

The user and I walked through the cancellation
semantics of PEP 684/734 subinterpreters in detail.
Key findings we want to preserve:

**Signal delivery in subints (stdlib limitation).**
CPython's signal machinery only delivers signals
(SIGINT included) to the *main thread of the main
interpreter*. Subints cannot install signal handlers
that will ever fire. This is an intentional design
choice in PEP 684 and not expected to change. For
tractor's subint actors, this means:

- Ctrl-C never reaches a subint directly.
- `trio.run()` running on a worker thread (as we do
  for subints) already skips SIGINT handler install
  because `signal.signal()` raises on non-main
  threads.
- The only cancellation surface into a subint is
  our IPC `Portal.cancel_actor()`.
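The `signal.signal()` constraint in the second bullet is easy to
confirm from the stdlib alone (a sketch; the handler choice is
arbitrary):

```python
import signal
import threading

# Installing a signal handler from any thread other than the main
# thread of the main interpreter raises ValueError — the same
# restriction that makes trio skip its SIGINT handler when run on
# a worker thread.
result = []


def try_install():
    try:
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        result.append("installed")
    except ValueError:
        result.append("ValueError")


t = threading.Thread(target=try_install)
t.start()
t.join()
print(result[0])  # → ValueError
```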
**Legacy-mode subints share the main GIL** (which
our impl uses since `msgspec` lacks PEP 684 support
per `jcrist/msgspec#563`). This means a stuck subint
thread can starve the parent's trio loop during
cancellation — the parent can't even *start* its
teardown handling until the subint yields the GIL.
**Failure modes identified for Phase B.3 audit:**

1. Portal cancel lands cleanly → subint unwinds →
   thread exits → destroy succeeds. (Happy path.)
2. IPC channel is already broken when we try to
   send cancel (e.g., `test_ipc_channel_break_*`)
   → cancel raises `BrokenResourceError` → subint
   keeps running unaware → parent hangs waiting for
   `subint_exited`. This is what breaks
   `test_advanced_faults.py` under subint.
3. Subint is stuck in non-checkpointing Python code
   → portal-cancel msg queued but never processed.
4. Subint is in a shielded cancel scope when cancel
   arrives → delay until shield exits.
**Current teardown has a shield bug too:**
`trio.CancelScope(shield=True)` wrapping the `finally`
block absorbs Ctrl-C, so even when the user tries
to break out they can't. This is the reason
`test_ipc_channel_break_during_stream[break_parent-...
no_msgstream_aclose]` locks up unkillable.
**B.3 hard-kill fix plan (next commit):**

1. Bound `driver_thread.join()` with
   `trio.move_on_after(HARD_KILL_TIMEOUT)`.
2. If it times out, log a warning naming the
   `interp_id`. Since a thread's daemon flag can't
   be flipped after `start()`, create the driver
   thread as `daemon=True` upfront and accept the
   tradeoff that proc-exit won't wait for a stuck
   subint.
3. Best-effort `_interpreters.destroy()`; catch the
   `InterpreterError` if the subint is still running.
4. Document that the leak is real and the only
   escape hatch we have without upstream cooperation.
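Steps 1–2 can be prototyped with a plain bounded `join()` — a
stdlib analog of wrapping the join in `trio.move_on_after()`; the
timeout value, the `stuck` target, and the log message are all
illustrative:

```python
import threading
import time

HARD_KILL_TIMEOUT = 0.2  # illustrative bound, not tractor's value


def stuck():
    time.sleep(1.0)  # stand-in for a subint wedged in
                     # non-checkpointing user code


# daemon=True upfront: a wedged driver thread then can't block
# process exit, at the cost of leaking the subint it still pins.
driver = threading.Thread(target=stuck, daemon=True)
driver.start()
driver.join(timeout=HARD_KILL_TIMEOUT)  # bounded wait, analogous
                                        # to trio.move_on_after()

if driver.is_alive():
    # hard-kill path: give up on a clean _interpreters.destroy()
    print("driver thread still alive after timeout; leaking subint")
```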
**Test plan for Phase B.3:**

New `tests/test_subint_cancellation.py` covering:
- SIGINT at spawn
- SIGINT mid-portal-RPC
- SIGINT during shielded section in subint
- Dead-channel cancel (mirror of `test_ipc_channel_
  break_during_stream` minimized)
- Non-checkpointing subint (tight `while True` in
  user code)
- Per-test `pytest-timeout`-style bounds so the
  tests visibly fail instead of wedging the runner
### Sanity-check output (verbatim terminal excerpts)

Post-fix single-test validation:
```
1 passed, 1 warning in 1.06s
```
(same test that was hanging pre-fix:
`test_parent_cancels[...cancel_method=ctx-...False]`)

Full `tests/test_context_stream_semantics.py`
under subint:
```
61 passed, 1 warning in 100.35s (0:01:40)
```
and a clean-cache re-run:
```
61 passed, 1 warning in 100.82s (0:01:40)
```

No regressions on trio backend (same subset):
```
69 passed, 1 skipped, 3 warnings in 89.19s
```

### Commit msg

Also AI-drafted via `/commit-msg` + `rewrap.py
--width 67`. See `git log -1 26fb820`.