From c041518bdb4aa25fb182ace4bb8cb8c7d49379ab Mon Sep 17 00:00:00 2001 From: goodboy Date: Sat, 18 Apr 2026 16:04:57 -0400 Subject: [PATCH] Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code --- .../20260418T042526Z_26fb820_prompt_io.md | 117 ++++++++++ .../20260418T042526Z_26fb820_prompt_io.raw.md | 220 ++++++++++++++++++ 2 files changed, 337 insertions(+) create mode 100644 ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md create mode 100644 ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md diff --git a/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md new file mode 100644 index 00000000..0cf0e41d --- /dev/null +++ b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md @@ -0,0 +1,117 @@ +--- +model: claude-opus-4-7[1m] +service: claude +session: subints-phase-b2-destroy-race-fix +timestamp: 2026-04-18T04:25:26Z +git_ref: 26fb820 +scope: code +substantive: true +raw_file: 20260418T042526Z_26fb820_prompt_io.raw.md +--- + +## Prompt + +Follow-up to Phase B.2 (`5cd6df58`) after the user +observed intermittent mid-suite hangs when running +the tractor test suite under `--spawn-backend=subint` +on py3.14. The specific sequence of prompts over +several turns: + +1. User pointed at the `test_context_stream_semantics.py` + suite as the first thing to make run clean under + `--spawn-backend=subint`. +2. 
After a series of `timeout`-terminated runs that + gave no diagnostic info, user nudged me to stop + relying on `timeout` and get actual runtime + diagnostics ("the suite hangs indefinitely, so i + don't think this `timeout 30` is helping you at + all.."). Switched to + `faulthandler.dump_traceback_later(...)` and a + resource-tracker fixture to rule out leaks. +3. Captured a stack pinning the hang on + `_interpreters.destroy(interp_id)` in the subint + teardown finally block. +4. Proposed dedicated-OS-thread fix. User greenlit. +5. Implemented + verified on-worktree; user needed + to be pointed at the *worktree*'s `./py313` venv + because bare `pytest` was picking up the main + repo's venv (running un-patched `_subint.py`) and + still hanging. + +Running theme over the whole exchange: this patch +only closes the *destroy race*. The user and I also +traced through the deeper cancellation story — SIGINT +can't reach subints, legacy-mode shares the GIL, +portal-cancel dies when the IPC channel is already +broken — and agreed the next step is a bounded +hard-kill in `subint_proc`'s teardown plus a +dedicated cancellation test suite. Those land as +separate commits. + +## Response summary + +Produced the `tractor/spawn/_subint.py` patch that +landed as commit `26fb8206` ("Fix subint destroy race +via dedicated OS thread"). One file, +110/-84 LOC. + +Mechanism: swap `trio.to_thread.run_sync(_interpreters.exec, ...)` +for a plain `threading.Thread(target=..., daemon=False)`. +The trio thread cache recycles workers — so the OS +thread that ran `_interpreters.exec()` remained alive +in the cache holding a stale subint tstate, blocking +`_interpreters.destroy()` in the finally indefinitely. +A dedicated one-shot thread exits naturally after +the sync target returns, releasing its tstate and +unblocking destroy. 
+ +Coordination across the trio↔thread boundary: +- `trio.lowlevel.current_trio_token()` captured at + `subint_proc` entry +- driver thread signals `subint_exited.set()` back + to parent trio via `trio.from_thread.run_sync(..., + trio_token=token)` (synchronous from the thread's + POV; the call returns after trio has run `.set()`) +- `trio.RunFinishedError` swallowed in that path for + the process-teardown case where parent trio already + exited +- teardown `finally` off-loads the sync + `driver_thread.join()` via `to_thread.run_sync` (a + cache thread carries no subint tstate — safe) + +## Files changed + +See `git diff 26fb820~1..26fb820 --stat`: + +``` + tractor/spawn/_subint.py | 194 +++++++++++++++++++------ + 1 file changed, 110 insertions(+), 84 deletions(-) +``` + +Validation: +- `test_parent_cancels[chk_ctx_result_before_exit=True-cancel_method=ctx-child_returns_early=False]` + (the specific test that was hanging for the user) + — passed in 1.06s. +- Full `tests/test_context_stream_semantics.py` under + subint — 61 passed in 100.35s (clean-cache re-run: + 100.82s). +- Trio backend regression subset — 69 passed / 1 + skipped / 89.19s — no regressions from this change. + +## Non-code output + +Beyond the `_subint.py` patch, the raw log also +records the cancellation-semantics research that +spanned this conversation but did not ship as code +in *this* commit. Preserving it inline under +"Non-code output" because it directly informs the +Phase B.3 hard-kill impl that will follow (and any +upstream CPython bug reports we end up filing). + +## Human edits + +None — committed as generated. The commit message +itself was also AI-drafted via `/commit-msg` and +rewrapped via the project's `rewrap.py --width 67` +tooling; user landed it without edits. 
diff --git a/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md new file mode 100644 index 00000000..6b5da293 --- /dev/null +++ b/ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.raw.md @@ -0,0 +1,220 @@ +--- +model: claude-opus-4-7[1m] +service: claude +timestamp: 2026-04-18T04:25:26Z +git_ref: 26fb820 +diff_cmd: git diff 26fb820~1..26fb820 +--- + +Code generated in this turn was committed as `26fb8206` +("Fix subint destroy race via dedicated OS thread"). +Per diff-ref mode, per-file code is captured via the +pointer below; the rest is verbatim non-code output +— test-session diagnostics, CPython-internals +reasoning, and the design synthesis that's worth +keeping alongside the code since it informs the +still-pending Phase B.3 hard-kill work. + +## Per-file generated content + +### `tractor/spawn/_subint.py` (modified, +110/-84 LOC) + +> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py` + +Rewrites the subint driver-thread strategy: +- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)` + with a plain `threading.Thread(target=_subint_target, + daemon=False)` so the OS thread truly exits after + `_interpreters.exec()` returns +- captures a `trio.lowlevel.current_trio_token()` at + `subint_proc` entry; the driver thread signals + completion back via `trio.from_thread.run_sync( + subint_exited.set, trio_token=...)` +- swallows `trio.RunFinishedError` in the signal path + for the case where the parent trio loop has already + exited (process teardown) +- in the teardown `finally` off-loads the sync + `driver_thread.join()` call to `trio.to_thread.run_sync` + (a cache thread w/ no subint tstate — so no cache + conflict) to wait for the driver thread to fully + exit before calling `_interpreters.destroy()` + +## Non-code output (verbatim) — the CPython-internals research + +### What went wrong before this commit + +Under `--spawn-backend=subint` on py3.14, most single 
+tests passed but longer runs hung intermittently. The +position of the hang moved between runs (test #22 on +one run, test #53 on another), suggesting a +timing-dependent race rather than a deterministic bug. + +`faulthandler.dump_traceback_later()` eventually +caught a stack with the main thread blocked in +`_interpreters.destroy(interp_id)` at `_subint.py:293`. +Only 2 threads were alive: +- main thread waiting in `_interpreters.destroy()` +- one idle trio thread-cache worker in + `trio._core._thread_cache._work` + +No subint was still running (`_interpreters.list_all()` +showed only the main interp). A resource-tracker +pytest fixture confirmed threads/subints did NOT +accumulate across tests — this was not a leak but a +specific "destroy blocks on cached thread w/ stale +tstate" race. + +### Why the race exists + +`trio.to_thread.run_sync` uses a thread *cache* to +avoid OS-thread creation overhead. When the sync +callable returns, the OS thread is NOT terminated — +it's parked in `_thread_cache._work` waiting for the +next job. CPython's subinterpreter implementation +attaches a **tstate** (thread-state object) to each +OS thread that ever entered a subint via +`_interpreters.exec()`. That tstate is released +lazily — either when the thread picks up a new job +(which re-attaches a new tstate, evicting the old +one) or when the thread truly exits. + +`_interpreters.destroy(interp_id)` waits for *all* +tstates associated w/ that subint to be released +before it can proceed. If the cached worker is idle +holding the stale tstate, destroy blocks indefinitely. +Whether the race manifests depends on timing — if +the cached thread happens to pick up another job +quickly, destroy unblocks; if it sits idle, we hang. + +### Why a dedicated `threading.Thread` fixes it + +A plain `threading.Thread(target=_subint_target, +daemon=False)` runs its target once and exits. 
When +the target returns, OS-thread teardown (`_bootstrap_inner` +→ `_bootstrap`) fires and CPython releases the +tstate for that thread. `_interpreters.destroy()` +then has no blocker. + +### Diagnostic tactics that actually helped + +1. `faulthandler.dump_traceback_later(n, repeat=False, + file=open(path, 'w'))` to capture stack dumps on + hang. Critically, write the dump to a `file=` + handle, not stderr — pytest captures stderr weirdly + and the dump is easy to miss. +2. A resource-tracker autouse fixture printing + per-test `threading.active_count()` + + `len(_interpreters.list_all())` deltas → ruled out + leak-accumulation theories quickly. +3. Running the hanging test *solo* vs in-suite — + when solo passes but in-suite hangs, you know + it's a cross-test state-transfer bug rather than + a test-internal bug. + +### Design synthesis — SIGINT + subints + SC + +The user and I walked through the cancellation +semantics of PEP 684/734 subinterpreters in detail. +Key findings we want to preserve: + +**Signal delivery in subints (stdlib limitation).** +CPython's signal machinery only delivers signals +(SIGINT included) to the *main thread of the main +interpreter*. Subints cannot install signal handlers +that will ever fire. This is an intentional design +choice in PEP 684 and not expected to change. For +tractor's subint actors, this means: + +- Ctrl-C never reaches a subint directly. +- `trio.run()` running on a worker thread (as we do + for subints) already skips SIGINT handler install + because `signal.signal()` raises on non-main + threads. +- The only cancellation surface into a subint is + our IPC `Portal.cancel_actor()`. + +**Legacy-mode subints share the main GIL** (the mode +our impl uses, since `msgspec` lacks PEP 684 support +per `jcrist/msgspec#563`). This means a stuck subint +thread can starve the parent's trio loop during +cancellation — the parent can't even *start* its +teardown handling until the subint yields the GIL. 
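The `signal.signal()` restriction noted in the bullets above is easy to demonstrate with the stdlib alone (a standalone demo, not tractor's actual handler-install path):

```python
# Demonstrates why a worker-thread-hosted `trio.run()` must skip its
# SIGINT handler install: CPython refuses handler registration from
# any thread other than the main thread, raising ValueError.
import signal
import threading

results: list[str] = []

def try_install() -> None:
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        results.append('installed')
    except ValueError as exc:
        # e.g. "signal only works in main thread of the main interpreter"
        results.append(f'refused: {exc}')

t = threading.Thread(target=try_install)
t.start()
t.join()
print(results[0])
```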
+ +**Failure modes identified for Phase B.3 audit:** + +1. Portal cancel lands cleanly → subint unwinds → + thread exits → destroy succeeds. (Happy path.) +2. IPC channel is already broken when we try to + send cancel (e.g., `test_ipc_channel_break_*`) + → cancel raises `BrokenResourceError` → subint + keeps running unaware → parent hangs waiting for + `subint_exited`. This is what breaks + `test_advanced_faults.py` under subint. +3. Subint is stuck in non-checkpointing Python code + → portal-cancel msg queued but never processed. +4. Subint is in a shielded cancel scope when cancel + arrives → delay until shield exits. + +**Current teardown has a shield-bug too:** +`trio.CancelScope(shield=True)` wrapping the `finally` +block absorbs Ctrl-C, so even when the user tries +to break out they can't. This is the reason +`test_ipc_channel_break_during_stream[break_parent-...no_msgstream_aclose]` +locks up, unkillable. + +**B.3 hard-kill fix plan (next commit):** + +1. Bound `driver_thread.join()` with + `trio.move_on_after(HARD_KILL_TIMEOUT)`. +2. If it times out, log a warning naming the + `interp_id`. Flipping the thread to `daemon=True` + after `start()` raises `RuntimeError`, so instead + create it as `daemon=True` upfront and accept the + tradeoff that proc-exit won't wait for a stuck + subint. +3. Best-effort `_interpreters.destroy()`; catch the + `InterpreterError` if the subint is still running. +4. Document that the leak is real and the only + escape hatch we have without upstream cooperation. 
+ +**Test plan for Phase B.3:** + +New `tests/test_subint_cancellation.py` covering: +- SIGINT at spawn +- SIGINT mid-portal-RPC +- SIGINT during shielded section in subint +- Dead-channel cancel (a minimized mirror of + `test_ipc_channel_break_during_stream`) +- Non-checkpointing subint (tight `while True` in + user code) +- Per-test `pytest-timeout`-style bounds so the + tests visibly fail instead of wedging the runner + +### Sanity-check output (verbatim terminal excerpts) + +Post-fix single-test validation: +``` +1 passed, 1 warning in 1.06s +``` +(same test that was hanging pre-fix: +`test_parent_cancels[...cancel_method=ctx-...False]`) + +Full `tests/test_context_stream_semantics.py` +under subint: +``` +61 passed, 1 warning in 100.35s (0:01:40) +``` +and a clean-cache re-run: +``` +61 passed, 1 warning in 100.82s (0:01:40) +``` + +No regressions on trio backend (same subset): +``` +69 passed, 1 skipped, 3 warnings in 89.19s +``` + +### Commit msg + +Also AI-drafted via `/commit-msg` + `rewrap.py +--width 67`. See `git log -1 26fb820`.