tractor/ai/prompt-io/claude/20260418T042526Z_26fb820_pr...

---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
diff_cmd: git diff 26fb820~1..26fb820
---

Code generated in this turn was committed as `26fb8206`
("Fix subint destroy race via dedicated OS thread").
Per diff-ref mode, per-file code is captured via the
pointer below; the rest is verbatim non-code output
— test-session diagnostics, CPython-internals
reasoning, and the design synthesis that's worth
keeping alongside the code since it informs the
still-pending Phase B.3 hard-kill work.

## Per-file generated content

### `tractor/spawn/_subint.py` (modified, +110/-84 LOC)

> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py`

Rewrites the subint driver-thread strategy:
- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)`
  with a plain `threading.Thread(target=_subint_target,
  daemon=False)` so the OS thread truly exits after
  `_interpreters.exec()` returns
- captures a `trio.lowlevel.current_trio_token()` at
  `subint_proc` entry; the driver thread signals
  completion back via `trio.from_thread.run_sync(
  subint_exited.set, trio_token=...)`
- swallows `trio.RunFinishedError` in the signal path
  for the case where the parent trio loop has already
  exited (process teardown)
- in the teardown `finally` off-loads the sync
  `driver_thread.join()` call to `trio.to_thread.run_sync`
  (a cache thread w/ no subint tstate — so no cache
  conflict) to wait for the driver thread to fully
  exit before calling `_interpreters.destroy()`

## Non-code output (verbatim) — the CPython-internals research

### What went wrong before this commit

Under `--spawn-backend=subint` on py3.14, most single
tests passed but longer runs hung intermittently. The
position of the hang moved between runs (test #22 on
one run, test #53 on another) suggesting a timing-
dependent race rather than a deterministic bug.

`faulthandler.dump_traceback_later()` eventually
caught a stack with the main thread blocked in
`_interpreters.destroy(interp_id)` at `_subint.py:293`.
Only 2 threads were alive:
- main thread waiting in `_interpreters.destroy()`
- one idle trio thread-cache worker in
  `trio._core._thread_cache._work`

No subint was still running (`_interpreters.list_all()`
showed only the main interp). A resource-tracker
pytest fixture confirmed threads/subints did NOT
accumulate across tests — this was not a leak but a
specific "destroy blocks on cached thread w/ stale
tstate" race.

### Why the race exists

`trio.to_thread.run_sync` uses a thread *cache* to
avoid OS-thread creation overhead. When the sync
callable returns, the OS thread is NOT terminated —
it's parked in `_thread_cache._work` waiting for the
next job. CPython's subinterpreter implementation
attaches a **tstate** (thread-state object) to each
OS thread that ever entered a subint via
`_interpreters.exec()`. That tstate is released
lazily — either when the thread picks up a new job
(which re-attaches a new tstate, evicting the old
one) or when the thread truly exits.

`_interpreters.destroy(interp_id)` waits for *all*
tstates associated w/ that subint to be released
before it can proceed. If the cached worker is idle
holding the stale tstate, destroy blocks indefinitely.
Whether the race manifests depends on timing — if
the cached thread happens to pick up another job
quickly, destroy unblocks; if it sits idle, we hang.

### Why a dedicated `threading.Thread` fixes it

A plain `threading.Thread(target=_subint_target,
daemon=False)` runs its target once and exits. When
the target returns, OS-thread teardown (`_bootstrap_inner`
→ `_bootstrap`) fires and CPython releases the
tstate for that thread. `_interpreters.destroy()`
then has no blocker.

### Diagnostic tactics that actually helped

1. `faulthandler.dump_traceback_later(n, repeat=False,
   file=open(path, 'w'))` for captured stack dumps on
   hang. Critically, pipe to a `file=` not stderr —
   pytest captures stderr weirdly and the dump is
   easy to miss.
2. A resource-tracker autouse fixture printing
   per-test `threading.active_count()` +
   `len(_interpreters.list_all())` deltas → ruled out
   leak-accumulation theories quickly.
3. Running the hanging test *solo* vs in-suite —
   when solo passes but in-suite hangs, you know
   it's a cross-test state-transfer bug rather than
   a test-internal bug.

### Design synthesis — SIGINT + subints + SC

The user and I walked through the cancellation
semantics of PEP 684/734 subinterpreters in detail.
Key findings we want to preserve:

**Signal delivery in subints (stdlib limitation).**
CPython's signal machinery only delivers signals
(SIGINT included) to the *main thread of the main
interpreter*. Subints cannot install signal handlers
that will ever fire. This is an intentional design
choice in PEP 684 and not expected to change. For
tractor's subint actors, this means:

- Ctrl-C never reaches a subint directly.
- `trio.run()` running on a worker thread (as we do
  for subints) already skips SIGINT handler install
  because `signal.signal()` raises on non-main
  threads.
- The only cancellation surface into a subint is
  our IPC `Portal.cancel_actor()`.

**Legacy-mode subints share the main GIL** (which
our impl uses since `msgspec` lacks PEP 684 support
per `jcrist/msgspec#563`). This means a stuck subint
thread can starve the parent's trio loop during
cancellation — the parent can't even *start* its
teardown handling until the subint yields the GIL.

**Failure modes identified for Phase B.3 audit:**

1. Portal cancel lands cleanly → subint unwinds →
   thread exits → destroy succeeds. (Happy path.)
2. IPC channel is already broken when we try to
   send cancel (e.g., `test_ipc_channel_break_*`)
   → cancel raises `BrokenResourceError` → subint
   keeps running unaware → parent hangs waiting for
   `subint_exited`. This is what breaks
   `test_advanced_faults.py` under subint.
3. Subint is stuck in non-checkpointing Python code
   → portal-cancel msg queued but never processed.
4. Subint is in a shielded cancel scope when cancel
   arrives → delay until shield exits.

**Current teardown has a shield-bug too:**
`trio.CancelScope(shield=True)` wrapping the `finally`
block absorbs Ctrl-C, so even when the user tries
to break out they can't. This is the reason
`test_ipc_channel_break_during_stream[break_parent-...
no_msgstream_aclose]` locks up unkillable.

**B.3 hard-kill fix plan (next commit):**

1. Bound `driver_thread.join()` with
   `trio.move_on_after(HARD_KILL_TIMEOUT)`.
2. If it times out, log a warning naming the
   `interp_id` and switch the driver thread to
   `daemon=True` mode (not actually possible after
   start — so instead create as daemon=True upfront
   and accept the tradeoff of proc-exit not waiting
   for a stuck subint).
3. Best-effort `_interpreters.destroy()`; catch the
   `InterpreterError` if the subint is still running.
4. Document that the leak is real and the only
   escape hatch we have without upstream cooperation.

**Test plan for Phase B.3:**

New `tests/test_subint_cancellation.py` covering:
- SIGINT at spawn
- SIGINT mid-portal-RPC
- SIGINT during shielded section in subint
- Dead-channel cancel (mirror of `test_ipc_channel_
  break_during_stream` minimized)
- Non-checkpointing subint (tight `while True` in
  user code)
- Per-test `pytest-timeout`-style bounds so the
  tests visibly fail instead of wedging the runner

### Sanity-check output (verbatim terminal excerpts)

Post-fix single-test validation:
```
1 passed, 1 warning in 1.06s
```
(same test that was hanging pre-fix:
`test_parent_cancels[...cancel_method=ctx-...False]`)

Full `tests/test_context_stream_semantics.py`
under subint:
```
61 passed, 1 warning in 100.35s (0:01:40)
```
and a clean-cache re-run:
```
61 passed, 1 warning in 100.82s (0:01:40)
```

No regressions on trio backend (same subset):
```
69 passed, 1 skipped, 3 warnings in 89.19s
```

### Commit msg

Also AI-drafted via `/commit-msg` + `rewrap.py
--width 67`. See `git log -1 26fb820`.
Add prompt-IO log for subint destroy-race fix Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code 2026-04-18 20:04:57 +00:00			`---`
			`model: claude-opus-4-7[1m]`
			`service: claude`
			`timestamp: 2026-04-18T04:25:26Z`
			`git_ref: 26fb820`
			`diff_cmd: git diff 26fb820~1..26fb820`
			`---`

			Code generated in this turn was committed as `26fb8206`
			`("Fix subint destroy race via dedicated OS thread").`
			`Per diff-ref mode, per-file code is captured via the`
			`pointer below; the rest is verbatim non-code output`
			`— test-session diagnostics, CPython-internals`
			`reasoning, and the design synthesis that's worth`
			`keeping alongside the code since it informs the`
			`still-pending Phase B.3 hard-kill work.`

			`## Per-file generated content`

			### `tractor/spawn/_subint.py` (modified, +110/-84 LOC)

			> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py`

			`Rewrites the subint driver-thread strategy:`
			- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)`
			with a plain `threading.Thread(target=_subint_target,
			daemon=False)` so the OS thread truly exits after
			`_interpreters.exec()` returns
			- captures a `trio.lowlevel.current_trio_token()` at
			`subint_proc` entry; the driver thread signals
			completion back via `trio.from_thread.run_sync(
			subint_exited.set, trio_token=...)`
			- swallows `trio.RunFinishedError` in the signal path
			`for the case where the parent trio loop has already`
			`exited (process teardown)`
			- in the teardown `finally` off-loads the sync
			`driver_thread.join()` call to `trio.to_thread.run_sync`
			`(a cache thread w/ no subint tstate — so no cache`
			`conflict) to wait for the driver thread to fully`
			exit before calling `_interpreters.destroy()`

			`## Non-code output (verbatim) — the CPython-internals research`

			`### What went wrong before this commit`

			Under `--spawn-backend=subint` on py3.14, most single
			`tests passed but longer runs hung intermittently. The`
			`position of the hang moved between runs (test #22 on`
			`one run, test #53 on another) suggesting a timing-`
			`dependent race rather than a deterministic bug.`

			`faulthandler.dump_traceback_later()` eventually
			`caught a stack with the main thread blocked in`
			`_interpreters.destroy(interp_id)` at `_subint.py:293`.
			`Only 2 threads were alive:`
			- main thread waiting in `_interpreters.destroy()`
			`- one idle trio thread-cache worker in`
			`trio._core._thread_cache._work`

			No subint was still running (`_interpreters.list_all()`
			`showed only the main interp). A resource-tracker`
			`pytest fixture confirmed threads/subints did NOT`
			`accumulate across tests — this was not a leak but a`
			`specific "destroy blocks on cached thread w/ stale`
			`tstate" race.`

			`### Why the race exists`

			`trio.to_thread.run_sync` uses a thread cache to
			`avoid OS-thread creation overhead. When the sync`
			`callable returns, the OS thread is NOT terminated —`
			it's parked in `_thread_cache._work` waiting for the
			`next job. CPython's subinterpreter implementation`
			`attaches a tstate (thread-state object) to each`
			`OS thread that ever entered a subint via`
			`_interpreters.exec()`. That tstate is released
			`lazily — either when the thread picks up a new job`
			`(which re-attaches a new tstate, evicting the old`
			`one) or when the thread truly exits.`

			`_interpreters.destroy(interp_id)` waits for all
			`tstates associated w/ that subint to be released`
			`before it can proceed. If the cached worker is idle`
			`holding the stale tstate, destroy blocks indefinitely.`
			`Whether the race manifests depends on timing — if`
			`the cached thread happens to pick up another job`
			`quickly, destroy unblocks; if it sits idle, we hang.`

			### Why a dedicated `threading.Thread` fixes it

			A plain `threading.Thread(target=_subint_target,
			daemon=False)` runs its target once and exits. When
			the target returns, OS-thread teardown (`_bootstrap_inner`
			→ `_bootstrap`) fires and CPython releases the
			tstate for that thread. `_interpreters.destroy()`
			`then has no blocker.`

			`### Diagnostic tactics that actually helped`

			1. `faulthandler.dump_traceback_later(n, repeat=False,
			file=open(path, 'w'))` for captured stack dumps on
			hang. Critically, pipe to a `file=` not stderr —
			`pytest captures stderr weirdly and the dump is`
			`easy to miss.`
			`2. A resource-tracker autouse fixture printing`
			per-test `threading.active_count()` +
			`len(_interpreters.list_all())` deltas → ruled out
			`leak-accumulation theories quickly.`
			`3. Running the hanging test solo vs in-suite —`
			`when solo passes but in-suite hangs, you know`
			`it's a cross-test state-transfer bug rather than`
			`a test-internal bug.`

			`### Design synthesis — SIGINT + subints + SC`

			`The user and I walked through the cancellation`
			`semantics of PEP 684/734 subinterpreters in detail.`
			`Key findings we want to preserve:`

			`Signal delivery in subints (stdlib limitation).`
			`CPython's signal machinery only delivers signals`
			`(SIGINT included) to the *main thread of the main`
			`interpreter*. Subints cannot install signal handlers`
			`that will ever fire. This is an intentional design`
			`choice in PEP 684 and not expected to change. For`
			`tractor's subint actors, this means:`

			`- Ctrl-C never reaches a subint directly.`
			- `trio.run()` running on a worker thread (as we do
			`for subints) already skips SIGINT handler install`
			because `signal.signal()` raises on non-main
			`threads.`
			`- The only cancellation surface into a subint is`
			our IPC `Portal.cancel_actor()`.

			`Legacy-mode subints share the main GIL (which`
			our impl uses since `msgspec` lacks PEP 684 support
			per `jcrist/msgspec#563`). This means a stuck subint
			`thread can starve the parent's trio loop during`
			`cancellation — the parent can't even start its`
			`teardown handling until the subint yields the GIL.`

			`Failure modes identified for Phase B.3 audit:`

			`1. Portal cancel lands cleanly → subint unwinds →`
			`thread exits → destroy succeeds. (Happy path.)`
			`2. IPC channel is already broken when we try to`
			send cancel (e.g., `test_ipc_channel_break_*`)
			→ cancel raises `BrokenResourceError` → subint
			`keeps running unaware → parent hangs waiting for`
			`subint_exited`. This is what breaks
			`test_advanced_faults.py` under subint.
			`3. Subint is stuck in non-checkpointing Python code`
			`→ portal-cancel msg queued but never processed.`
			`4. Subint is in a shielded cancel scope when cancel`
			`arrives → delay until shield exits.`

			`Current teardown has a shield-bug too:`
			`trio.CancelScope(shield=True)` wrapping the `finally`
			`block absorbs Ctrl-C, so even when the user tries`
			`to break out they can't. This is the reason`
			`test_ipc_channel_break_during_stream[break_parent-...
			no_msgstream_aclose]` locks up unkillable.

			`B.3 hard-kill fix plan (next commit):`

			1. Bound `driver_thread.join()` with
			`trio.move_on_after(HARD_KILL_TIMEOUT)`.
			`2. If it times out, log a warning naming the`
			`interp_id` and switch the driver thread to
			`daemon=True` mode (not actually possible after
			`start — so instead create as daemon=True upfront`
			`and accept the tradeoff of proc-exit not waiting`
			`for a stuck subint).`
			3. Best-effort `_interpreters.destroy()`; catch the
			`InterpreterError` if the subint is still running.
			`4. Document that the leak is real and the only`
			`escape hatch we have without upstream cooperation.`

			`Test plan for Phase B.3:`

			New `tests/test_subint_cancellation.py` covering:
			`- SIGINT at spawn`
			`- SIGINT mid-portal-RPC`
			`- SIGINT during shielded section in subint`
			- Dead-channel cancel (mirror of `test_ipc_channel_
			break_during_stream` minimized)
			- Non-checkpointing subint (tight `while True` in
			`user code)`
			- Per-test `pytest-timeout`-style bounds so the
			`tests visibly fail instead of wedging the runner`

			`### Sanity-check output (verbatim terminal excerpts)`

			`Post-fix single-test validation:`
			```
			`1 passed, 1 warning in 1.06s`
			```
			`(same test that was hanging pre-fix:`
			`test_parent_cancels[...cancel_method=ctx-...False]`)

			Full `tests/test_context_stream_semantics.py`
			`under subint:`
			```
			`61 passed, 1 warning in 100.35s (0:01:40)`
			```
			`and a clean-cache re-run:`
			```
			`61 passed, 1 warning in 100.82s (0:01:40)`
			```

			`No regressions on trio backend (same subset):`
			```
			`69 passed, 1 skipped, 3 warnings in 89.19s`
			```

			`### Commit msg`

			Also AI-drafted via `/commit-msg` + `rewrap.py
			--width 67`. See `git log -1 26fb820`.