Add prompt-IO log for subint destroy-race fix

Log the `claude-opus-4-7` session that produced the `_subint.py`
dedicated-thread fix (`26fb8206`). Substantive because the patch
was entirely AI-generated; the raw log also preserves the
CPython-internals research informing the Phase B.3 hard-kill work.

Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md

(this commit msg was generated in part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
branch: subint_spawner_backend
parent: 3869a9b468
commit: b48503085f

ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md
@@ -0,0 +1,117 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subints-phase-b2-destroy-race-fix
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
scope: code
substantive: true
raw_file: 20260418T042526Z_26fb820_prompt_io.raw.md
---

## Prompt

Follow-up to Phase B.2 (`5cd6df58`) after the user
observed intermittent mid-suite hangs when running
the tractor test suite under `--spawn-backend=subint`
on py3.14. The specific sequence of prompts over
several turns:

1. User pointed at the `test_context_stream_semantics.py`
   suite as the first thing to make run clean under
   `--spawn-backend=subint`.
2. After a series of `timeout`-terminated runs that
   gave no diagnostic info, user nudged me to stop
   relying on `timeout` and get actual runtime
   diagnostics ("the suite hangs indefinitely, so i
   don't think this `timeout 30` is helping you at
   all.."). Switched to
   `faulthandler.dump_traceback_later(...)` and a
   resource-tracker fixture to rule out leaks.
3. Captured a stack pinning the hang on
   `_interpreters.destroy(interp_id)` in the subint
   teardown finally block.
4. Proposed dedicated-OS-thread fix. User greenlit.
5. Implemented + verified on-worktree; user needed
   to be pointed at the *worktree*'s `./py313` venv
   because bare `pytest` was picking up the main
   repo's venv (running un-patched `_subint.py`) and
   still hanging.

Running theme over the whole exchange: this patch
only closes the *destroy race*. The user and I also
traced through the deeper cancellation story — SIGINT
can't reach subints, legacy-mode shares the GIL,
portal-cancel dies when the IPC channel is already
broken — and agreed the next step is a bounded
hard-kill in `subint_proc`'s teardown plus a
dedicated cancellation test suite. Those land as
separate commits.
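The resource-tracker fixture from step 2 isn't reproduced in the
log; a minimal stdlib sketch of the same idea (the `track_threads`
name and label are illustrative, not the project's actual fixture):

```python
import contextlib
import threading


@contextlib.contextmanager
def track_threads(label: str):
    # Snapshot the live-thread count around a block of test code; a
    # steadily growing delta across tests would indicate leak
    # accumulation, while a stable count (as seen in the log) points
    # at a one-off race in teardown instead.
    before = threading.active_count()
    try:
        yield
    finally:
        delta = threading.active_count() - before
        print(f"{label}: thread delta {delta:+d}")


with track_threads("demo-test"):
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
# prints "demo-test: thread delta +0" since the joined thread is gone
```

The real fixture additionally printed `len(_interpreters.list_all())`
deltas, which needs py3.13+ and is omitted here.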
## Response summary

Produced the `tractor/spawn/_subint.py` patch landed
as commit `26fb8206` ("Fix subint destroy race via
dedicated OS thread"). One file, +110/-84 LOC.

Mechanism: swap `trio.to_thread.run_sync(_interpreters
.exec, ...)` for a plain `threading.Thread(target=...
, daemon=False)`. The trio thread cache recycles
workers — so the OS thread that ran `_interpreters
.exec()` remained alive in the cache holding a
stale subint tstate, blocking
`_interpreters.destroy()` in the finally indefinitely.
A dedicated one-shot thread exits naturally after
the sync target returns, releasing the tstate and
unblocking destroy.

Coordination across the trio↔thread boundary:
- `trio.lowlevel.current_trio_token()` captured at
  `subint_proc` entry
- driver thread signals `subint_exited.set()` back
  to parent trio via `trio.from_thread.run_sync(...,
  trio_token=token)` (synchronous from the thread's
  POV; the call returns after trio has run `.set()`)
- `trio.RunFinishedError` swallowed in that path for
  the process-teardown case where parent trio already
  exited
- teardown `finally` off-loads the sync
  `driver_thread.join()` via `to_thread.run_sync` (a
  cache thread carries no subint tstate — safe)
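The cache-retention behavior at the heart of the race can be seen
with stdlib pieces alone; here `ThreadPoolExecutor` stands in for
trio's thread cache and the no-op `target` for the blocking
`_interpreters.exec()` call (a sketch, not tractor's actual code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def target():
    pass  # stand-in for the blocking _interpreters.exec() call


# Pooled worker (the shape trio.to_thread.run_sync has): after the
# job returns, the OS thread is parked for reuse — it does NOT exit,
# so any per-thread state it holds (like a stale subint tstate)
# lingers until the thread is reused or truly exits.
pool = ThreadPoolExecutor(max_workers=1)
pool.submit(target).result()
pooled_alive = any(
    t.name.startswith("ThreadPoolExecutor") and t.is_alive()
    for t in threading.enumerate()
)

# Dedicated one-shot thread: exits as soon as the target returns, so
# its thread-local state is torn down promptly and a destroy() that
# waits on it can proceed.
t = threading.Thread(target=target, daemon=False)
t.start()
t.join()

print(pooled_alive, t.is_alive())  # → True False
pool.shutdown()
```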
## Files changed

See `git diff 26fb820~1..26fb820 --stat`:

```
tractor/spawn/_subint.py | 194 +++++++++++++++++++------------
1 file changed, 110 insertions(+), 84 deletions(-)
```

Validation:
- `test_parent_cancels[chk_ctx_result_before_exit=True-
  cancel_method=ctx-child_returns_early=False]`
  (the specific test that was hanging for the user)
  — passed in 1.06s.
- Full `tests/test_context_stream_semantics.py` under
  subint — 61 passed in 100.35s (clean-cache re-run:
  100.82s).
- Trio backend regression subset — 69 passed / 1
  skipped / 89.19s — no regressions from this change.
## Non-code output

Beyond the `_subint.py` patch, the raw log also
records the cancellation-semantics research that
spanned this conversation but did not ship as code
in *this* commit. It is preserved inline under the
raw file's "Non-code output" section because it
directly informs the Phase B.3 hard-kill impl that
will follow (and any upstream CPython bug reports
we end up filing).
## Human edits

None — committed as generated. The commit message
itself was also AI-drafted via `/commit-msg` and
rewrapped via the project's `rewrap.py --width 67`
tooling; user landed it without edits.
20260418T042526Z_26fb820_prompt_io.raw.md
@@ -0,0 +1,220 @@

---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
diff_cmd: git diff 26fb820~1..26fb820
---
Code generated in this turn was committed as `26fb8206`
("Fix subint destroy race via dedicated OS thread").
Per diff-ref mode, per-file code is captured via the
pointer below; the rest is verbatim non-code output
— test-session diagnostics, CPython-internals
reasoning, and the design synthesis that's worth
keeping alongside the code since it informs the
still-pending Phase B.3 hard-kill work.
## Per-file generated content

### `tractor/spawn/_subint.py` (modified, +110/-84 LOC)

> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py`

Rewrites the subint driver-thread strategy:
- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)`
  with a plain `threading.Thread(target=_subint_target,
  daemon=False)` so the OS thread truly exits after
  `_interpreters.exec()` returns
- captures a `trio.lowlevel.current_trio_token()` at
  `subint_proc` entry; the driver thread signals
  completion back via `trio.from_thread.run_sync(
  subint_exited.set, trio_token=...)`
- swallows `trio.RunFinishedError` in the signal path
  for the case where the parent trio loop has already
  exited (process teardown)
- in the teardown `finally`, off-loads the sync
  `driver_thread.join()` call to `trio.to_thread.run_sync`
  (a cache thread w/ no subint tstate — so no cache
  conflict) to wait for the driver thread to fully
  exit before calling `_interpreters.destroy()`
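Stripped of trio and the `_interpreters` module, the ordering the
bullets above describe — run, signal completion, join the thread,
only then destroy — reduces to this stdlib sketch (the `driver`
helper and `run_code` stand-in are illustrative names, not
tractor's):

```python
import threading

subint_exited = threading.Event()


def driver(run_code):
    try:
        run_code()           # stand-in for _interpreters.exec(...)
    finally:
        subint_exited.set()  # in the real patch this signal crosses
                             # back into trio via from_thread.run_sync


t = threading.Thread(target=driver, args=(lambda: None,), daemon=False)
t.start()
subint_exited.wait(timeout=5)  # parent: wait for subint code to finish
t.join(timeout=5)              # then wait for the OS thread to exit
# only now would it be safe to call _interpreters.destroy(interp_id)
print(subint_exited.is_set(), t.is_alive())  # → True False
```

The join matters: the event fires while the thread (and its tstate)
is still alive, so destroying on the event alone re-opens the race.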
## Non-code output (verbatim) — the CPython-internals research

### What went wrong before this commit

Under `--spawn-backend=subint` on py3.14, most single
tests passed but longer runs hung intermittently. The
position of the hang moved between runs (test #22 on
one run, test #53 on another), suggesting a timing-
dependent race rather than a deterministic bug.

`faulthandler.dump_traceback_later()` eventually
caught a stack with the main thread blocked in
`_interpreters.destroy(interp_id)` at `_subint.py:293`.
Only 2 threads were alive:
- main thread waiting in `_interpreters.destroy()`
- one idle trio thread-cache worker in
  `trio._core._thread_cache._work`

No subint was still running (`_interpreters.list_all()`
showed only the main interp). A resource-tracker
pytest fixture confirmed threads/subints did NOT
accumulate across tests — this was not a leak but a
specific "destroy blocks on cached thread w/ stale
tstate" race.
### Why the race exists

`trio.to_thread.run_sync` uses a thread *cache* to
avoid OS-thread creation overhead. When the sync
callable returns, the OS thread is NOT terminated —
it's parked in `_thread_cache._work` waiting for the
next job. CPython's subinterpreter implementation
attaches a **tstate** (thread-state object) to each
OS thread that ever entered a subint via
`_interpreters.exec()`. That tstate is released
lazily — either when the thread picks up a new job
(which re-attaches a new tstate, evicting the old
one) or when the thread truly exits.

`_interpreters.destroy(interp_id)` waits for *all*
tstates associated w/ that subint to be released
before it can proceed. If the cached worker is idle
holding the stale tstate, destroy blocks indefinitely.
Whether the race manifests depends on timing — if
the cached thread happens to pick up another job
quickly, destroy unblocks; if it sits idle, we hang.
### Why a dedicated `threading.Thread` fixes it

A plain `threading.Thread(target=_subint_target,
daemon=False)` runs its target once and exits. When
the target returns, OS-thread teardown (`_bootstrap_inner`
→ `_bootstrap`) fires and CPython releases the
tstate for that thread. `_interpreters.destroy()`
then has no blocker.
### Diagnostic tactics that actually helped

1. `faulthandler.dump_traceback_later(n, repeat=False,
   file=open(path, 'w'))` for captured stack dumps on
   hang. Critically, pipe to a `file=` not stderr —
   pytest captures stderr weirdly and the dump is
   easy to miss.
2. A resource-tracker autouse fixture printing
   per-test `threading.active_count()` +
   `len(_interpreters.list_all())` deltas → ruled out
   leak-accumulation theories quickly.
3. Running the hanging test *solo* vs in-suite —
   when solo passes but in-suite hangs, you know
   it's a cross-test state-transfer bug rather than
   a test-internal bug.
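Tactic 1 can be exercised standalone; a minimal sketch (the 0.2s
timeout, the 0.5s "hang", and the temp-file path are all
illustrative values):

```python
import faulthandler
import os
import tempfile
import time

# Arm a delayed dump: if the process is still running `timeout`
# seconds from now, every thread's stack is written to `file` —
# exactly what pinned the hang on _interpreters.destroy() in the log.
path = os.path.join(tempfile.mkdtemp(), "hang-dump.txt")
with open(path, "w") as f:
    faulthandler.dump_traceback_later(0.2, file=f)
    time.sleep(0.5)  # stand-in for the hung test run
    faulthandler.cancel_dump_traceback_later()

with open(path) as f:
    report = f.read()
print(report.startswith("Timeout"))  # → True
```

The dump bypasses Python-level I/O buffering (it writes to the raw
fd), which is also why pytest's stderr capture can swallow it when
no `file=` is given.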
### Design synthesis — SIGINT + subints + SC

The user and I walked through the cancellation
semantics of PEP 684/734 subinterpreters in detail.
Key findings we want to preserve:

**Signal delivery in subints (stdlib limitation).**
CPython's signal machinery only delivers signals
(SIGINT included) to the *main thread of the main
interpreter*. Subints cannot install signal handlers
that will ever fire. This is an intentional design
choice in PEP 684 and not expected to change. For
tractor's subint actors, this means:

- Ctrl-C never reaches a subint directly.
- `trio.run()` running on a worker thread (as we do
  for subints) already skips SIGINT handler install
  because `signal.signal()` raises on non-main
  threads.
- The only cancellation surface into a subint is
  our IPC `Portal.cancel_actor()`.
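The `signal.signal()` constraint in the second bullet is easy to
confirm from the stdlib alone (a sketch; the handler choice is
arbitrary):

```python
import signal
import threading

# Installing a signal handler from any thread other than the main
# thread of the main interpreter raises ValueError — the same
# restriction that makes trio skip its SIGINT handler when run on
# a worker thread.
result = []


def try_install():
    try:
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        result.append("installed")
    except ValueError:
        result.append("ValueError")


t = threading.Thread(target=try_install)
t.start()
t.join()
print(result[0])  # → ValueError
```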
**Legacy-mode subints share the main GIL** (which
our impl uses since `msgspec` lacks PEP 684 support
per `jcrist/msgspec#563`). This means a stuck subint
thread can starve the parent's trio loop during
cancellation — the parent can't even *start* its
teardown handling until the subint yields the GIL.
**Failure modes identified for Phase B.3 audit:**

1. Portal cancel lands cleanly → subint unwinds →
   thread exits → destroy succeeds. (Happy path.)
2. IPC channel is already broken when we try to
   send cancel (e.g., `test_ipc_channel_break_*`)
   → cancel raises `BrokenResourceError` → subint
   keeps running unaware → parent hangs waiting for
   `subint_exited`. This is what breaks
   `test_advanced_faults.py` under subint.
3. Subint is stuck in non-checkpointing Python code
   → portal-cancel msg queued but never processed.
4. Subint is in a shielded cancel scope when cancel
   arrives → delay until shield exits.
**Current teardown has a shield bug too:**
`trio.CancelScope(shield=True)` wrapping the `finally`
block absorbs Ctrl-C, so even when the user tries
to break out they can't. This is the reason
`test_ipc_channel_break_during_stream[break_parent-...
no_msgstream_aclose]` locks up unkillable.
**B.3 hard-kill fix plan (next commit):**

1. Bound `driver_thread.join()` with
   `trio.move_on_after(HARD_KILL_TIMEOUT)`.
2. If it times out, log a warning naming the
   `interp_id`. Since a thread's daemon flag can't
   be flipped after `start()`, create the driver
   thread as `daemon=True` upfront and accept the
   tradeoff that proc-exit won't wait for a stuck
   subint.
3. Best-effort `_interpreters.destroy()`; catch the
   `InterpreterError` if the subint is still running.
4. Document that the leak is real and the only
   escape hatch we have without upstream cooperation.
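Steps 1–2 can be prototyped with a plain bounded `join()` — a
stdlib analog of wrapping the join in `trio.move_on_after()`; the
timeout value, the `stuck` target, and the log message are all
illustrative:

```python
import threading
import time

HARD_KILL_TIMEOUT = 0.2  # illustrative bound, not tractor's value


def stuck():
    time.sleep(1.0)  # stand-in for a subint wedged in
                     # non-checkpointing user code


# daemon=True upfront: a wedged driver thread then can't block
# process exit, at the cost of leaking the subint it still pins.
driver = threading.Thread(target=stuck, daemon=True)
driver.start()
driver.join(timeout=HARD_KILL_TIMEOUT)  # bounded wait, analogous
                                        # to trio.move_on_after()

if driver.is_alive():
    # hard-kill path: give up on a clean _interpreters.destroy()
    print("driver thread still alive after timeout; leaking subint")
```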
**Test plan for Phase B.3:**

New `tests/test_subint_cancellation.py` covering:
- SIGINT at spawn
- SIGINT mid-portal-RPC
- SIGINT during shielded section in subint
- Dead-channel cancel (mirror of `test_ipc_channel_
  break_during_stream` minimized)
- Non-checkpointing subint (tight `while True` in
  user code)
- Per-test `pytest-timeout`-style bounds so the
  tests visibly fail instead of wedging the runner
### Sanity-check output (verbatim terminal excerpts)

Post-fix single-test validation:
```
1 passed, 1 warning in 1.06s
```
(same test that was hanging pre-fix:
`test_parent_cancels[...cancel_method=ctx-...False]`)

Full `tests/test_context_stream_semantics.py`
under subint:
```
61 passed, 1 warning in 100.35s (0:01:40)
```
and a clean-cache re-run:
```
61 passed, 1 warning in 100.82s (0:01:40)
```

No regressions on trio backend (same subset):
```
69 passed, 1 skipped, 3 warnings in 89.19s
```

### Commit msg

Also AI-drafted via `/commit-msg` + `rewrap.py
--width 67`. See `git log -1 26fb820`.