---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
diff_cmd: git diff 26fb820~1..26fb820
---

Code generated in this turn was committed as `26fb8206` ("Fix subint destroy race via dedicated OS thread"). Per diff-ref mode, per-file code is captured via the pointer below; the rest is verbatim non-code output — test-session diagnostics, CPython-internals reasoning, and the design synthesis that's worth keeping alongside the code, since it informs the still-pending Phase B.3 hard-kill work.

## Per-file generated content

### `tractor/spawn/_subint.py` (modified, +110/-84 LOC)

> `git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py`

Rewrites the subint driver-thread strategy:

- replaces `trio.to_thread.run_sync(_interpreters.exec, ...)` with a plain `threading.Thread(target=_subint_target, daemon=False)` so the OS thread truly exits after `_interpreters.exec()` returns
- captures a `trio.lowlevel.current_trio_token()` at `subint_proc` entry; the driver thread signals completion back via `trio.from_thread.run_sync(subint_exited.set, trio_token=...)`
- swallows `trio.RunFinishedError` in the signal path for the case where the parent trio loop has already exited (process teardown)
- in the teardown `finally`, off-loads the sync `driver_thread.join()` call to `trio.to_thread.run_sync` (a cache thread w/ no subint tstate, so no cache conflict) to wait for the driver thread to fully exit before calling `_interpreters.destroy()`

## Non-code output (verbatim) — the CPython-internals research

### What went wrong before this commit

Under `--spawn-backend=subint` on py3.14, most single tests passed but longer runs hung intermittently. The position of the hang moved between runs (test #22 on one run, test #53 on another), suggesting a timing-dependent race rather than a deterministic bug.
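For orientation, the dedicated-thread strategy from the diff summary above can be sketched in stdlib terms. This is a minimal sketch, not the actual `_subint.py` code: `run_driver` is an illustrative name, a `threading.Event` stands in for the trio-token/`trio.Event` signalling, and the `target()` call stands in for `_interpreters.exec()`.

```python
import threading

def run_driver(target, on_exit):
    """Run `target` on a dedicated, non-daemon OS thread.

    Unlike a cached `trio.to_thread` worker, this thread truly exits
    after `target` returns, so CPython releases its per-thread tstate
    and a later `_interpreters.destroy()` has nothing left to wait on.
    """
    def _driver():
        try:
            target()  # stands in for `_interpreters.exec(...)`
        finally:
            # in the real code: trio.from_thread.run_sync(
            #     subint_exited.set, trio_token=token)
            # wrapped to swallow trio.RunFinishedError on teardown
            on_exit()

    t = threading.Thread(target=_driver, daemon=False)
    t.start()
    return t

# usage sketch
subint_exited = threading.Event()  # stands in for a trio.Event
t = run_driver(lambda: None, subint_exited.set)
t.join()  # the real code off-loads this join via trio.to_thread.run_sync
```

The key property is lifecycle, not signalling: the thread's target runs exactly once, so OS-thread teardown (and tstate release) is guaranteed to follow.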
`faulthandler.dump_traceback_later()` eventually caught a stack with the main thread blocked in `_interpreters.destroy(interp_id)` at `_subint.py:293`. Only 2 threads were alive:

- main thread waiting in `_interpreters.destroy()`
- one idle trio thread-cache worker in `trio._core._thread_cache._work`

No subint was still running (`_interpreters.list_all()` showed only the main interp). A resource-tracker pytest fixture confirmed threads/subints did NOT accumulate across tests — this was not a leak but a specific "destroy blocks on cached thread w/ stale tstate" race.

### Why the race exists

`trio.to_thread.run_sync` uses a thread *cache* to avoid OS-thread creation overhead. When the sync callable returns, the OS thread is NOT terminated — it's parked in `_thread_cache._work` waiting for the next job. CPython's subinterpreter implementation attaches a **tstate** (thread-state object) to each OS thread that ever entered a subint via `_interpreters.exec()`. That tstate is released lazily — either when the thread picks up a new job (which re-attaches a new tstate, evicting the old one) or when the thread truly exits. `_interpreters.destroy(interp_id)` waits for *all* tstates associated w/ that subint to be released before it can proceed. If the cached worker is idle holding the stale tstate, destroy blocks indefinitely. Whether the race manifests depends on timing — if the cached thread happens to pick up another job quickly, destroy unblocks; if it sits idle, we hang.

### Why a dedicated `threading.Thread` fixes it

A plain `threading.Thread(target=_subint_target, daemon=False)` runs its target once and exits. When the target returns, OS-thread teardown (`_bootstrap_inner` → `_bootstrap`) fires and CPython releases the tstate for that thread. `_interpreters.destroy()` then has no blocker.

### Diagnostic tactics that actually helped

1. `faulthandler.dump_traceback_later(n, repeat=False, file=open(path, 'w'))` for captured stack dumps on hang.
   Critically, pipe to a `file=`, not stderr — pytest captures stderr weirdly and the dump is easy to miss.
2. A resource-tracker autouse fixture printing per-test `threading.active_count()` + `len(_interpreters.list_all())` deltas → ruled out leak-accumulation theories quickly.
3. Running the hanging test *solo* vs in-suite — when solo passes but in-suite hangs, you know it's a cross-test state-transfer bug rather than a test-internal bug.

### Design synthesis — SIGINT + subints + SC

The user and I walked through the cancellation semantics of PEP 684/734 subinterpreters in detail. Key findings we want to preserve:

**Signal delivery in subints (stdlib limitation).** CPython's signal machinery only delivers signals (SIGINT included) to the *main thread of the main interpreter*. Subints cannot install signal handlers that will ever fire. This is an intentional design choice in PEP 684 and is not expected to change. For tractor's subint actors, this means:

- Ctrl-C never reaches a subint directly.
- `trio.run()` running on a worker thread (as we do for subints) already skips SIGINT handler install because `signal.signal()` raises on non-main threads.
- The only cancellation surface into a subint is our IPC `Portal.cancel_actor()`.

**Legacy-mode subints share the main GIL** (which our impl uses since `msgspec` lacks PEP 684 support per `jcrist/msgspec#563`). This means a stuck subint thread can starve the parent's trio loop during cancellation — the parent can't even *start* its teardown handling until the subint yields the GIL.

**Failure modes identified for Phase B.3 audit:**

1. Portal cancel lands cleanly → subint unwinds → thread exits → destroy succeeds. (Happy path.)
2. IPC channel is already broken when we try to send cancel (e.g., `test_ipc_channel_break_*`) → cancel raises `BrokenResourceError` → subint keeps running unaware → parent hangs waiting for `subint_exited`. This is what breaks `test_advanced_faults.py` under subint.
3. Subint is stuck in non-checkpointing Python code → portal-cancel msg queued but never processed.
4. Subint is in a shielded cancel scope when cancel arrives → delay until the shield exits.

**Current teardown has a shield bug too:** `trio.CancelScope(shield=True)` wrapping the `finally` block absorbs Ctrl-C, so even when the user tries to break out, they can't. This is why `test_ipc_channel_break_during_stream[break_parent-... no_msgstream_aclose]` locks up unkillable.

**B.3 hard-kill fix plan (next commit):**

1. Bound `driver_thread.join()` with `trio.move_on_after(HARD_KILL_TIMEOUT)`.
2. If it times out, log a warning naming the `interp_id` and switch the driver thread to `daemon=True` mode (not actually possible after start — so instead create it as `daemon=True` upfront and accept the tradeoff that process exit won't wait for a stuck subint).
3. Best-effort `_interpreters.destroy()`; catch the `InterpreterError` if the subint is still running.
4. Document that the leak is real and that this is the only escape hatch we have without upstream cooperation.
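The plan above can be roughed out in stdlib terms. This is a hedged sketch, not the planned trio code: a plain `join(timeout=...)` stands in for the `trio.move_on_after`-bounded join, `destroy` is a hypothetical stand-in for `_interpreters.destroy`, and the `HARD_KILL_TIMEOUT` value is assumed.

```python
import logging
import threading

HARD_KILL_TIMEOUT = 3.0  # seconds; illustrative value

log = logging.getLogger(__name__)

def hard_kill_teardown(driver_thread, interp_id, destroy,
                       timeout=HARD_KILL_TIMEOUT):
    """Bounded teardown: wait, warn on timeout, best-effort destroy.

    `driver_thread` is created `daemon=True` upfront so a wedged
    subint can't block process exit; `destroy` stands in for
    `_interpreters.destroy`.
    """
    driver_thread.join(timeout=timeout)
    if driver_thread.is_alive():
        log.warning(
            "subint %s driver thread still alive after %ss; leaking it",
            interp_id, timeout,
        )
    try:
        destroy(interp_id)
    except Exception:  # the real plan catches InterpreterError specifically
        log.warning("best-effort destroy of subint %s failed", interp_id)

# usage sketch: a stuck "subint" that never checkpoints
stuck = threading.Event()
t = threading.Thread(target=stuck.wait, daemon=True)
t.start()
destroyed = []
hard_kill_teardown(t, 1, destroyed.append, timeout=0.1)
stuck.set()  # release the stand-in thread
```

Passing the timeout as a parameter keeps the leak-vs-hang tradeoff tunable per call site, which matters once test-suite bounds (below) get tighter than the production default.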
**Test plan for Phase B.3:** New `tests/test_subint_cancellation.py` covering:

- SIGINT at spawn
- SIGINT mid-portal-RPC
- SIGINT during a shielded section in the subint
- Dead-channel cancel (a minimized mirror of `test_ipc_channel_break_during_stream`)
- Non-checkpointing subint (tight `while True` in user code)
- Per-test `pytest-timeout`-style bounds so the tests visibly fail instead of wedging the runner

### Sanity-check output (verbatim terminal excerpts)

Post-fix single-test validation:

```
1 passed, 1 warning in 1.06s
```

(same test that was hanging pre-fix: `test_parent_cancels[...cancel_method=ctx-...False]`)

Full `tests/test_context_stream_semantics.py` under subint:

```
61 passed, 1 warning in 100.35s (0:01:40)
```

and a clean-cache re-run:

```
61 passed, 1 warning in 100.82s (0:01:40)
```

No regressions on trio backend (same subset):

```
69 passed, 1 skipped, 3 warnings in 89.19s
```

### Commit msg

Also AI-drafted via `/commit-msg` + `rewrap.py --width 67`. See `git log -1 26fb820`.
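As an appendix to the Phase B.3 test plan: the "bounds so tests visibly fail instead of wedging the runner" idea can be hand-rolled without the `pytest-timeout` plugin. A hedged sketch, with `run_bounded` and `TestTimeout` as illustrative names:

```python
import threading

class TestTimeout(AssertionError):
    """Raised when a bounded test body overruns its budget."""

def run_bounded(fn, timeout):
    """Run `fn` on a daemon thread; fail loudly if it exceeds `timeout`.

    A daemon thread means a wedged body can't hold the runner open;
    the caller raises a visible failure instead of hanging.
    """
    done = threading.Event()
    exc = []

    def _body():
        try:
            fn()
        except BaseException as e:
            exc.append(e)
        finally:
            done.set()

    threading.Thread(target=_body, daemon=True).start()
    if not done.wait(timeout):
        raise TestTimeout(f"test body exceeded {timeout}s; likely wedged")
    if exc:
        raise exc[0]

# usage sketch
run_bounded(lambda: None, timeout=1.0)  # completes normally
try:
    # a body that never returns, like a stuck non-checkpointing subint
    run_bounded(lambda: threading.Event().wait(), timeout=0.2)
except TestTimeout:
    pass  # the wedge surfaces as a visible failure
```

The real suite would likely just use `pytest-timeout` markers; this shows the failure-instead-of-wedge behavior the plan is after.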