tractor/ai/prompt-io/claude/20260418T042526Z_26fb820_pr...


Code generated in this turn was committed as 26fb8206 (“Fix subint destroy race via dedicated OS thread”). Per diff-ref mode, per-file code is captured via the pointer below; the rest is verbatim non-code output — test-session diagnostics, CPython-internals reasoning, and the design synthesis that's worth keeping alongside the code since it informs the still-pending Phase B.3 hard-kill work.

Per-file generated content

tractor/spawn/_subint.py (modified, +110/-84 LOC)

git diff 26fb820~1..26fb820 -- tractor/spawn/_subint.py

Rewrites the subint driver-thread strategy:

  • replaces trio.to_thread.run_sync(_interpreters.exec, ...) with a plain threading.Thread(target=_subint_target, daemon=False) so the OS thread truly exits after _interpreters.exec() returns
  • captures a trio.lowlevel.current_trio_token() at subint_proc entry; the driver thread signals completion back via trio.from_thread.run_sync(subint_exited.set, trio_token=...)
  • swallows trio.RunFinishedError in the signal path for the case where the parent trio loop has already exited (process teardown)
  • in the teardown finally, off-loads the sync driver_thread.join() call to trio.to_thread.run_sync (a cache thread w/ no subint tstate — so no cache conflict) to wait for the driver thread to fully exit before calling _interpreters.destroy()
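The shape of that change can be sketched with stdlib threading alone — here a plain threading.Event stands in for the trio-side subint_exited event (the real code signals it via trio.from_thread.run_sync with a captured trio token), and make_driver/_subint_target are illustrative names only:

```python
import threading

def make_driver(target, exited: threading.Event) -> threading.Thread:
    """Hypothetical sketch of the dedicated driver-thread pattern."""
    def _subint_target():
        try:
            target()      # stands in for _interpreters.exec(...)
        finally:
            # real code: trio.from_thread.run_sync(subint_exited.set,
            # trio_token=token), swallowing trio.RunFinishedError
            exited.set()
    # daemon=False: the OS thread truly exits once target() returns,
    # releasing its tstate so _interpreters.destroy() has no blocker
    return threading.Thread(target=_subint_target, daemon=False)

exited = threading.Event()
t = make_driver(lambda: None, exited)
t.start()
t.join()    # teardown: wait for full thread exit before destroy()
assert exited.is_set() and not t.is_alive()
```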

Non-code output (verbatim) — the CPython-internals research

What went wrong before this commit

Under --spawn-backend=subint on py3.14, most single tests passed but longer runs hung intermittently. The position of the hang moved between runs (test #22 on one run, test #53 on another), suggesting a timing-dependent race rather than a deterministic bug.

faulthandler.dump_traceback_later() eventually caught a stack with the main thread blocked in _interpreters.destroy(interp_id) at _subint.py:293. Only 2 threads were alive:

  • the main thread, waiting in _interpreters.destroy()
  • one idle trio thread-cache worker in trio._core._thread_cache._work

No subint was still running (_interpreters.list_all() showed only the main interp). A resource-tracker pytest fixture confirmed threads/subints did NOT accumulate across tests — this was not a leak but a specific “destroy blocks on cached thread w/ stale tstate” race.

Why the race exists

trio.to_thread.run_sync uses a thread cache to avoid OS-thread creation overhead. When the sync callable returns, the OS thread is NOT terminated — it's parked in _thread_cache._work waiting for the next job. CPython's subinterpreter implementation attaches a tstate (thread-state object) to each OS thread that ever entered a subint via _interpreters.exec(). That tstate is released lazily — either when the thread picks up a new job (which re-attaches a new tstate, evicting the old one) or when the thread truly exits.

_interpreters.destroy(interp_id) waits for all tstates associated w/ that subint to be released before it can proceed. If the cached worker is idle holding the stale tstate, destroy blocks indefinitely. Whether the race manifests depends on timing — if the cached thread happens to pick up another job quickly, destroy unblocks; if it sits idle, we hang.

Why a dedicated threading.Thread fixes it

A plain threading.Thread(target=_subint_target, daemon=False) runs its target once and exits. When the target returns, OS-thread teardown (the _bootstrap_inner/_bootstrap unwind in threading) fires and CPython releases the tstate for that thread. _interpreters.destroy() then has no blocker.
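The lifetime difference is observable without subinterpreters at all — a minimal illustration (cache_worker is a toy stand-in for trio's thread cache, not its real implementation): a parked cache-style worker stays alive after its job returns, while a one-shot thread exits, which is exactly why only the latter releases its tstate:

```python
import queue
import threading
import time

jobs: "queue.Queue" = queue.Queue()

def cache_worker():
    # toy model of a thread-cache loop: run jobs until told to stop
    while True:
        job = jobs.get()
        if job is None:
            return
        job()

worker = threading.Thread(target=cache_worker, daemon=True)
worker.start()
jobs.put(lambda: None)      # job returns, but the worker parks in jobs.get()
time.sleep(0.1)
assert worker.is_alive()    # still holding whatever per-thread state it acquired

oneshot = threading.Thread(target=lambda: None)
oneshot.start()
oneshot.join()
assert not oneshot.is_alive()  # OS-thread teardown ran; a tstate would be freed

jobs.put(None)              # shut the toy worker down cleanly
```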

Diagnostic tactics that actually helped

  1. faulthandler.dump_traceback_later(n, repeat=False, file=open(path, 'w')) to capture stack dumps on hang. Critically, point file= at a real file, not stderr — pytest captures stderr and the dump is easy to miss.
  2. A resource-tracker autouse fixture printing per-test threading.active_count() + len(_interpreters.list_all()) deltas → ruled out leak-accumulation theories quickly.
  3. Running the hanging test solo vs in-suite — when solo passes but in-suite hangs, you know it's a cross-test state-transfer bug rather than a test-internal bug.
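Tactic 1 in miniature — arm a delayed dump pointed at a real file, then cancel it once the critical section completes (on a real hang the file holds the stuck stacks); the 0.2s/0.5s timings here are illustrative only:

```python
import faulthandler
import tempfile
import time

with tempfile.NamedTemporaryFile('w+', suffix='.dump') as f:
    # dump all thread stacks to the file if 0.2s elapse un-cancelled
    faulthandler.dump_traceback_later(0.2, repeat=False, file=f)
    time.sleep(0.5)                     # stand-in for the "hung" section
    faulthandler.cancel_dump_traceback_later()
    f.seek(0)
    content = f.read()

# the dump fired while we slept, so the file has a traceback in it
assert 'most recent call first' in content
```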

Design synthesis — SIGINT + subints + SC

The user and I walked through the cancellation semantics of PEP 684/734 subinterpreters in detail. Key findings we want to preserve:

Signal delivery in subints (stdlib limitation). CPython's signal machinery only delivers signals (SIGINT included) to the main thread of the main interpreter. Subints cannot install signal handlers that will ever fire. This is an intentional design choice in PEP 684 and not expected to change. For tractor's subint actors, this means:

  • Ctrl-C never reaches a subint directly.
  • trio.run() running on a worker thread (as we do for subints) already skips SIGINT handler install because signal.signal() raises on non-main threads.
  • The only cancellation surface into a subint is our IPC Portal.cancel_actor().
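The second bullet is easy to verify directly — signal.signal() refuses handler installation from any non-main thread:

```python
import signal
import threading

caught = []

def try_install():
    try:
        # attempting a SIGINT handler install off the main thread
        signal.signal(signal.SIGINT, lambda *a: None)
    except ValueError as e:
        caught.append(e)

t = threading.Thread(target=try_install)
t.start()
t.join()

# CPython raises ValueError: signal only works in main thread ...
assert caught and 'main thread' in str(caught[0])
```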

Legacy-mode subints share the main GIL (which our impl uses since msgspec lacks PEP 684 support per jcrist/msgspec#563). This means a stuck subint thread can starve the parent's trio loop during cancellation — the parent can't even start its teardown handling until the subint yields the GIL.

Failure modes identified for Phase B.3 audit:

  1. Portal cancel lands cleanly → subint unwinds → thread exits → destroy succeeds. (Happy path.)
  2. IPC channel is already broken when we try to send cancel (e.g., test_ipc_channel_break_*) → cancel raises BrokenResourceError → subint keeps running unaware → parent hangs waiting for subint_exited. This is what breaks test_advanced_faults.py under subint.
  3. Subint is stuck in non-checkpointing Python code → portal-cancel msg queued but never processed.
  4. Subint is in a shielded cancel scope when cancel arrives → delay until shield exits.

Current teardown has a shield bug too: trio.CancelScope(shield=True) wrapping the finally block absorbs Ctrl-C, so even when the user tries to break out they can't. This is the reason test_ipc_channel_break_during_stream[break_parent-... no_msgstream_aclose] locks up unkillable.

B.3 hard-kill fix plan (next commit):

  1. Bound driver_thread.join() with trio.move_on_after(HARD_KILL_TIMEOUT).
  2. If it times out, log a warning naming the interp_id. Switching the driver thread to daemon=True after start isn't possible, so instead create it as daemon=True upfront and accept the tradeoff that process exit won't wait for a stuck subint.
  3. Best-effort _interpreters.destroy(); catch the InterpreterError if the subint is still running.
  4. Document that the leak is real and the only escape hatch we have without upstream cooperation.
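Step 1 of the plan, sketched with a stdlib analogue — the real code would wrap the join in trio.move_on_after(HARD_KILL_TIMEOUT), but Thread.join(timeout) plays the same role here; bounded_reap and the HARD_KILL_TIMEOUT value are hypothetical names for illustration:

```python
import threading
import time

HARD_KILL_TIMEOUT = 0.2   # assumed value; the plan leaves it unspecified

def bounded_reap(driver_thread: threading.Thread) -> bool:
    """Return True if the driver thread exited within the deadline."""
    driver_thread.join(timeout=HARD_KILL_TIMEOUT)
    if driver_thread.is_alive():
        # timed out: per the plan, log naming the interp, skip a
        # blocking destroy(), and accept the documented leak
        return False
    return True

stuck = threading.Thread(target=lambda: time.sleep(5), daemon=True)
stuck.start()
assert bounded_reap(stuck) is False   # deadline hit, thread still alive

quick = threading.Thread(target=lambda: None)
quick.start()
assert bounded_reap(quick) is True    # exited well inside the deadline
```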

Test plan for Phase B.3:

New tests/test_subint_cancellation.py covering:

  • SIGINT at spawn
  • SIGINT mid-portal-RPC
  • SIGINT during a shielded section in the subint
  • dead-channel cancel (a minimized mirror of test_ipc_channel_break_during_stream)
  • non-checkpointing subint (tight while True in user code)
  • per-test pytest-timeout-style bounds so the tests visibly fail instead of wedging the runner
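For the last bullet, one hedged shape such a bound could take (finishes_within is a hypothetical helper, not pytest-timeout itself): run the body on a daemon thread and report failure instead of wedging the runner.

```python
import threading
import time

def finishes_within(fn, deadline: float) -> bool:
    """True if fn() completes within deadline seconds; the daemon
    thread is abandoned (not killed) on timeout so the runner lives."""
    done = threading.Event()
    def _body():
        fn()
        done.set()
    threading.Thread(target=_body, daemon=True).start()
    return done.wait(timeout=deadline)

assert finishes_within(lambda: None, 1.0) is True        # fast test passes
assert finishes_within(lambda: time.sleep(5), 0.1) is False  # wedge becomes a failure
```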

Sanity-check output (verbatim terminal excerpts)

Post-fix single-test validation:

1 passed, 1 warning in 1.06s

(same test that was hanging pre-fix: test_parent_cancels[...cancel_method=ctx-...False])

Full tests/test_context_stream_semantics.py under subint:

61 passed, 1 warning in 100.35s (0:01:40)

and a clean-cache re-run:

61 passed, 1 warning in 100.82s (0:01:40)

No regressions on trio backend (same subset):

69 passed, 1 skipped, 3 warnings in 89.19s

Commit msg

Also AI-drafted via /commit-msg + rewrap.py --width 67. See git log -1 26fb820.