tractor/ai/prompt-io/claude/20260418T042526Z_26fb820_pr...

Prompt

Follow-up to Phase B.2 (5cd6df58) after the user observed intermittent mid-suite hangs when running the tractor test suite under --spawn-backend=subint on py3.14. The specific sequence of prompts over several turns:

  1. User pointed at the test_context_stream_semantics.py suite as the first thing to make run clean under --spawn-backend=subint.
  2. After a series of timeout-terminated runs that gave no diagnostic info, user nudged me to stop relying on timeout and get actual runtime diagnostics (“the suite hangs indefinitely, so i dont think this timeout 30 is helping you at all..”). Switched to faulthandler.dump_traceback_later(...) and a resource-tracker fixture to rule out leaks.
  3. Captured a stack pinning the hang on _interpreters.destroy(interp_id) in the subint teardown finally block.
  4. Proposed dedicated-OS-thread fix. User greenlit.
  5. Implemented + verified on-worktree; user needed to be pointed at the worktree's ./py313 venv because bare pytest was picking up the main repo's venv (running un-patched _subint.py) and still hanging.
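The diagnostic switch in step 2 can be sketched as below. This is a minimal illustration, not the fixture used in the session: `dump_stacks_if_stuck` and `demo` are hypothetical names, and the short sleep only stands in for the hanging teardown.

```python
import faulthandler
import sys
import tempfile
import time


def dump_stacks_if_stuck(timeout_s: float, out=sys.stderr) -> None:
    """Arm a watchdog that dumps every thread's stack after `timeout_s` seconds."""
    # exit=False: report only, leave the (possibly hung) process alive.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def demo() -> str:
    # faulthandler writes straight to a file descriptor, so a real file is
    # required here (an in-memory StringIO would not work).
    with tempfile.TemporaryFile(mode="w+") as out:
        dump_stacks_if_stuck(0.2, out=out)
        time.sleep(0.6)  # hypothetical stand-in for the code that hangs
        faulthandler.cancel_dump_traceback_later()  # disarm before closing `out`
        out.seek(0)
        return out.read()


if __name__ == "__main__":
    print(demo())
```

Unlike an external `timeout`, which only kills the run, the dump names the frame each thread is blocked in, which is the kind of output that pins a hang to a specific call.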

Running theme over the whole exchange: this patch only closes the destroy race. The user and I also traced through the deeper cancellation story (SIGINT can't reach subints, legacy-mode shares the GIL, portal-cancel dies when the IPC channel is already broken) and agreed the next step is a bounded hard-kill in subint_proc's teardown plus a dedicated cancellation test suite. Those land as separate commits.

Response summary

Produced the tractor/spawn/_subint.py patch landed as commit 26fb8206 (“Fix subint destroy race via dedicated OS thread”). One file, +110/-84 LOC.

Mechanism: swap trio.to_thread.run_sync(_interpreters.exec, ...) for a plain threading.Thread(target=..., daemon=False). The trio thread cache recycles workers, so the OS thread that ran _interpreters.exec() remained alive in the cache holding a stale subint tstate, blocking _interpreters.destroy() in the finally indefinitely. A dedicated one-shot thread exits naturally after the sync target returns, releasing tstate and unblocking destroy.
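The thread-lifetime difference behind this can be shown with a stdlib-only analogy. This is not the tractor code: `concurrent.futures.ThreadPoolExecutor` stands in for trio's worker-thread cache, the subinterpreter tstate itself is not modeled, and `work`, `run_in_pool` and `run_dedicated` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def work(seen: list) -> None:
    # Record which OS-level thread object actually ran the task.
    seen.append(threading.current_thread())


def run_in_pool() -> bool:
    """Return True if the worker thread is still alive after its task finished."""
    seen: list = []
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(work, seen).result()
        # The worker is parked waiting for more work, so it is still alive.
        return seen[0].is_alive()
    finally:
        pool.shutdown(wait=True)


def run_dedicated() -> bool:
    """Return True if a one-shot thread is still alive after join()."""
    seen: list = []
    thread = threading.Thread(target=work, args=(seen,), daemon=False)
    thread.start()
    thread.join()
    # The thread exited as soon as its target returned.
    return seen[0].is_alive()


if __name__ == "__main__":
    print("pooled worker alive after task:", run_in_pool())
    print("dedicated thread alive after join:", run_dedicated())
```

A recycled worker outlives the task it ran, so any per-thread state attached during the task can linger with it; a one-shot thread takes that state with it when it exits.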

Coordination across the trio↔thread boundary:

  - trio.lowlevel.current_trio_token() captured at subint_proc entry
  - driver thread signals subint_exited.set() back to parent trio via trio.from_thread.run_sync(..., trio_token=token) (synchronous from the thread's POV; the call returns after trio has run .set())
  - trio.RunFinishedError swallowed in that path for the process-teardown case where parent trio already exited
  - teardown finally off-loads the sync driver_thread.join() via to_thread.run_sync (a cache thread carries no subint tstate, so this is safe)

Files changed

See git diff 26fb820~1..26fb820 --stat:

 tractor/spawn/_subint.py | 194 +++++++++++++++++++------------
 1 file changed, 110 insertions(+), 84 deletions(-)

Validation:

  - test_parent_cancels[chk_ctx_result_before_exit=True-cancel_method=ctx-child_returns_early=False] (the specific test that was hanging for the user): passed in 1.06s.
  - Full tests/test_context_stream_semantics.py under subint: 61 passed in 100.35s (clean-cache re-run: 100.82s).
  - Trio backend regression subset: 69 passed / 1 skipped in 89.19s, with no regressions from this change.

Files changed

Beyond the _subint.py patch, the raw log also records the cancellation-semantics research that spanned this conversation but did not ship as code in this commit. Preserving it inline under “Non-code output” because it directly informs the Phase B.3 hard-kill impl that will follow (and any upstream CPython bug reports we end up filing).

Human edits

None: committed as generated. The commit message itself was also AI-drafted via /commit-msg and rewrapped via the project's rewrap.py --width 67 tooling; user landed it without edits.