Prompt
Follow-up to Phase B.2 (5cd6df58) after the user observed intermittent mid-suite hangs when running the tractor test suite under --spawn-backend=subint on py3.14. The specific sequence of prompts over several turns:
- User pointed at the test_context_stream_semantics.py suite as the first thing to make run clean under --spawn-backend=subint.
- After a series of timeout-terminated runs that gave no diagnostic info, user nudged me to stop relying on timeout and get actual runtime diagnostics (“the suite hangs indefinitely, so i don’t think this timeout 30 is helping you at all..”). Switched to faulthandler.dump_traceback_later(...) and a resource-tracker fixture to rule out leaks.
- Captured a stack pinning the hang on _interpreters.destroy(interp_id) in the subint teardown finally block.
- Proposed dedicated-OS-thread fix. User greenlit.
- Implemented + verified on-worktree; user needed to be pointed at the worktree’s ./py313 venv because bare pytest was picking up the main repo’s venv (running un-patched _subint.py) and still hanging.
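For reference, the faulthandler approach mentioned above can be sketched roughly as follows. This is a stdlib-only illustration, not the fixture used in the session: the helper names run_with_hang_dump, slow_call and demo are hypothetical, and the timeouts are shortened so the dump fires quickly.

```python
import faulthandler
import tempfile
import time


def run_with_hang_dump(fn, timeout_s, log_file):
    """Run fn(); if it is still running after timeout_s seconds,
    dump the stacks of all threads to log_file and keep going."""
    # exit=False: only report, do not kill the process.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_call():
    # Hypothetical stand-in for a test that hangs longer than expected.
    time.sleep(0.5)
    return "done"


def demo():
    # faulthandler writes straight to the file descriptor,
    # so a real file (not StringIO) is required.
    with tempfile.TemporaryFile(mode="w+") as log:
        result = run_with_hang_dump(slow_call, 0.1, log)
        log.seek(0)
        return result, log.read()
```

Unlike an external timeout wrapper, which only terminates the run, the dump shows which frame each thread is blocked in at the moment the deadline passes.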
Running theme over the whole exchange: this patch only closes the destroy race. The user and I also traced through the deeper cancellation story — SIGINT can’t reach subints, legacy-mode shares the GIL, portal-cancel dies when the IPC channel is already broken — and agreed the next step is a bounded hard-kill in subint_proc’s teardown plus a dedicated cancellation test suite. Those land as separate commits.
Response summary
Produced the tractor/spawn/_subint.py patch landed as commit 26fb8206 (“Fix subint destroy race via dedicated OS thread”). One file, +110/-84 LOC.
Mechanism: swap trio.to_thread.run_sync(_interpreters.exec, ...) for a plain threading.Thread(target=..., daemon=False). The trio thread cache recycles workers, so the OS thread that ran _interpreters.exec() remained alive in the cache holding a stale subint tstate, blocking _interpreters.destroy() in the finally indefinitely. A dedicated one-shot thread exits naturally after the sync target returns, releasing tstate and unblocking destroy.
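The thread-lifetime difference behind this can be illustrated with the standard library alone. The sketch below is an analogy, not the patch: concurrent.futures.ThreadPoolExecutor stands in for trio's worker-thread cache (an assumption made so the example needs no subinterpreters), and all function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def work(record):
    # Remember which OS-level thread object ran the task.
    record.append(threading.current_thread())


def pooled_worker_outlives_task():
    """A pooled worker stays alive after its task completes,
    so any per-thread state it holds lingers with it."""
    seen = []
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(work, seen).result()
        return seen[0].is_alive()
    finally:
        pool.shutdown(wait=True)


def dedicated_thread_exits_after_task():
    """A one-shot thread ends as soon as its target returns,
    so its per-thread state is released at that point."""
    seen = []
    thread = threading.Thread(target=work, args=(seen,), daemon=False)
    thread.start()
    thread.join()
    return seen[0].is_alive()
```

Here pooled_worker_outlives_task() returns True and dedicated_thread_exits_after_task() returns False, which mirrors why the recycled worker kept the subint tstate around while the dedicated thread does not.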
Coordination across the trio↔︎thread boundary:
- trio.lowlevel.current_trio_token() captured at subint_proc entry
- driver thread signals subint_exited.set() back to parent trio via trio.from_thread.run_sync(..., trio_token=token) (synchronous from the thread’s POV; the call returns after trio has run .set())
- trio.RunFinishedError swallowed in that path for the process-teardown case where parent trio already exited
- teardown finally off-loads the sync driver_thread.join() via to_thread.run_sync (a cache thread carries no subint tstate, so this is safe)
Files changed
See git diff 26fb820~1..26fb820 --stat:
tractor/spawn/_subint.py | 194 +++++++++++++++++++------------
1 file changed, 110 insertions(+), 84 deletions(-)
Validation:
- test_parent_cancels[chk_ctx_result_before_exit=True-cancel_method=ctx-child_returns_early=False] (the specific test that was hanging for the user): passed in 1.06s.
- Full tests/test_context_stream_semantics.py under subint: 61 passed in 100.35s (clean-cache re-run: 100.82s).
- Trio backend regression subset: 69 passed / 1 skipped / 89.19s, no regressions from this change.
Files changed
Beyond the _subint.py patch, the raw log also records the cancellation-semantics research that spanned this conversation but did not ship as code in this commit. Preserving it inline under “Non-code output” because it directly informs the Phase B.3 hard-kill impl that will follow (and any upstream CPython bug reports we end up filing).
Human edits
None — committed as generated. The commit message itself was also AI-drafted via /commit-msg and rewrapped via the project’s rewrap.py --width 67 tooling; user landed it without edits.