---
model: claude-opus-4-7[1m]
service: claude
session: subints-phase-b2-destroy-race-fix
timestamp: 2026-04-18T04:25:26Z
git_ref: 26fb820
scope: code
substantive: true
raw_file: 20260418T042526Z_26fb820_prompt_io.raw.md
---

## Prompt

Follow-up to Phase B.2 (`5cd6df58`) after the user
observed intermittent mid-suite hangs when running
the tractor test suite under `--spawn-backend=subint`
on py3.14. The specific sequence of prompts over
several turns:

1. User pointed at the `test_context_stream_semantics.py`
   suite as the first thing to make run clean under
   `--spawn-backend=subint`.
2. After a series of `timeout`-terminated runs that
   gave no diagnostic info, user nudged me to stop
   relying on `timeout` and get actual runtime
   diagnostics ("the suite hangs indefinitely, so i
   don't think this `timeout 30` is helping you at
   all.."). Switched to
   `faulthandler.dump_traceback_later(...)` and a
   resource-tracker fixture to rule out leaks.
3. Captured a stack pinning the hang on
   `_interpreters.destroy(interp_id)` in the subint
   teardown finally block.
4. Proposed dedicated-OS-thread fix. User greenlit.
5. Implemented + verified on-worktree; user needed
   to be pointed at the *worktree*'s `./py313` venv
   because bare `pytest` was picking up the main
   repo's venv (running un-patched `_subint.py`) and
   still hanging.

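The switch in step 2 from an external `timeout` to
`faulthandler.dump_traceback_later(...)` can be
sketched as follows. This is a minimal illustration,
not the fixture used in the session: `dump_after` is
a hypothetical helper, and `time.sleep` stands in for
the hanging call.

```python
import faulthandler
import tempfile
import time


def dump_after(timeout_s: float, block_for_s: float) -> str:
    """Arm a watchdog, simulate a hang, return what the watchdog wrote."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # After `timeout_s` seconds a watchdog thread writes the
        # traceback of every thread to `log`; the process keeps running.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            time.sleep(block_for_s)  # stand-in for the hanging call
        finally:
            # Disarm so a fast run produces no dump at all.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()
```

Unlike killing the run with `timeout`, the dump shows
which frame each thread is parked in when the deadline
passes. In a real suite the output would more likely
go to `sys.stderr`; pytest's `faulthandler_timeout`
ini option arms a similar per-test watchdog.
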
Running theme over the whole exchange: this patch
only closes the *destroy race*. The user and I also
traced through the deeper cancellation story (SIGINT
can't reach subints, legacy-mode shares the GIL,
portal-cancel dies when the IPC channel is already
broken) and agreed the next step is a bounded
hard-kill in `subint_proc`'s teardown plus a
dedicated cancellation test suite. Those land as
separate commits.

## Response summary

Produced the `tractor/spawn/_subint.py` patch that
landed as commit `26fb8206` ("Fix subint destroy race
via dedicated OS thread"). One file, +110/-84 LOC.

Mechanism: swap
`trio.to_thread.run_sync(_interpreters.exec, ...)`
for a plain
`threading.Thread(target=..., daemon=False)`.
The trio thread cache recycles workers, so the OS
thread that ran `_interpreters.exec()` remained alive
in the cache holding a stale subint tstate, blocking
`_interpreters.destroy()` in the finally indefinitely.
A dedicated one-shot thread exits naturally after
the sync target returns, releasing tstate and
unblocking destroy.

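The thread-lifetime difference behind the mechanism
above can be seen with the standard library alone.
In this sketch `ThreadPoolExecutor` is only a
stand-in for a recycling worker cache (it is not
trio's thread cache), and `record_thread` is a
hypothetical placeholder for the blocking sync
target.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def record_thread(out: list) -> None:
    # Stand-in for the blocking sync target; records which thread ran it.
    out.append(threading.current_thread())


def pooled_worker_outlives_task() -> bool:
    """True if a pooled worker is still alive after its task returned."""
    seen: list = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(record_thread, seen).result()
        # The task is done, but the worker is parked waiting for reuse.
        return seen[0].is_alive()


def dedicated_thread_outlives_task() -> bool:
    """True if a one-shot thread is still alive after its target returned."""
    seen: list = []
    thread = threading.Thread(target=record_thread, args=(seen,), daemon=False)
    thread.start()
    thread.join()
    # The OS thread exited when the target returned.
    return seen[0].is_alive()
```

The pooled worker stays alive after the task
completes, while the one-shot thread is gone once
`join()` returns; that exit is what the patch relies
on to release the subint tstate.
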
Coordination across the trio↔thread boundary:
- `trio.lowlevel.current_trio_token()` captured at
  `subint_proc` entry
- driver thread signals `subint_exited.set()` back
  to parent trio via `trio.from_thread.run_sync(...,
  trio_token=token)` (synchronous from the thread's
  POV; the call returns after trio has run `.set()`)
- `trio.RunFinishedError` swallowed in that path for
  the process-teardown case where parent trio already
  exited
- teardown `finally` off-loads the sync
  `driver_thread.join()` via `to_thread.run_sync` (a
  cache thread carries no subint tstate, so this is
  safe)

## Files changed

See `git diff 26fb820~1..26fb820 --stat`:

```
 tractor/spawn/_subint.py | 194 +++++++++++++++++++------------
 1 file changed, 110 insertions(+), 84 deletions(-)
```

Validation:
- `test_parent_cancels[chk_ctx_result_before_exit=True-cancel_method=ctx-child_returns_early=False]`
  (the specific test that was hanging for the user):
  passed in 1.06s.
- Full `tests/test_context_stream_semantics.py` under
  subint: 61 passed in 100.35s (clean-cache re-run:
  100.82s).
- Trio backend regression subset: 69 passed, 1
  skipped in 89.19s; no regressions from this change.

## Non-code output

Beyond the `_subint.py` patch, the raw log also
records the cancellation-semantics research that
spanned this conversation but did not ship as code
in *this* commit. Preserving it inline under
"Non-code output" because it directly informs the
Phase B.3 hard-kill impl that will follow (and any
upstream CPython bug reports we end up filing).

## Human edits

None; committed as generated. The commit message
itself was also AI-drafted via `/commit-msg` and
rewrapped via the project's `rewrap.py --width 67`
tooling; user landed it without edits.