---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
diff_cmd: git diff HEAD~1..HEAD
---

Collab between `goodboy` (user) and `claude` (this assistant) spanning multiple test-run iterations on branch `subint_spawner_backend`. The user ran the failing suites, captured `strace` evidence on the hung pytest pids, and set the direction ("these are two different hangs; write them up separately so we don't re-confuse ourselves later"). The assistant aggregated prior-session findings (Phase B.2/B.3 bringup) into two classification docs + test-side cross-links. All prose was jointly iterated; the user had final say on framing and decided which candidate fix directions to list.

## Per-file generated content

### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`

Writes up the "abandoned-legacy-subint thread wedges the parent trio loop" class. Key sections:

- **Symptom**: `test_stale_entry_is_deleted[subint]` hangs indefinitely AND is un-Ctrl-C-able.
- **Evidence**: annotated `strace` excerpt showing SIGINT delivered to pytest; the C-level signal handler tries to write to the signal-wakeup-fd pipe and gets `write() = -1 EAGAIN (Resource temporarily unavailable)`. The pipe is full because the main trio loop isn't iterating often enough to drain it.
- **Root-cause chain**: our hard-kill abandons the `daemon=True` driver OS thread after `_HARD_KILL_TIMEOUT`; the subint *inside* that thread is still running `trio.run()`; `_interpreters.destroy()` cannot force-stop a running subint (raises `InterpreterError`); legacy subints share the main GIL → abandoned subint starves main trio loop → wakeup-fd fills → SIGINT silently dropped.
- **Why it's structurally a CPython limit**: no public force-destroy primitive for a running subint; the only escape is per-interpreter GIL isolation, gated on msgspec PEP 684 adoption (jcrist/msgspec#563).
- **Current escape hatch**: harness-side SIGINT loop in the `daemon` fixture teardown that kills the bg registrar subproc, eventually unblocking a parent-side recv enough for the main loop to drain the wakeup pipe.

### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`

Writes up the *sibling* hang class: same subint backend, distinct root cause.

- **TL;DR**: Ctrl-C-able, so NOT the SIGINT-starvation class; main trio loop iterates fine; ours to fix.
- **Symptom**: `test_subint_non_checkpointing_child` hangs past the expected `_HARD_KILL_TIMEOUT` budget even after the subint is torn down.
- **Diagnosis**: a parent-side trio task (likely a `chan.recv()` in `process_messages`) parks on an orphaned IPC channel; the channel was torn down without emitting a clean EOF / `BrokenResourceError` to the waiting receiver.
- **Candidate fix directions**, listed in rough order of preference:
  1. Explicit parent-side channel abort in `subint_proc`'s hard-kill teardown (surgical; most likely).
  2. Audit `process_messages` to add a timeout or cancel-scope protection that catches the orphaned-recv state.
  3. Wrap subint IPC channel construction in a sentinel that can force-close from the parent side regardless of subint liveness.

### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)

> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`

Wraps the `trio.run(main)` call at the bottom of `test_stale_entry_is_deleted` in `dump_on_hang(seconds=20, path=)`.
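For readers unfamiliar with the pattern, a hang-dump wrapper of this shape can be sketched on top of the stdlib `faulthandler` module. This is only an illustration: `dump_on_hang_sketch` is a hypothetical stand-in, and the project's actual `dump_on_hang` helper may be implemented quite differently.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    # Hypothetical stand-in, NOT the project's `dump_on_hang`.
    # Write to a real file so the dump bypasses pytest's stderr capture.
    with open(path, "w") as f:
        # Arm a watchdog: if the body is still running after `seconds`,
        # the tracebacks of all threads are written to `f`.
        faulthandler.dump_traceback_later(seconds, repeat=False, file=f, exit=False)
        try:
            yield path
        finally:
            # The body finished (or raised): disarm the watchdog.
            faulthandler.cancel_dump_traceback_later()
```

Usage would look like `with dump_on_hang_sketch(20, "/tmp/hang.txt"): trio.run(main)`, where the path is a placeholder. Note that `faulthandler` keeps a single global watchdog, so this sketch would replace any timeout already armed elsewhere in the process.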
Adds a long inline comment that:

- Enumerates variant-by-variant status (`[trio]`/`[mp_*]` = clean; `[subint]` = hangs + un-Ctrl-C-able)
- Summarizes the `strace` evidence and root-cause chain inline (so a future reader hitting this test doesn't need to cross-ref the doc to understand the hang shape)
- Points at `ai/conc-anal/subint_sigint_starvation_issue.md` for full analysis
- Cross-links to the *sibling* `subint_cancel_delivery_hang_issue.md` so readers can tell the two classes apart
- Explains why it's kept un-`skip`ped: the dump file is useful if the hang ever returns after a refactor. pytest stderr capture would otherwise eat `faulthandler` output, hence the file path.

### `tests/test_subint_cancellation.py` (modified, +26 LOC)

> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`

Extends the docstring of `test_subint_non_checkpointing_child` with a "KNOWN ISSUE (Ctrl-C-able hang)" block:

- Describes the current hang: parent-side orphaned IPC recv after hard-kill; distinct from the SIGINT-starvation sibling class.
- Cites the distinguishing `strace` signal: wakeup-fd `write() = 1` (not `EAGAIN`), i.e. the main loop is iterating.
- Points at `ai/conc-anal/subint_cancel_delivery_hang_issue.md` for full analysis + candidate fix directions.
- Clarifies that the *other* sibling doc (SIGINT-starvation) is NOT what this test hits.

## Non-code output

### Classification reasoning (why two docs, not one)

The user and I converged on the two-doc split after running the suites and noticing two *qualitatively different* hang symptoms:

1. `test_stale_entry_is_deleted[subint]`: pytest process un-Ctrl-C-able. Ctrl-C at the terminal does nothing. Must `kill -9` from another shell.
2. `test_subint_non_checkpointing_child`: pytest process Ctrl-C-able. One Ctrl-C at the prompt unblocks cleanly and the test reports a hang via pytest-timeout.

From the user: "These cannot be the same bug. Different fix paths. Write them up separately or we'll keep conflating them."
`strace` on the `[subint]` hang gave the decisive signal for the first class:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
```

fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN` on a `write()` of 1 byte to a pipe means the pipe buffer is full → reader side (main Python thread inside `trio.run()`) isn't consuming. That's the GIL-hostage signature.

The second class's `strace` showed `write(5, "\2", 1) = 1` (clean drain), so the main trio loop was iterating and the hang had to be on the application side of things, not the kernel-↔-Python signal boundary.

### Why the candidate fix for class 2 is "explicit parent-side channel abort"

The second hang class has the trio loop alive. A parked `chan.recv()` that will never get bytes is fundamentally a tractor-side resource-lifetime bug: the IPC channel was torn down (subint destroyed) but no one explicitly raised `BrokenResourceError` at the parent-side receiver. The `subint_proc` hard-kill path is the natural place to add that notification, because it already knows the subint is unreachable at that point.

Alternative fix paths (blanket timeouts on `process_messages`, sentinel-wrapped channels) are less surgical and risk masking unrelated bugs, hence the preference ordering in the doc.

### Why we're not just patching the code now

The user explicitly deferred the fix to a later commit: "Document both classes now, land the fix for class 2 separately so the diff reviews clean." This matches the incremental-commits preference from memory.
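As a footnote to the classification reasoning, the two `strace` signatures (`EAGAIN` vs. a successful 1-byte write) can be reproduced without trio or subints. The sketch below is a simplification under stated assumptions: it uses a plain `os.pipe()` as a stand-in for the wakeup fd, and `wakeup_pipe_demo` is a hypothetical name, not anything in the codebase.

```python
import errno
import os


def wakeup_pipe_demo() -> tuple[int, int, int]:
    """Fill an unread non-blocking pipe, then drain it and retry."""
    r, w = os.pipe()
    os.set_blocking(w, False)  # like a wakeup fd: the writer must never block
    written = 0
    try:
        # Class 1 shape: nobody reads, so 1-byte writes eventually fail.
        while True:
            try:
                written += os.write(w, b"\x02")
            except BlockingIOError as err:
                full_errno = err.errno
                break
        # Class 2 shape: the reader drains the buffer, so the next
        # 1-byte write goes through again.
        os.read(r, written)
        after_drain = os.write(w, b"\x02")
        return written, full_errno, after_drain
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    n, err, ok = wakeup_pipe_demo()
    print(f"{n} bytes buffered before {errno.errorcode[err]}; "
          f"write after drain returned {ok}")
```

The number of bytes buffered before the failure depends on the platform's pipe capacity, so the sketch reports it rather than assuming a value.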