--- model: claude-opus-4-7[1m] service: claude session: subint-phase-b-hang-classification timestamp: 2026-04-20T19:27:39Z git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`) scope: docs substantive: true raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md --- ## Prompt Collab between user (`goodboy`) and `claude` to disambiguate two distinct hang modes hit during Phase B subint-spawn-backend bringup (issue #379). The user ran the failing suites, captured `strace` evidence on hung pytest pids, and set the framing: > "These cannot be the same bug. Different fix > paths. Write them up separately or we'll keep > conflating them." Follow-on asks: - Cross-link each doc from its reproducer test so a future triager lands on the analysis without needing to grep `ai/conc-anal/`. - On `test_stale_entry_is_deleted`: arm `dump_on_hang(seconds=20, path=...)` so a regression captures a stack dump (pytest stderr capture otherwise eats `faulthandler` output). - Keep `test_stale_entry_is_deleted` un-`skip`ped so the dump file remains inspectable. Defer the actual fix for the class-2 hang (Ctrl-C-able, tractor-side bug) to a later commit per the incremental-commits preference. ## Response summary Wrote two new `ai/conc-anal/` classification docs and cross-linked them from the reproducer tests: 1. **`subint_sigint_starvation_issue.md`** — class 1: abandoned-legacy-subint thread + shared GIL starves main trio loop → signal-wakeup-fd pipe fills → SIGINT silently dropped (`write() = EAGAIN`). Pytest process un-Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 support (jcrist/msgspec#563). Reproducer: `test_stale_entry_is_deleted[subint]`. 2. **`subint_cancel_delivery_hang_issue.md`** — class 2: parent-side trio task parks on an orphaned IPC channel after subint teardown; no clean EOF delivered to waiting receiver. Ctrl-C-able (main trio loop iterating fine). OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown. Reproducer: `test_subint_non_checkpointing_child`. Test-side cross-links: - `tests/discovery/test_registrar.py`: `test_stale_entry_is_deleted` → `trio.run(main)` wrapped in `dump_on_hang(seconds=20, path=)`; long inline comment summarizes `strace` evidence + root-cause chain and points at both docs. - `tests/test_subint_cancellation.py`: `test_subint_non_checkpointing_child` docstring extended with "KNOWN ISSUE (Ctrl-C-able hang)" section pointing at the class-2 doc + noting the class-1 doc is NOT what this test hits. ## Files changed - `ai/conc-anal/subint_sigint_starvation_issue.md` — new, 205 LOC - `ai/conc-anal/subint_cancel_delivery_hang_issue.md` — new, 161 LOC - `tests/discovery/test_registrar.py` — +52/-1 (arm `dump_on_hang`, inline-comment cross-link) - `tests/test_subint_cancellation.py` — +26 (docstring "KNOWN ISSUE" block) ## Human edits Substantive collab — prose was jointly iterated: - User framed the two-doc split, set the classification criteria (Ctrl-C-able vs not), and provided the `strace` evidence. - User decided to keep `test_stale_entry_is_deleted` un-`skip`ped (my initial suggestion was `pytestmark.skipif(spawn_backend=='subint')`). - User chose the candidate fix ordering for class 2 and marked "explicit parent-side channel abort" as the surgical preferred fix. - User picked the file naming convention (`subint__issue.md`) over my initial `hang_class_{1,2}.md`. - Assistant drafted the prose, aggregated prior- session root-cause findings from Phase B.2/B.3 bringup, and wrote the test-side cross-linking comments. No further mechanical edits expected before commit; user may still rewrap via `scripts/rewrap.py` if preferred.