Prompt
Collab between user (goodboy) and claude to disambiguate two distinct hang modes hit during Phase B subint-spawn-backend bringup (issue #379). The user ran the failing suites, captured strace evidence on hung pytest pids, and set the framing:
“These cannot be the same bug. Different fix paths. Write them up separately or we’ll keep conflating them.”
Follow-on asks:
- Cross-link each doc from its reproducer test so a future triager lands on the analysis without needing to grep ai/conc-anal/.
- On test_stale_entry_is_deleted: arm dump_on_hang(seconds=20, path=...) so a regression captures a stack dump (pytest stderr capture otherwise eats faulthandler output).
- Keep test_stale_entry_is_deleted un-skipped so the dump file remains inspectable.
Defer the actual fix for the class-2 hang (Ctrl-C-able, tractor-side bug) to a later commit per the incremental-commits preference.
Response summary
Wrote two new ai/conc-anal/ classification docs and cross-linked them from the reproducer tests:
- subint_sigint_starvation_issue.md (class 1): abandoned-legacy-subint thread + shared GIL starves main trio loop → signal-wakeup-fd pipe fills → SIGINT silently dropped (write() = EAGAIN). Pytest process un-Ctrl-C-able. Structurally a CPython limit; blocked on msgspec PEP 684 support (jcrist/msgspec#563). Reproducer: test_stale_entry_is_deleted[subint].
- subint_cancel_delivery_hang_issue.md (class 2): parent-side trio task parks on an orphaned IPC channel after subint teardown; no clean EOF delivered to waiting receiver. Ctrl-C-able (main trio loop iterating fine). OUR bug to fix. Candidate fix: explicit parent-side channel abort in subint_proc’s hard-kill teardown. Reproducer: test_subint_non_checkpointing_child.
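The last link of the class-1 chain (wakeup pipe fills, write() returns EAGAIN) can be reproduced in isolation. The following is a minimal sketch, not the tractor or trio code path: it uses a plain os.pipe() as a stand-in for the signal wakeup fd, and the helper name fill_nonblocking_pipe is hypothetical. It only shows that a non-blocking writer with no reader eventually fails with EAGAIN, which is the point at which a further signal byte would be dropped.

```python
import errno
import os
import signal


def fill_nonblocking_pipe() -> tuple[int, int]:
    """Write signal-number bytes into a non-blocking pipe nobody drains.

    Hypothetical helper for illustration only. Returns
    (bytes_written, errno_reported_once_the_pipe_is_full).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # wakeup fds must be non-blocking
    payload = bytes([signal.SIGINT])  # CPython writes the signal number as one byte
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, payload)
            except BlockingIOError as err:
                # The reader (the starved event loop, in the class-1 case)
                # never drained the pipe, so the kernel buffer is full.
                return written, err.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


bytes_written, full_errno = fill_nonblocking_pipe()
print(f"pipe full after {bytes_written} bytes: {errno.errorcode[full_errno]}")
```

The byte count depends on the platform's pipe buffer size, so the sketch reports it rather than assuming a value.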
Test-side cross-links:
- tests/discovery/test_registrar.py: test_stale_entry_is_deleted → trio.run(main) wrapped in dump_on_hang(seconds=20, path=<per-method-tmp>); long inline comment summarizes strace evidence + root-cause chain and points at both docs.
- tests/test_subint_cancellation.py: test_subint_non_checkpointing_child docstring extended with “KNOWN ISSUE (Ctrl-C-able hang)” section pointing at the class-2 doc + noting the class-1 doc is NOT what this test hits.
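For readers unfamiliar with the dump-to-file pattern, here is a rough sketch of how a helper shaped like dump_on_hang could work. The real helper's implementation is not shown in this log, so dump_on_hang_sketch below is a hypothetical stand-in built on the stdlib faulthandler module; the short timeout and the time.sleep() "hang" are illustrative values only.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    """Hypothetical stand-in for dump_on_hang(seconds=..., path=...).

    If the wrapped block is still running after `seconds`, faulthandler
    writes every thread's stack to `path`. Writing to a real file avoids
    the dump being swallowed by pytest's stderr capture.
    """
    with open(path, "w") as dump_file:
        faulthandler.dump_traceback_later(seconds, file=dump_file, exit=False)
        try:
            yield
        finally:
            # Disarm before the file is closed so a late timer never
            # writes to a closed file descriptor.
            faulthandler.cancel_dump_traceback_later()


# Simulate a hang that outlives the watchdog timeout.
dump_path = os.path.join(tempfile.mkdtemp(), "hang.dump")
with dump_on_hang_sketch(seconds=0.2, path=dump_path):
    time.sleep(0.6)

dump_text = Path(dump_path).read_text()
print(dump_text)
```

If the block finishes before the timeout, the file is left empty, which makes "did it hang?" easy to check after the fact.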
Files changed
- ai/conc-anal/subint_sigint_starvation_issue.md: new, 205 LOC
- ai/conc-anal/subint_cancel_delivery_hang_issue.md: new, 161 LOC
- tests/discovery/test_registrar.py: +52/-1 (arm dump_on_hang, inline-comment cross-link)
- tests/test_subint_cancellation.py: +26 (docstring “KNOWN ISSUE” block)
Human edits
Substantive collab — prose was jointly iterated:
- User framed the two-doc split, set the classification criteria (Ctrl-C-able vs not), and provided the strace evidence.
- User decided to keep test_stale_entry_is_deleted un-skipped (my initial suggestion was pytestmark.skipif(spawn_backend=='subint')).
- User chose the candidate fix ordering for class 2 and marked “explicit parent-side channel abort” as the surgical preferred fix.
- User picked the file naming convention (subint_<hang-shape>_issue.md) over my initial hang_class_{1,2}.md.
- Assistant drafted the prose, aggregated prior-session root-cause findings from Phase B.2/B.3 bringup, and wrote the test-side cross-linking comments.
No further mechanical edits expected before commit; user may still rewrap via scripts/rewrap.py if preferred.