tractor/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_p...

Prompt

Collab between user (goodboy) and claude to disambiguate two distinct hang modes hit during Phase B subint-spawn-backend bringup (issue #379). The user ran the failing suites, captured strace evidence on hung pytest pids, and set the framing:

“These cannot be the same bug. Different fix paths. Write them up separately or we'll keep conflating them.”

Follow-on asks:

  • Cross-link each doc from its reproducer test so a future triager lands on the analysis without needing to grep ai/conc-anal/.
  • On test_stale_entry_is_deleted, arm dump_on_hang(seconds=20, path=...) so a regression captures a stack dump (pytest's stderr capture otherwise eats faulthandler output).
  • Keep test_stale_entry_is_deleted un-skipped so the dump file remains inspectable.
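The idea behind arming dump_on_hang can be sketched with only the stdlib faulthandler module. This is a minimal sketch under stated assumptions: dump_on_hang_sketch is a hypothetical name, and tractor's real dump_on_hang may differ in signature and behavior. What it shows is why a file path matters. The watchdog writes straight to a descriptor you choose, so the dump never passes through the stderr stream that pytest captures.

```python
import contextlib
import faulthandler
import tempfile
import time
from pathlib import Path


@contextlib.contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    """Hypothetical stand-in for tractor's dump_on_hang().

    If the body is still running after `seconds`, faulthandler's
    watchdog thread writes every thread's stack to `path`. It writes
    straight to the file's descriptor, so pytest's stderr capture
    never gets a chance to swallow it.
    """
    with open(path, "w") as f:
        faulthandler.dump_traceback_later(seconds, repeat=False, file=f)
        try:
            yield
        finally:
            # Disarm the watchdog whether the body returned or raised.
            faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp()) / "hang-dump.txt"
    with dump_on_hang_sketch(seconds=0.2, path=str(out)):
        time.sleep(1)  # simulated hang that outlives the deadline
    print(out.read_text())
```

The bytes reach the kernel through a plain write() on the fd. That is what keeps a per-test dump file inspectable even if the hung run is later killed.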

Defer the actual fix for the class-2 hang (Ctrl-C-able, tractor-side bug) to a later commit per the incremental-commits preference.

Response summary

Wrote two new ai/conc-anal/ classification docs and cross-linked them from the reproducer tests:

  1. subint_sigint_starvation_issue.md — class 1: abandoned-legacy-subint thread + shared GIL starves main trio loop → signal-wakeup-fd pipe fills → SIGINT silently dropped (write() = EAGAIN). Pytest process un-Ctrl-C-able. Structurally a CPython limit; blocked on msgspec PEP 684 support (jcrist/msgspec#563). Reproducer: test_stale_entry_is_deleted[subint].

  2. subint_cancel_delivery_hang_issue.md — class 2: parent-side trio task parks on an orphaned IPC channel after subint teardown; no clean EOF delivered to waiting receiver. Ctrl-C-able (main trio loop iterating fine). OUR bug to fix. Candidate fix: explicit parent-side channel abort in subint_procs hard-kill teardown. Reproducer: test_subint_non_checkpointing_child.
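The class-2 shape and its candidate fix can be illustrated outside tractor. This sketch swaps trio for asyncio streams, an assumption made so it runs on the stdlib alone. A socketpair whose child end is never closed stands in for the torn-down subint, and parent_waits_on_orphaned_channel is a hypothetical name. It shows two outcomes:

  • A receiver parked on a channel whose peer simply vanished keeps waiting.
  • An explicit parent-side close wakes the pending read with EOF (b'').

In trio or tractor, the abort would likely surface as a different exception or return value.

```python
import asyncio
import socket
from typing import Optional


async def parent_waits_on_orphaned_channel(abort: bool) -> Optional[bytes]:
    """Hypothetical model of the class-2 hang and its candidate fix.

    asyncio streams stand in for tractor's trio-based IPC channel.
    `child_sock` plays the torn-down subint: it is never closed, so
    no EOF ever reaches the parent on its own.
    """
    parent_sock, child_sock = socket.socketpair()
    reader, writer = await asyncio.open_connection(sock=parent_sock)
    try:
        # Parent-side task parks on receive, like the hung trio task.
        recv_task = asyncio.create_task(reader.read(1024))
        await asyncio.sleep(0.05)  # hard-kill teardown happens "here"

        if abort:
            # Candidate fix: abort the parent-side channel explicitly
            # instead of waiting for an EOF the dead peer never sends.
            writer.close()
            await writer.wait_closed()

        try:
            return await asyncio.wait_for(recv_task, timeout=0.5)
        except asyncio.TimeoutError:
            return None  # still parked: the class-2 hang
    finally:
        writer.close()
        await writer.wait_closed()
        child_sock.close()


if __name__ == "__main__":
    print(asyncio.run(parent_waits_on_orphaned_channel(abort=False)))  # None
    print(asyncio.run(parent_waits_on_orphaned_channel(abort=True)))   # b''
```

The timeout only exists so the demo terminates. In the real class-2 hang, nothing bounds the wait, which is why the abort has to come from the parent's teardown path.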

Test-side cross-links:

  • tests/discovery/test_registrar.py: in test_stale_entry_is_deleted, trio.run(main) is wrapped in dump_on_hang(seconds=20, path=<per-method-tmp>); a long inline comment summarizes the strace evidence and root-cause chain and points at both docs.
  • tests/test_subint_cancellation.py: the test_subint_non_checkpointing_child docstring is extended with a “KNOWN ISSUE (Ctrl-C-able hang)” section that points at the class-2 doc and notes that the class-1 doc is NOT what this test hits.
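The last link of the class-1 root-cause chain, the write() = EAGAIN seen in strace, can be reproduced on its own. This is a POSIX-only sketch. It uses os.pipe as a stand-in for the fd registered through signal.set_wakeup_fd(), and fill_wakeup_pipe is a hypothetical helper name. It does not reproduce the GIL starvation itself; it models only the effect. Nobody drains the read end, so once the buffer is full, the one-byte write that CPython's C-level signal handler performs per signal fails with EAGAIN.

```python
import errno
import os


def fill_wakeup_pipe():
    """Hypothetical helper: fill a non-blocking pipe that nobody reads.

    Models a wakeup fd whose reader (the main trio loop) is starved.
    Returns (bytes_buffered, errno_of_the_write_that_failed).
    """
    r, w = os.pipe()
    os.set_blocking(w, False)  # set_wakeup_fd() requires a non-blocking fd
    buffered = 0
    try:
        while True:
            try:
                # One byte per write, like CPython's C-level handler,
                # which writes the signal number as a single byte.
                buffered += os.write(w, b"\x02")  # 2 == SIGINT on POSIX
            except BlockingIOError as exc:
                # Buffer full: this byte is lost (strace shows EAGAIN).
                return buffered, exc.errno
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    n, err = fill_wakeup_pipe()
    print(n, errno.errorcode[err])
```

If the fd was registered with warn_on_full_buffer=False, CPython suppresses the warning it would otherwise print for this failed write. In that case nothing on stderr flags the lost byte.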

Files changed

  • ai/conc-anal/subint_sigint_starvation_issue.md — new, 205 LOC
  • ai/conc-anal/subint_cancel_delivery_hang_issue.md — new, 161 LOC
  • tests/discovery/test_registrar.py — +52/-1 (arm dump_on_hang, inline-comment cross-link)
  • tests/test_subint_cancellation.py — +26 (docstring “KNOWN ISSUE” block)

Human edits

Substantive collab — prose was jointly iterated:

  • User framed the two-doc split, set the classification criteria (Ctrl-C-able vs not), and provided the strace evidence.
  • User decided to keep test_stale_entry_is_deleted un-skipped (my initial suggestion was pytestmark.skipif(spawn_backend=='subint')).
  • User chose the candidate fix ordering for class 2 and marked “explicit parent-side channel abort” as the preferred surgical fix.
  • User picked the file naming convention (subint_<hang-shape>_issue.md) over my initial hang_class_{1,2}.md.
  • Assistant drafted the prose, aggregated prior-session root-cause findings from Phase B.2/B.3 bringup, and wrote the test-side cross-linking comments.

No further mechanical edits expected before commit; user may still rewrap via scripts/rewrap.py if preferred.