tractor/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_p...

112 lines
3.8 KiB
Markdown
Raw Normal View History

---
model: claude-opus-4-7[1m]
service: claude
session: subint-phase-b-hang-classification
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`)
scope: docs
substantive: true
raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md
---
## Prompt
Collab between user (`goodboy`) and `claude` to
disambiguate two distinct hang modes hit during
Phase B subint-spawn-backend bringup (issue #379).
The user ran the failing suites, captured `strace`
evidence on hung pytest pids, and set the framing:
> "These cannot be the same bug. Different fix
> paths. Write them up separately or we'll keep
> conflating them."
Follow-on asks:
- Cross-link each doc from its reproducer test so
a future triager lands on the analysis without
needing to grep `ai/conc-anal/`.
- On `test_stale_entry_is_deleted`: arm
`dump_on_hang(seconds=20, path=...)` so a
regression captures a stack dump (pytest stderr
capture otherwise eats `faulthandler` output).
- Keep `test_stale_entry_is_deleted` un-`skip`ped
so the dump file remains inspectable.
Defer the actual fix for the class-2 hang
(Ctrl-C-able, tractor-side bug) to a later commit
per the incremental-commits preference.
## Response summary
Wrote two new `ai/conc-anal/` classification docs
and cross-linked them from the reproducer tests:
1. **`subint_sigint_starvation_issue.md`** —
class 1: abandoned-legacy-subint thread +
shared GIL starves main trio loop →
signal-wakeup-fd pipe fills → SIGINT silently
dropped (`write() = EAGAIN`). Pytest process
un-Ctrl-C-able. Structurally a CPython limit;
blocked on `msgspec` PEP 684 support
(jcrist/msgspec#563). Reproducer:
`test_stale_entry_is_deleted[subint]`.
2. **`subint_cancel_delivery_hang_issue.md`** —
class 2: parent-side trio task parks on an
orphaned IPC channel after subint teardown;
no clean EOF delivered to waiting receiver.
Ctrl-C-able (main trio loop iterating fine).
OUR bug to fix. Candidate fix: explicit
parent-side channel abort in `subint_proc`'s
hard-kill teardown. Reproducer:
`test_subint_non_checkpointing_child`.
Test-side cross-links:
- `tests/discovery/test_registrar.py`:
`test_stale_entry_is_deleted``trio.run(main)`
wrapped in `dump_on_hang(seconds=20,
path=<per-method-tmp>)`; long inline comment
summarizes `strace` evidence + root-cause chain
and points at both docs.
- `tests/test_subint_cancellation.py`:
`test_subint_non_checkpointing_child` docstring
extended with "KNOWN ISSUE (Ctrl-C-able hang)"
section pointing at the class-2 doc + noting
the class-1 doc is NOT what this test hits.
## Files changed
- `ai/conc-anal/subint_sigint_starvation_issue.md`
— new, 205 LOC
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
— new, 161 LOC
- `tests/discovery/test_registrar.py` — +52/-1
(arm `dump_on_hang`, inline-comment cross-link)
- `tests/test_subint_cancellation.py` — +26
(docstring "KNOWN ISSUE" block)
## Human edits
Substantive collab — prose was jointly iterated:
- User framed the two-doc split, set the
classification criteria (Ctrl-C-able vs not),
and provided the `strace` evidence.
- User decided to keep `test_stale_entry_is_deleted`
un-`skip`ped (my initial suggestion was
`pytestmark.skipif(spawn_backend=='subint')`).
- User chose the candidate fix ordering for
class 2 and marked "explicit parent-side channel
abort" as the surgical preferred fix.
- User picked the file naming convention
(`subint_<hang-shape>_issue.md`) over my initial
`hang_class_{1,2}.md`.
- Assistant drafted the prose, aggregated prior-
session root-cause findings from Phase B.2/B.3
bringup, and wrote the test-side cross-linking
comments.
No further mechanical edits expected before
commit; user may still rewrap via
`scripts/rewrap.py` if preferred.