Collab between goodboy (user) and claude (this assistant) spanning multiple test-run iterations on branch subint_spawner_backend. The user ran the failing suites, captured strace evidence on the hung pytest pids, and set the direction (“these are two different hangs — write them up separately so we don’t re-confuse ourselves later”). The assistant aggregated prior-session findings (Phase B.2/B.3 bringup) into two classification docs + test-side cross-links. All prose was jointly iterated; the user had final say on framing and decided which candidate fix directions to list.
Per-file generated content
ai/conc-anal/subint_sigint_starvation_issue.md (new, 205 LOC)
git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md
Writes up the “abandoned-legacy-subint thread wedges the parent trio loop” class. Key sections:
- Symptom: test_stale_entry_is_deleted[subint] hangs indefinitely AND is un-Ctrl-C-able.
- Evidence: annotated strace excerpt showing SIGINT delivered to pytest; the C-level signal handler tries to write to the signal-wakeup-fd pipe and gets write() = -1 EAGAIN (Resource temporarily unavailable). The pipe is full because the main trio loop isn’t iterating often enough to drain it.
- Root-cause chain: our hard-kill abandons the daemon=True driver OS thread after _HARD_KILL_TIMEOUT; the subint inside that thread is still running trio.run(); _interpreters.destroy() cannot force-stop a running subint (raises InterpreterError); legacy subints share the main GIL → abandoned subint starves main trio loop → wakeup-fd fills → SIGINT silently dropped.
- Why it’s structurally a CPython limit: no public force-destroy primitive for a running subint; the only escape is per-interpreter GIL isolation, gated on msgspec PEP 684 adoption (jcrist/msgspec#563).
- Current escape hatch: harness-side SIGINT loop in the daemon fixture teardown that kills the bg registrar subproc, eventually unblocking a parent-side recv enough for the main loop to drain the wakeup pipe.
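The escape hatch in the last bullet can be pictured as a small teardown helper. The following is a minimal sketch under stated assumptions, not the actual fixture code: the name sigint_until_exit and its attempts / interval parameters are hypothetical, it assumes the bg registrar is held as a subprocess.Popen, and it assumes a POSIX host.

```python
import signal
import subprocess


def sigint_until_exit(
    proc: subprocess.Popen,
    attempts: int = 5,      # hypothetical retry budget
    interval: float = 0.5,  # hypothetical seconds to wait after each signal
) -> bool:
    """Send SIGINT repeatedly until `proc` exits; fall back to SIGKILL.

    Returns True if the process was already gone or exited after a SIGINT,
    False if the SIGKILL fallback was needed.
    """
    for _ in range(attempts):
        if proc.poll() is not None:
            return True  # already exited, nothing to do
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=interval)
            return True
        except subprocess.TimeoutExpired:
            continue  # not dead yet; send another SIGINT
    # Last resort so that teardown itself cannot hang forever.
    proc.kill()
    proc.wait()
    return False
```

Repeating the signal rather than sending it once matters here because a single SIGINT can be lost or arrive while the target is not yet in a state to act on it; the SIGKILL fallback bounds the teardown time.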
ai/conc-anal/subint_cancel_delivery_hang_issue.md (new, 161 LOC)
git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md
Writes up the sibling hang class — same subint backend, distinct root cause:
- TL;DR: Ctrl-C-able, so NOT the SIGINT-starvation class; main trio loop iterates fine; ours to fix.
- Symptom: test_subint_non_checkpointing_child hangs past the expected _HARD_KILL_TIMEOUT budget even after the subint is torn down.
- Diagnosis: a parent-side trio task (likely a chan.recv() in process_messages) parks on an orphaned IPC channel; the channel was torn down without emitting a clean EOF / BrokenResourceError to the waiting receiver.
- Candidate fix directions, listed in rough order of preference:
  - Explicit parent-side channel abort in subint_proc’s hard-kill teardown (surgical; most likely).
  - Audit process_messages to add a timeout or cancel-scope protection that catches the orphaned-recv state.
  - Wrap subint IPC channel construction in a sentinel that can force-close from the parent side regardless of subint liveness.
tests/discovery/test_registrar.py (modified, +52/-1 LOC)
git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py
Wraps the trio.run(main) call at the bottom of test_stale_entry_is_deleted in dump_on_hang(seconds=20, path=<per-method-tmp>). Adds a long inline comment that:
- Enumerates variant-by-variant status ([trio]/[mp_*] = clean; [subint] = hangs + un-Ctrl-C-able).
- Summarizes the strace evidence and root-cause chain inline (so a future reader hitting this test doesn’t need to cross-ref the doc to understand the hang shape).
- Points at ai/conc-anal/subint_sigint_starvation_issue.md for full analysis.
- Cross-links to the sibling subint_cancel_delivery_hang_issue.md so readers can tell the two classes apart.
- Explains why it’s kept un-skipped: the dump file is useful if the hang ever returns after a refactor. pytest stderr capture would otherwise eat faulthandler output, hence the file path.
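The implementation of dump_on_hang is not shown in this log. One plausible shape, assuming it builds on the stdlib faulthandler module that the comment mentions, is sketched below. The name dump_on_hang_sketch is hypothetical, chosen to avoid implying this is the real helper.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    """Dump all thread tracebacks to `path` if the body runs longer than `seconds`.

    Writing to a real file (rather than stderr) keeps the dump out of
    pytest's output capture.
    """
    with open(path, "w") as dump_file:
        # Arms a watchdog thread; exit=False means the process keeps running
        # after the dump so the hang can still be inspected.
        faulthandler.dump_traceback_later(
            seconds, repeat=False, file=dump_file, exit=False
        )
        try:
            yield path
        finally:
            # Disarm before the file is closed so a late dump cannot hit a
            # closed file descriptor.
            faulthandler.cancel_dump_traceback_later()
```

If the wrapped body finishes within the budget the file stays empty; if it overruns, the file holds a traceback for every thread at the moment the timer fired.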
tests/test_subint_cancellation.py (modified, +26 LOC)
git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py
Extends the docstring of test_subint_non_checkpointing_child with a “KNOWN ISSUE (Ctrl-C-able hang)” block:
- Describes the current hang: parent-side orphaned IPC recv after hard-kill; distinct from the SIGINT-starvation sibling class.
- Cites the distinguishing strace signal: wakeup-fd write() = 1 (not EAGAIN), i.e. the main loop is iterating.
- Points at ai/conc-anal/subint_cancel_delivery_hang_issue.md for full analysis + candidate fix directions.
- Clarifies that the other sibling doc (SIGINT-starvation) is NOT what this test hits.
Non-code output
Classification reasoning (why two docs, not one)
The user and I converged on the two-doc split after running the suites and noticing two qualitatively different hang symptoms:
- test_stale_entry_is_deleted[subint]: pytest process un-Ctrl-C-able. Ctrl-C at the terminal does nothing. Must kill -9 from another shell.
- test_subint_non_checkpointing_child: pytest process Ctrl-C-able. One Ctrl-C at the prompt unblocks cleanly and the test reports a hang via pytest-timeout.
From the user: “These cannot be the same bug. Different fix paths. Write them up separately or we’ll keep conflating them.”
strace on the [subint] hang gave the decisive signal for the first class:
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
fd 5 is Python’s signal-wakeup-fd pipe. EAGAIN on a write() of 1 byte to a pipe means the pipe buffer is full → reader side (main Python thread inside trio.run()) isn’t consuming. That’s the GIL-hostage signature.
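The pipe-full behaviour can be reproduced in isolation. The sketch below uses a plain os.pipe() as a stand-in for the wakeup fd (no signal handlers, trio, or subints involved, and the function name is hypothetical): it writes single bytes into a non-blocking pipe that nobody reads until the kernel refuses, then drains the pipe the way an iterating loop would and writes again.

```python
import errno
import os


def wakeup_write_outcomes() -> tuple[int, int]:
    """Return (errno seen once the pipe is full, bytes written after draining)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    os.set_blocking(read_fd, False)
    try:
        # Phase 1: nobody reads, so one-byte writes eventually fail.
        full_errno = 0
        try:
            while True:
                os.write(write_fd, b"\x02")
        except BlockingIOError as exc:
            full_errno = exc.errno  # EAGAIN on Linux

        # Phase 2: drain like a loop that is iterating, then write again.
        try:
            while os.read(read_fd, 65536):
                pass
        except BlockingIOError:
            pass  # pipe is empty
        written_after_drain = os.write(write_fd, b"\x02")
        return full_errno, written_after_drain
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    full_errno, written = wakeup_write_outcomes()
    print(errno.errorcode.get(full_errno), written)
```

The first value corresponds to the `-1 EAGAIN` result in the trace above; the second corresponds to a write that succeeds because the reader is keeping up.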
The second class’s strace showed write(5, "\2", 1) = 1 — clean drain — so the main trio loop was iterating and the hang had to be on the application side of things, not the kernel-↔︎-Python signal boundary.
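The distinction between the two signatures is mechanical enough to express as a small classifier over strace lines. This is an illustrative sketch only: the function name and the returned labels are hypothetical, and the default fd of 5 is simply the value seen in the captured trace, not a constant.

```python
import re
from typing import Optional

# Matches one-byte writes such as: write(5, "\2", 1) = -1 EAGAIN (...)
_ONE_BYTE_WRITE = re.compile(
    r'write\((?P<fd>\d+), "[^"]*", 1\)\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<err>E[A-Z]+))?'
)


def classify_wakeup_write(line: str, wakeup_fd: int = 5) -> Optional[str]:
    """Classify a strace line for the wakeup fd, or return None if it doesn't apply."""
    match = _ONE_BYTE_WRITE.search(line)
    if match is None or int(match["fd"]) != wakeup_fd:
        return None
    if match["ret"] == "-1" and match["err"] == "EAGAIN":
        # Pipe full: the main loop is not draining (class 1).
        return "sigint-starvation"
    if match["ret"] == "1":
        # Write accepted: the main loop is iterating, look at the app side (class 2).
        return "loop-iterating"
    return None
```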
Why the candidate fix for class 2 is “explicit parent-side channel abort”
The second hang class has the trio loop alive. A parked chan.recv() that will never get bytes is fundamentally a tractor-side resource-lifetime bug — the IPC channel was torn down (subint destroyed) but no one explicitly raised BrokenResourceError at the parent-side receiver. The subint_proc hard-kill path is the natural place to add that notification, because it already knows the subint is unreachable at that point.
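The shape of that fix can be shown with a toy model. To keep the example runnable with the standard library only, it uses asyncio in place of trio; ToyChannel, ChannelAborted (a stand-in for BrokenResourceError), and hard_kill_teardown are all hypothetical names and do not reflect tractor's actual APIs.

```python
import asyncio


class ChannelAborted(Exception):
    """Hypothetical stand-in for trio.BrokenResourceError."""


class ToyChannel:
    """Toy receive side of an IPC channel whose peer may vanish silently."""

    def __init__(self) -> None:
        self._waiters: list[asyncio.Future] = []
        self._aborted = False

    async def recv(self) -> bytes:
        if self._aborted:
            raise ChannelAborted("channel already aborted")
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        # Parks here; without abort() nothing would ever wake this up.
        return await fut

    def abort(self) -> None:
        """Parent-side abort: fail every parked receiver explicitly."""
        self._aborted = True
        for fut in self._waiters:
            if not fut.done():
                fut.set_exception(ChannelAborted("peer torn down"))
        self._waiters.clear()


async def hard_kill_teardown(chan: ToyChannel, budget: float) -> None:
    await asyncio.sleep(budget)  # stand-in for waiting out the hard-kill budget
    chan.abort()                 # the step the hanging path is missing


async def demo() -> str:
    chan = ToyChannel()
    receiver = asyncio.create_task(chan.recv())
    await hard_kill_teardown(chan, budget=0.01)
    try:
        await receiver
    except ChannelAborted:
        return "receiver woke with ChannelAborted"
    return "receiver got data"
```

Running `asyncio.run(demo())` shows the parked receiver being released by the teardown path rather than by the (now unreachable) peer. The point of the design is that the party that knows the peer is gone is the one responsible for waking the receiver.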
Alternative fix paths (blanket timeouts on process_messages, sentinel-wrapped channels) are less surgical and risk masking unrelated bugs — hence the preference ordering in the doc.
Why we’re not just patching the code now
The user explicitly deferred the fix to a later commit: “Document both classes now, land the fix for class 2 separately so the diff reviews clean.” This matches the incremental-commits preference from memory.