tractor/ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_p...


Collab between goodboy (user) and claude (this assistant) spanning multiple test-run iterations on branch subint_spawner_backend. The user ran the failing suites, captured strace evidence on the hung pytest pids, and set the direction (“these are two different hangs; write them up separately so we don't re-confuse ourselves later”). The assistant aggregated prior-session findings (Phase B.2/B.3 bringup) into two classification docs + test-side cross-links. All prose was jointly iterated; the user had final say on framing and decided which candidate fix directions to list.

Per-file generated content

ai/conc-anal/subint_sigint_starvation_issue.md (new, 205 LOC)

git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md

Writes up the “abandoned-legacy-subint thread wedges the parent trio loop” class. Key sections:

  • Symptom: test_stale_entry_is_deleted[subint] hangs indefinitely AND is un-Ctrl-C-able.
  • Evidence: annotated strace excerpt showing SIGINT delivered to pytest; the C-level signal handler tries to write to the signal-wakeup-fd pipe and gets write() = -1 EAGAIN (Resource temporarily unavailable). The pipe is full because the main trio loop isn't iterating often enough to drain it.
  • Root-cause chain: our hard-kill abandons the daemon=True driver OS thread after _HARD_KILL_TIMEOUT; the subint inside that thread is still running trio.run(); _interpreters.destroy() cannot force-stop a running subint (raises InterpreterError); legacy subints share the main GIL → abandoned subint starves main trio loop → wakeup-fd fills → SIGINT silently dropped.
  • Why it's structurally a CPython limit: no public force-destroy primitive for a running subint; the only escape is per-interpreter GIL isolation, gated on msgspec PEP 684 adoption (jcrist/msgspec#563).
  • Current escape hatch: harness-side SIGINT loop in the daemon fixture teardown that kills the bg registrar subproc, eventually unblocking a parent-side recv enough for the main loop to drain the wakeup pipe.
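The escape hatch in the last bullet can be pictured roughly as follows. This is only a sketch of the general "re-send SIGINT until the subprocess exits" pattern, assuming a POSIX host; `sigint_until_exit`, its `interval`/`max_attempts` parameters, and the sleeping child process are all hypothetical stand-ins, not the actual fixture code.

```python
import signal
import subprocess
import sys
import time


def sigint_until_exit(proc, interval=0.1, max_attempts=50):
    """Send SIGINT repeatedly until `proc` exits (hypothetical helper).

    Returns the number of SIGINTs sent. Falls back to SIGKILL if the
    process is still alive after `max_attempts` signals.
    """
    for attempt in range(1, max_attempts + 1):
        if proc.poll() is not None:
            return attempt - 1  # already exited before this attempt
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            continue  # still running, signal again
    if proc.poll() is None:
        proc.kill()  # last resort so teardown cannot hang forever
        proc.wait()
    return max_attempts


# Stand-in for the background registrar subprocess.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)
attempts = sigint_until_exit(child)
print(attempts, child.returncode)
```

The SIGKILL fallback is a design choice of this sketch: it bounds teardown time even if the child never reacts to SIGINT.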

ai/conc-anal/subint_cancel_delivery_hang_issue.md (new, 161 LOC)

git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md

Writes up the sibling hang class — same subint backend, distinct root cause:

  • TL;DR: Ctrl-C-able, so NOT the SIGINT-starvation class; main trio loop iterates fine; ours to fix.
  • Symptom: test_subint_non_checkpointing_child hangs past the expected _HARD_KILL_TIMEOUT budget even after the subint is torn down.
  • Diagnosis: a parent-side trio task (likely a chan.recv() in process_messages) parks on an orphaned IPC channel; the channel was torn down without emitting a clean EOF / BrokenResourceError to the waiting receiver.
  • Candidate fix directions, listed in rough order of preference:
    1. Explicit parent-side channel abort in subint_proc's hard-kill teardown (surgical; most likely).
    2. Audit process_messages to add a timeout or cancel-scope protection that catches the orphaned-recv state.
    3. Wrap subint IPC channel construction in a sentinel that can force-close from the parent side regardless of subint liveness.

tests/discovery/test_registrar.py (modified, +52/-1 LOC)

git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py

Wraps the trio.run(main) call at the bottom of test_stale_entry_is_deleted in dump_on_hang(seconds=20, path=<per-method-tmp>). Adds a long inline comment that:

  • Enumerates variant-by-variant status ([trio]/[mp_*] = clean; [subint] = hangs + un-Ctrl-C-able).
  • Summarizes the strace evidence and root-cause chain inline (so a future reader hitting this test doesn't need to cross-ref the doc to understand the hang shape).
  • Points at ai/conc-anal/subint_sigint_starvation_issue.md for full analysis.
  • Cross-links to the sibling subint_cancel_delivery_hang_issue.md so readers can tell the two classes apart.
  • Explains why it's kept un-skipped: the dump file is useful if the hang ever returns after a refactor. pytest stderr capture would otherwise eat faulthandler output, hence the file path.
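For readers unfamiliar with the pattern, here is a minimal sketch of how a `dump_on_hang`-style context manager can be built on the stdlib `faulthandler` module. The real helper's implementation is not shown in this log, so the body below is an assumption; only the name and the `seconds`/`path` parameters come from the call site above. The 0.2 s timeout and the `time.sleep()` stand-in are illustrative.

```python
import contextlib
import faulthandler
import os
import tempfile
import time


@contextlib.contextmanager
def dump_on_hang(seconds, path):
    """Dump all thread tracebacks to `path` if the body runs longer
    than `seconds` (hypothetical reimplementation)."""
    with open(path, "w") as dump_file:
        # Writing to a real file sidesteps pytest's stderr capture.
        faulthandler.dump_traceback_later(
            seconds, repeat=False, file=dump_file, exit=False
        )
        try:
            yield dump_file
        finally:
            # Disarm the watchdog before the file is closed.
            faulthandler.cancel_dump_traceback_later()


dump_path = os.path.join(tempfile.mkdtemp(), "hang_dump.txt")
with dump_on_hang(seconds=0.2, path=dump_path):
    time.sleep(0.6)  # stand-in for a trio.run(main) call that hangs

with open(dump_path) as f:
    dump_text = f.read()
print(dump_text)
```

If the body finishes before the timeout, the watchdog is cancelled and the file stays empty, so an empty dump file indicates a clean run.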

tests/test_subint_cancellation.py (modified, +26 LOC)

git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py

Extends the docstring of test_subint_non_checkpointing_child with a “KNOWN ISSUE (Ctrl-C-able hang)” block that:

  • Describes the current hang: parent-side orphaned IPC recv after hard-kill; distinct from the SIGINT-starvation sibling class.
  • Cites the distinguishing strace signal: wakeup-fd write() = 1 (not EAGAIN), i.e. the main loop is iterating.
  • Points at ai/conc-anal/subint_cancel_delivery_hang_issue.md for full analysis + candidate fix directions.
  • Clarifies that the other sibling doc (SIGINT-starvation) is NOT what this test hits.

Non-code output

Classification reasoning (why two docs, not one)

The user and I converged on the two-doc split after running the suites and noticing two qualitatively different hang symptoms:

  1. test_stale_entry_is_deleted[subint]: pytest process is un-Ctrl-C-able. Ctrl-C at the terminal does nothing; it must be killed with kill -9 from another shell.
  2. test_subint_non_checkpointing_child: pytest process is Ctrl-C-able. One Ctrl-C at the prompt unblocks cleanly and the test reports a hang via pytest-timeout.

From the user: “These cannot be the same bug. Different fix paths. Write them up separately or we'll keep conflating them.”

strace on the [subint] hang gave the decisive signal for the first class:

--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)

fd 5 is Python's signal-wakeup-fd pipe. EAGAIN on a 1-byte write() to a pipe means the pipe buffer is full → the reader side (main Python thread inside trio.run()) isn't consuming. That's the GIL-hostage signature.
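The "full pipe returns EAGAIN" behavior is easy to reproduce in isolation. The sketch below is not the wakeup-fd machinery itself; it only assumes a plain non-blocking `os.pipe()` that nobody reads from, and writes the same single byte (`\x02`, the SIGINT signal number) until the kernel refuses. `fill_pipe_until_eagain` is a hypothetical helper name.

```python
import errno
import os


def fill_pipe_until_eagain():
    """Write 1-byte chunks into an undrained, non-blocking pipe and
    return (bytes_written, errno) once the kernel rejects a write."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                # Same payload shape as the strace line: one byte, "\2".
                written += os.write(write_fd, b"\x02")
            except BlockingIOError as exc:
                # Nobody is reading, so the buffer eventually fills.
                return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


bytes_written, err = fill_pipe_until_eagain()
print(bytes_written, errno.errorcode[err])
```

The number of bytes accepted before the failure depends on the platform's pipe capacity, so it is printed rather than assumed.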

The second class's strace showed write(5, "\2", 1) = 1 (the write succeeds, so the pipe is being drained), which means the main trio loop was iterating and the hang had to be on the application side of things, not at the kernel ↔ Python signal boundary.
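The triage rule described above can be written down as a small classifier over strace output. This is a sketch under assumptions: the regex only handles the line shape quoted in this log, the default `wakeup_fd=5` is specific to these runs, and the label strings are hypothetical names borrowed from the two doc filenames.

```python
import re

# Matches e.g.: write(5, "\2", 1) = -1 EAGAIN (Resource temporarily ...)
WAKEUP_WRITE = re.compile(
    r'write\((?P<fd>\d+), "\\2", 1\)\s*=\s*(?P<result>-?\d+)'
    r'(?:\s+(?P<errno>[A-Z]+))?'
)


def classify_hang(strace_text, wakeup_fd=5):
    """Classify a hang from wakeup-fd writes seen in strace output."""
    saw_clean_write = False
    for match in WAKEUP_WRITE.finditer(strace_text):
        if int(match["fd"]) != wakeup_fd:
            continue
        if match["result"] == "-1" and match["errno"] == "EAGAIN":
            # Pipe is full: the main loop is not draining it.
            return "sigint-starvation"
        if match["result"] == "1":
            saw_clean_write = True
    # A successful write means the loop is alive; look elsewhere.
    return "cancel-delivery" if saw_clean_write else "unknown"


starved = r'''--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
'''
healthy = r'write(5, "\2", 1) = 1'

print(classify_hang(starved))
print(classify_hang(healthy))
```

A single EAGAIN is treated as decisive here because it directly shows the wakeup pipe was full at signal time.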

Why the candidate fix for class 2 is “explicit parent-side channel abort”

The second hang class has the trio loop alive. A parked chan.recv() that will never get bytes is fundamentally a tractor-side resource-lifetime bug — the IPC channel was torn down (subint destroyed) but no one explicitly raised BrokenResourceError at the parent-side receiver. The subint_proc hard-kill path is the natural place to add that notification, because it already knows the subint is unreachable at that point.
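To make the shape of that fix concrete, here is a toy model of an explicit parent-side abort. It deliberately uses stdlib `asyncio` rather than trio so it runs without third-party packages, and nothing in it is tractor's API: `ToyChannel`, `ChannelAborted` (a stand-in for `BrokenResourceError`), `parent_msg_loop`, and `hard_kill_teardown` are all hypothetical names.

```python
import asyncio


class ChannelAborted(Exception):
    """Hypothetical stand-in for trio's BrokenResourceError."""


class ToyChannel:
    """Toy parent-side channel: recv() parks until a message arrives."""

    def __init__(self):
        self._queue = asyncio.Queue()

    async def recv(self):
        item = await self._queue.get()
        if isinstance(item, ChannelAborted):
            raise item
        return item

    def abort(self):
        # Wake any parked receiver with an error instead of leaving
        # it waiting for bytes that will never arrive.
        self._queue.put_nowait(ChannelAborted("peer torn down"))


async def parent_msg_loop(chan):
    try:
        while True:
            await chan.recv()
    except ChannelAborted:
        return "receiver woke up: channel aborted"


async def hard_kill_teardown(chan):
    # ... the unreachable subint would be abandoned here ...
    chan.abort()  # the explicit notification this sketch is about


async def demo():
    chan = ToyChannel()
    receiver = asyncio.create_task(parent_msg_loop(chan))
    await asyncio.sleep(0.05)
    assert not receiver.done()  # parked, as in the hang
    await hard_kill_teardown(chan)
    return await asyncio.wait_for(receiver, timeout=1.0)


result = asyncio.run(demo())
print(result)
```

The point of the model is the ordering: the teardown path, which already knows the peer is gone, is the one that wakes the receiver.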

Alternative fix paths (blanket timeouts on process_messages, sentinel-wrapped channels) are less surgical and risk masking unrelated bugs — hence the preference ordering in the doc.

Why we're not just patching the code now

The user explicitly deferred the fix to a later commit: “Document both classes now, land the fix for class 2 separately so the diff reviews clean.” This matches the incremental-commits preference from memory.