Add prompt-io log for `subint` hang-class docs

Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.

`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.

Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
Gud Boi 2026-04-20 16:04:19 -04:00
parent 4a3254583b
commit a65fded4c6
2 changed files with 309 additions and 0 deletions

View File

@ -0,0 +1,111 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subint-phase-b-hang-classification
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`)
scope: docs
substantive: true
raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md
---
## Prompt
Collab between user (`goodboy`) and `claude` to
disambiguate two distinct hang modes hit during
Phase B subint-spawn-backend bringup (issue #379).
The user ran the failing suites, captured `strace`
evidence on hung pytest pids, and set the framing:
> "These cannot be the same bug. Different fix
> paths. Write them up separately or we'll keep
> conflating them."
Follow-on asks:
- Cross-link each doc from its reproducer test so
a future triager lands on the analysis without
needing to grep `ai/conc-anal/`.
- On `test_stale_entry_is_deleted`: arm
`dump_on_hang(seconds=20, path=...)` so a
regression captures a stack dump (pytest stderr
capture otherwise eats `faulthandler` output).
- Keep `test_stale_entry_is_deleted` un-`skip`ped
so the dump file remains inspectable.
Defer the actual fix for the class-2 hang
(Ctrl-C-able, tractor-side bug) to a later commit
per the incremental-commits preference.
## Response summary
Wrote two new `ai/conc-anal/` classification docs
and cross-linked them from the reproducer tests:
1. **`subint_sigint_starvation_issue.md`** —
class 1: abandoned-legacy-subint thread +
shared GIL starves main trio loop →
signal-wakeup-fd pipe fills → SIGINT silently
dropped (`write() = EAGAIN`). Pytest process
un-Ctrl-C-able. Structurally a CPython limit;
blocked on `msgspec` PEP 684 support
(jcrist/msgspec#563). Reproducer:
`test_stale_entry_is_deleted[subint]`.
2. **`subint_cancel_delivery_hang_issue.md`** —
class 2: parent-side trio task parks on an
orphaned IPC channel after subint teardown;
no clean EOF delivered to waiting receiver.
Ctrl-C-able (main trio loop iterating fine).
OUR bug to fix. Candidate fix: explicit
parent-side channel abort in `subint_proc`'s
hard-kill teardown. Reproducer:
`test_subint_non_checkpointing_child`.
Test-side cross-links:
- `tests/discovery/test_registrar.py`:
`test_stale_entry_is_deleted``trio.run(main)`
wrapped in `dump_on_hang(seconds=20,
path=<per-method-tmp>)`; long inline comment
summarizes `strace` evidence + root-cause chain
and points at both docs.
- `tests/test_subint_cancellation.py`:
`test_subint_non_checkpointing_child` docstring
extended with "KNOWN ISSUE (Ctrl-C-able hang)"
section pointing at the class-2 doc + noting
the class-1 doc is NOT what this test hits.
## Files changed
- `ai/conc-anal/subint_sigint_starvation_issue.md`
— new, 205 LOC
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
— new, 161 LOC
- `tests/discovery/test_registrar.py` — +52/-1
(arm `dump_on_hang`, inline-comment cross-link)
- `tests/test_subint_cancellation.py` — +26
(docstring "KNOWN ISSUE" block)
## Human edits
Substantive collab — prose was jointly iterated:
- User framed the two-doc split, set the
classification criteria (Ctrl-C-able vs not),
and provided the `strace` evidence.
- User decided to keep `test_stale_entry_is_deleted`
un-`skip`ped (my initial suggestion was
`pytestmark.skipif(spawn_backend=='subint')`).
- User chose the candidate fix ordering for
class 2 and marked "explicit parent-side channel
abort" as the surgical preferred fix.
- User picked the file naming convention
(`subint_<hang-shape>_issue.md`) over my initial
`hang_class_{1,2}.md`.
- Assistant drafted the prose, aggregated prior-
session root-cause findings from Phase B.2/B.3
bringup, and wrote the test-side cross-linking
comments.
No further mechanical edits expected before
commit; user may still rewrap via
`scripts/rewrap.py` if preferred.

View File

@ -0,0 +1,198 @@
---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
diff_cmd: git diff HEAD~1..HEAD
---
Collab between `goodboy` (user) and `claude` (this
assistant) spanning multiple test-run iterations on
branch `subint_spawner_backend`. The user ran the
failing suites, captured `strace` evidence on the
hung pytest pids, and set the direction ("these are
two different hangs — write them up separately so
we don't re-confuse ourselves later"). The assistant
aggregated prior-session findings (Phase B.2/B.3
bringup) into two classification docs + test-side
cross-links. All prose was jointly iterated; the
user had final say on framing and decided which
candidate fix directions to list.
## Per-file generated content
### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)
> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`
Writes up the "abandoned-legacy-subint thread wedges
the parent trio loop" class. Key sections:
- **Symptom**`test_stale_entry_is_deleted[subint]`
hangs indefinitely AND is un-Ctrl-C-able.
- **Evidence** — annotated `strace` excerpt showing
SIGINT delivered to pytest, C-level signal handler
tries to write to the signal-wakeup-fd pipe, gets
`write() = -1 EAGAIN (Resource temporarily
unavailable)`. Pipe is full because main trio loop
isn't iterating often enough to drain it.
- **Root-cause chain** — our hard-kill abandons the
`daemon=True` driver OS thread after
`_HARD_KILL_TIMEOUT`; the subint *inside* that
thread is still running `trio.run()`;
`_interpreters.destroy()` cannot force-stop a
running subint (raises `InterpreterError`); legacy
subints share the main GIL → abandoned subint
starves main trio loop → wakeup-fd fills → SIGINT
silently dropped.
- **Why it's structurally a CPython limit** — no
public force-destroy primitive for a running
subint; the only escape is per-interpreter GIL
isolation, gated on msgspec PEP 684 adoption
(jcrist/msgspec#563).
- **Current escape hatch** — harness-side SIGINT
loop in the `daemon` fixture teardown that kills
the bg registrar subproc, eventually unblocking
a parent-side recv enough for the main loop to
drain the wakeup pipe.
### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)
> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`
Writes up the *sibling* hang class — same subint
backend, distinct root cause:
- **TL;DR** — Ctrl-C-able, so NOT the SIGINT-
starvation class; main trio loop iterates fine;
ours to fix.
- **Symptom**`test_subint_non_checkpointing_child`
hangs past the expected `_HARD_KILL_TIMEOUT`
budget even after the subint is torn down.
- **Diagnosis** — a parent-side trio task (likely
a `chan.recv()` in `process_messages`) parks on
an orphaned IPC channel; channel was torn down
without emitting a clean EOF /
`BrokenResourceError` to the waiting receiver.
- **Candidate fix directions** — listed in rough
order of preference:
1. Explicit parent-side channel abort in
`subint_proc`'s hard-kill teardown (surgical;
most likely).
2. Audit `process_messages` to add a timeout or
cancel-scope protection that catches the
orphaned-recv state.
3. Wrap subint IPC channel construction in a
sentinel that can force-close from the parent
side regardless of subint liveness.
### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)
> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`
Wraps the `trio.run(main)` call at the bottom of
`test_stale_entry_is_deleted` in
`dump_on_hang(seconds=20, path=<per-method-tmp>)`.
Adds a long inline comment that:
- Enumerates variant-by-variant status
(`[trio]`/`[mp_*]` = clean; `[subint]` = hangs
+ un-Ctrl-C-able)
- Summarizes the `strace` evidence and root-cause
chain inline (so a future reader hitting this
test doesn't need to cross-ref the doc to
understand the hang shape)
- Points at
`ai/conc-anal/subint_sigint_starvation_issue.md`
for full analysis
- Cross-links to the *sibling*
`subint_cancel_delivery_hang_issue.md` so
readers can tell the two classes apart
- Explains why it's kept un-`skip`ped: the dump
file is useful if the hang ever returns after
a refactor. pytest stderr capture would
otherwise eat `faulthandler` output, hence the
file path.
### `tests/test_subint_cancellation.py` (modified, +26 LOC)
> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`
Extends the docstring of
`test_subint_non_checkpointing_child` with a
"KNOWN ISSUE (Ctrl-C-able hang)" block:
- Describes the current hang: parent-side orphaned
IPC recv after hard-kill; distinct from the
SIGINT-starvation sibling class.
- Cites `strace` distinguishing signal: wakeup-fd
`write() = 1` (not `EAGAIN`) — i.e. main loop
iterating.
- Points at
`ai/conc-anal/subint_cancel_delivery_hang_issue.md`
for full analysis + candidate fix directions.
- Clarifies that the *other* sibling doc
(SIGINT-starvation) is NOT what this test hits.
## Non-code output
### Classification reasoning (why two docs, not one)
The user and I converged on the two-doc split after
running the suites and noticing two *qualitatively
different* hang symptoms:
1. `test_stale_entry_is_deleted[subint]` — pytest
process un-Ctrl-C-able. Ctrl-C at the terminal
does nothing. Must kill-9 from another shell.
2. `test_subint_non_checkpointing_child` — pytest
process Ctrl-C-able. One Ctrl-C at the prompt
unblocks cleanly and the test reports a hang
via pytest-timeout.
From the user: "These cannot be the same bug.
Different fix paths. Write them up separately or
we'll keep conflating them."
`strace` on the `[subint]` hang gave the decisive
signal for the first class:
```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
```
fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN`
on a `write()` of 1 byte to a pipe means the pipe
buffer is full → reader side (main Python thread
inside `trio.run()`) isn't consuming. That's the
GIL-hostage signature.
The second class's `strace` showed `write(5, "\2",
1) = 1` — clean drain — so the main trio loop was
iterating and the hang had to be on the application
side of things, not the kernel-↔-Python signal
boundary.
### Why the candidate fix for class 2 is "explicit parent-side channel abort"
The second hang class has the trio loop alive. A
parked `chan.recv()` that will never get bytes is
fundamentally a tractor-side resource-lifetime bug
— the IPC channel was torn down (subint destroyed)
but no one explicitly raised
`BrokenResourceError` at the parent-side receiver.
The `subint_proc` hard-kill path is the natural
place to add that notification, because it already
knows the subint is unreachable at that point.
Alternative fix paths (blanket timeouts on
`process_messages`, sentinel-wrapped channels) are
less surgical and risk masking unrelated bugs —
hence the preference ordering in the doc.
### Why we're not just patching the code now
The user explicitly deferred the fix to a later
commit: "Document both classes now, land the fix
for class 2 separately so the diff reviews clean."
This matches the incremental-commits preference
from memory.