Add prompt-io log for `subint` hang-class docs

Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.

`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.

Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent 4a3254583b
commit a65fded4c6
@@ -0,0 +1,111 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subint-phase-b-hang-classification
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`)
scope: docs
substantive: true
raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md
---

## Prompt

Collab between user (`goodboy`) and `claude` to
disambiguate two distinct hang modes hit during
Phase B subint-spawn-backend bringup (issue #379).
The user ran the failing suites, captured `strace`
evidence on hung pytest pids, and set the framing:

> "These cannot be the same bug. Different fix
> paths. Write them up separately or we'll keep
> conflating them."

Follow-on asks:
- Cross-link each doc from its reproducer test so
  a future triager lands on the analysis without
  needing to grep `ai/conc-anal/`.
- On `test_stale_entry_is_deleted`: arm
  `dump_on_hang(seconds=20, path=...)` so a
  regression captures a stack dump (pytest stderr
  capture otherwise eats `faulthandler` output).
- Keep `test_stale_entry_is_deleted` un-`skip`ped
  so the dump file remains inspectable.

Defer the actual fix for the class-2 hang
(Ctrl-C-able, tractor-side bug) to a later commit
per the incremental-commits preference.

## Response summary

Wrote two new `ai/conc-anal/` classification docs
and cross-linked them from the reproducer tests:

1. **`subint_sigint_starvation_issue.md`** —
   class 1: abandoned-legacy-subint thread +
   shared GIL starves main trio loop →
   signal-wakeup-fd pipe fills → SIGINT silently
   dropped (`write() = EAGAIN`). Pytest process
   un-Ctrl-C-able. Structurally a CPython limit;
   blocked on `msgspec` PEP 684 support
   (jcrist/msgspec#563). Reproducer:
   `test_stale_entry_is_deleted[subint]`.

2. **`subint_cancel_delivery_hang_issue.md`** —
   class 2: parent-side trio task parks on an
   orphaned IPC channel after subint teardown;
   no clean EOF delivered to waiting receiver.
   Ctrl-C-able (main trio loop iterating fine).
   OUR bug to fix. Candidate fix: explicit
   parent-side channel abort in `subint_proc`'s
   hard-kill teardown. Reproducer:
   `test_subint_non_checkpointing_child`.

Test-side cross-links:
- `tests/discovery/test_registrar.py`:
  `test_stale_entry_is_deleted` → `trio.run(main)`
  wrapped in
  `dump_on_hang(seconds=20, path=<per-method-tmp>)`;
  long inline comment summarizes `strace` evidence
  + root-cause chain and points at both docs.
- `tests/test_subint_cancellation.py`:
  `test_subint_non_checkpointing_child` docstring
  extended with "KNOWN ISSUE (Ctrl-C-able hang)"
  section pointing at the class-2 doc + noting
  the class-1 doc is NOT what this test hits.

## Files changed

- `ai/conc-anal/subint_sigint_starvation_issue.md`
  — new, 205 LOC
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
  — new, 161 LOC
- `tests/discovery/test_registrar.py` — +52/-1
  (arm `dump_on_hang`, inline-comment cross-link)
- `tests/test_subint_cancellation.py` — +26
  (docstring "KNOWN ISSUE" block)

## Human edits

Substantive collab — prose was jointly iterated:

- User framed the two-doc split, set the
  classification criteria (Ctrl-C-able vs not),
  and provided the `strace` evidence.
- User decided to keep `test_stale_entry_is_deleted`
  un-`skip`ped (my initial suggestion was
  `pytestmark.skipif(spawn_backend=='subint')`).
- User chose the candidate fix ordering for
  class 2 and marked "explicit parent-side channel
  abort" as the surgical preferred fix.
- User picked the file naming convention
  (`subint_<hang-shape>_issue.md`) over my initial
  `hang_class_{1,2}.md`.
- Assistant drafted the prose, aggregated
  prior-session root-cause findings from Phase
  B.2/B.3 bringup, and wrote the test-side
  cross-linking comments.

No further mechanical edits expected before
commit; user may still rewrap via
`scripts/rewrap.py` if preferred.

@@ -0,0 +1,198 @@
---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
diff_cmd: git diff HEAD~1..HEAD
---

Collab between `goodboy` (user) and `claude` (this
assistant) spanning multiple test-run iterations on
branch `subint_spawner_backend`. The user ran the
failing suites, captured `strace` evidence on the
hung pytest pids, and set the direction ("these are
two different hangs — write them up separately so
we don't re-confuse ourselves later"). The assistant
aggregated prior-session findings (Phase B.2/B.3
bringup) into two classification docs + test-side
cross-links. All prose was jointly iterated; the
user had final say on framing and decided which
candidate fix directions to list.

## Per-file generated content

### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`

Writes up the "abandoned-legacy-subint thread wedges
the parent trio loop" class. Key sections:

- **Symptom** — `test_stale_entry_is_deleted[subint]`
  hangs indefinitely AND is un-Ctrl-C-able.
- **Evidence** — annotated `strace` excerpt showing
  SIGINT delivered to pytest, C-level signal handler
  tries to write to the signal-wakeup-fd pipe, gets
  `write() = -1 EAGAIN (Resource temporarily unavailable)`.
  Pipe is full because main trio loop isn't
  iterating often enough to drain it.
- **Root-cause chain** — our hard-kill abandons the
  `daemon=True` driver OS thread after
  `_HARD_KILL_TIMEOUT`; the subint *inside* that
  thread is still running `trio.run()`;
  `_interpreters.destroy()` cannot force-stop a
  running subint (raises `InterpreterError`); legacy
  subints share the main GIL → abandoned subint
  starves main trio loop → wakeup-fd fills → SIGINT
  silently dropped.
- **Why it's structurally a CPython limit** — no
  public force-destroy primitive for a running
  subint; the only escape is per-interpreter GIL
  isolation, gated on msgspec PEP 684 adoption
  (jcrist/msgspec#563).
- **Current escape hatch** — harness-side SIGINT
  loop in the `daemon` fixture teardown that kills
  the bg registrar subproc, eventually unblocking
  a parent-side recv enough for the main loop to
  drain the wakeup pipe.

### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`

Writes up the *sibling* hang class — same subint
backend, distinct root cause:

- **TL;DR** — Ctrl-C-able, so NOT the
  SIGINT-starvation class; main trio loop iterates
  fine; ours to fix.
- **Symptom** — `test_subint_non_checkpointing_child`
  hangs past the expected `_HARD_KILL_TIMEOUT`
  budget even after the subint is torn down.
- **Diagnosis** — a parent-side trio task (likely
  a `chan.recv()` in `process_messages`) parks on
  an orphaned IPC channel; channel was torn down
  without emitting a clean EOF /
  `BrokenResourceError` to the waiting receiver.
- **Candidate fix directions** — listed in rough
  order of preference:
  1. Explicit parent-side channel abort in
     `subint_proc`'s hard-kill teardown (surgical;
     most likely).
  2. Audit `process_messages` to add a timeout or
     cancel-scope protection that catches the
     orphaned-recv state.
  3. Wrap subint IPC channel construction in a
     sentinel that can force-close from the parent
     side regardless of subint liveness.

### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)

> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`

Wraps the `trio.run(main)` call at the bottom of
`test_stale_entry_is_deleted` in
`dump_on_hang(seconds=20, path=<per-method-tmp>)`.
Adds a long inline comment that:
- Enumerates variant-by-variant status
  (`[trio]`/`[mp_*]` = clean; `[subint]` = hangs +
  un-Ctrl-C-able)
- Summarizes the `strace` evidence and root-cause
  chain inline (so a future reader hitting this
  test doesn't need to cross-ref the doc to
  understand the hang shape)
- Points at
  `ai/conc-anal/subint_sigint_starvation_issue.md`
  for full analysis
- Cross-links to the *sibling*
  `subint_cancel_delivery_hang_issue.md` so
  readers can tell the two classes apart
- Explains why it's kept un-`skip`ped: the dump
  file is useful if the hang ever returns after
  a refactor. pytest stderr capture would
  otherwise eat `faulthandler` output, hence the
  file path.
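
For readers who have not seen the helper, here is a
minimal sketch of what a `dump_on_hang`-style wrapper
could look like. It is an assumption built on the
stdlib `faulthandler` module, not the suite's actual
implementation; only the `seconds` and `path`
parameter names come from the call shown above.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def dump_on_hang(seconds=20, *, path):
    """Dump all thread stacks to `path` if the body outlives `seconds`.

    Hypothetical re-implementation for illustration only.
    """
    # Write to a real file: pytest's stderr capture would otherwise
    # swallow whatever faulthandler prints.
    with open(path, "w") as dump_file:
        faulthandler.dump_traceback_later(
            seconds,
            repeat=False,
            file=dump_file,
            exit=False,  # only record the stacks, never kill the process
        )
        try:
            yield dump_file
        finally:
            # Body returned (or raised) in time: disarm the watchdog.
            faulthandler.cancel_dump_traceback_later()
```

Usage would mirror the test: `with dump_on_hang(seconds=20,
path=some_tmp_file): trio.run(main)`. The dump is written
by a watchdog thread that does not need the GIL, which is
why this style of capture can still fire when the main
thread is starved.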

### `tests/test_subint_cancellation.py` (modified, +26 LOC)

> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`

Extends the docstring of
`test_subint_non_checkpointing_child` with a
"KNOWN ISSUE (Ctrl-C-able hang)" block:
- Describes the current hang: parent-side orphaned
  IPC recv after hard-kill; distinct from the
  SIGINT-starvation sibling class.
- Cites `strace` distinguishing signal: wakeup-fd
  `write() = 1` (not `EAGAIN`) — i.e. main loop
  iterating.
- Points at
  `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
  for full analysis + candidate fix directions.
- Clarifies that the *other* sibling doc
  (SIGINT-starvation) is NOT what this test hits.
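
The `strace` distinguishing signal can be checked
mechanically when triaging a captured trace. The
helper below is a hypothetical sketch, not part of
the repo: the function name and return labels are
made up, and it only inspects the 1-byte
wakeup-fd `write()` line in the form quoted in
this log.

```python
import re

# Matches the 1-byte wakeup-fd write that strace prints after a signal, e.g.
#   write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
#   write(5, "\2", 1) = 1
_WAKEUP_WRITE = re.compile(
    r'^write\(\d+, "[^"]*", 1\)\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<err>[A-Z]+))?'
)


def classify_wakeup_write(line):
    """Label one strace line as 'starved', 'draining', or 'unknown'."""
    match = _WAKEUP_WRITE.match(line.strip())
    if match is None:
        return "unknown"
    if match["ret"] == "-1" and match["err"] == "EAGAIN":
        # Wakeup fd is full: the main loop is not consuming (class-1 shape).
        return "starved"
    if match["ret"] == "1":
        # Byte accepted: the main loop is iterating (consistent with class 2).
        return "draining"
    return "unknown"
```

A `"draining"` result only rules out the
SIGINT-starvation class; it does not by itself
prove the orphaned-recv diagnosis.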

## Non-code output

### Classification reasoning (why two docs, not one)

The user and I converged on the two-doc split after
running the suites and noticing two *qualitatively
different* hang symptoms:

1. `test_stale_entry_is_deleted[subint]` — pytest
   process un-Ctrl-C-able. Ctrl-C at the terminal
   does nothing. Must `kill -9` from another shell.
2. `test_subint_non_checkpointing_child` — pytest
   process Ctrl-C-able. One Ctrl-C at the prompt
   unblocks cleanly and the test reports a hang
   via pytest-timeout.

From the user: "These cannot be the same bug.
Different fix paths. Write them up separately or
we'll keep conflating them."

`strace` on the `[subint]` hang gave the decisive
signal for the first class:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
```

fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN`
on a `write()` of 1 byte to a pipe means the pipe
buffer is full → reader side (main Python thread
inside `trio.run()`) isn't consuming. That's the
GIL-hostage signature.

The second class's `strace` showed
`write(5, "\2", 1) = 1` — clean drain — so the
main trio loop was iterating and the hang had to
be on the application side of things, not the
kernel-↔-Python signal boundary.
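
Both `write()` outcomes are easy to reproduce in
isolation. The sketch below assumes a POSIX system
and uses a plain non-blocking `os.pipe()` as a
stand-in for the wakeup fd (the function name is
made up); it shows the 1-byte write failing with
`EAGAIN` once nobody reads, and succeeding again
as soon as the reader drains.

```python
import errno
import os


def fill_then_drain():
    """Return (got_eagain_when_full, bytes_written_after_drain)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)
    try:
        # Nobody reads: keep writing the same 1-byte token a signal
        # handler would write, until the kernel buffer is full.
        got_eagain = False
        while not got_eagain:
            try:
                os.write(write_fd, b"\x02")
            except BlockingIOError as exc:
                got_eagain = exc.errno == errno.EAGAIN

        # The "main loop" wakes up and drains everything that is queued.
        while True:
            try:
                if not os.read(read_fd, 65536):
                    break
            except BlockingIOError:
                break  # pipe is empty again

        # With a live reader the same 1-byte write is accepted.
        written = os.write(write_fd, b"\x02")
        return got_eagain, written
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The first element corresponds to the class-1 trace
(`= -1 EAGAIN`), the second to the class-2 trace
(`= 1`).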

### Why the candidate fix for class 2 is "explicit parent-side channel abort"

The second hang class has the trio loop alive. A
parked `chan.recv()` that will never get bytes is
fundamentally a tractor-side resource-lifetime bug
— the IPC channel was torn down (subint destroyed)
but no one explicitly raised
`BrokenResourceError` at the parent-side receiver.
The `subint_proc` hard-kill path is the natural
place to add that notification, because it already
knows the subint is unreachable at that point.

Alternative fix paths (blanket timeouts on
`process_messages`, sentinel-wrapped channels) are
less surgical and risk masking unrelated bugs —
hence the preference ordering in the doc.
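
The hang shape and the effect of an explicit abort
can be illustrated without tractor at all. This is
a stdlib-only sketch under stated assumptions: a
`socket.socketpair()` stands in for the IPC
channel, a thread stands in for the parked trio
task, and closing the peer end stands in for
whatever abort call the `subint_proc` teardown
would make. None of these names are tractor APIs.

```python
import socket
import threading


def parked_recv_demo(park_for=0.2):
    """Return (was_parked_before_abort, value_seen_after_abort)."""
    parent_end, child_end = socket.socketpair()
    received = []

    def receiver():
        # Blocks until the peer sends data or the stream reaches EOF.
        received.append(parent_end.recv(1))

    worker = threading.Thread(target=receiver, daemon=True)
    worker.start()

    # Teardown with no notification: the receiver just stays parked.
    worker.join(timeout=park_for)
    was_parked = worker.is_alive()

    # Explicit abort from the teardown path: closing the peer end
    # delivers EOF, so recv() returns b"" and the receiver can exit.
    child_end.close()
    worker.join(timeout=2.0)
    parent_end.close()
    return was_parked, (received[0] if received else None)
```

The receiver only makes progress once some
teardown step actively signals end-of-stream,
which is the notification the hard-kill path
currently never sends.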

### Why we're not just patching the code now

The user explicitly deferred the fix to a later
commit: "Document both classes now, land the fix
for class 2 separately so the diff reviews clean."
This matches the incremental-commits preference
from memory.