Add prompt-io log for `subint` hang-class docs
Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.
`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd2~1..e92e3cd2 -- <path>` rather than re-embedding content that
already lives in `git log -p`.
Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Branch: subint_spawner_backend
Parent: 35796ec8ae
Commit: 489dc6d0cc
@@ -0,0 +1,111 @@
---
model: claude-opus-4-7[1m]
service: claude
session: subint-phase-b-hang-classification
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; on branch `subint_spawner_backend`)
scope: docs
substantive: true
raw_file: 20260420T192739Z_5e8cd8b2_prompt_io.raw.md
---

## Prompt

Collab between user (`goodboy`) and `claude` to
disambiguate two distinct hang modes hit during
Phase B subint-spawn-backend bringup (issue #379).
The user ran the failing suites, captured `strace`
evidence on hung pytest pids, and set the framing:

> "These cannot be the same bug. Different fix
> paths. Write them up separately or we'll keep
> conflating them."

Follow-on asks:
- Cross-link each doc from its reproducer test so
  a future triager lands on the analysis without
  needing to grep `ai/conc-anal/`.
- On `test_stale_entry_is_deleted`: arm
  `dump_on_hang(seconds=20, path=...)` so a
  regression captures a stack dump (pytest stderr
  capture otherwise eats `faulthandler` output).
- Keep `test_stale_entry_is_deleted` un-`skip`ped
  so the dump file remains inspectable.

Defer the actual fix for the class-2 hang
(Ctrl-C-able, tractor-side bug) to a later commit
per the incremental-commits preference.

## Response summary

Wrote two new `ai/conc-anal/` classification docs
and cross-linked them from the reproducer tests:

1. **`subint_sigint_starvation_issue.md`** —
   class 1: abandoned-legacy-subint thread +
   shared GIL starves main trio loop →
   signal-wakeup-fd pipe fills → SIGINT silently
   dropped (`write() = EAGAIN`). Pytest process
   un-Ctrl-C-able. Structurally a CPython limit;
   blocked on `msgspec` PEP 684 support
   (jcrist/msgspec#563). Reproducer:
   `test_stale_entry_is_deleted[subint]`.

2. **`subint_cancel_delivery_hang_issue.md`** —
   class 2: parent-side trio task parks on an
   orphaned IPC channel after subint teardown;
   no clean EOF delivered to waiting receiver.
   Ctrl-C-able (main trio loop iterating fine).
   OUR bug to fix. Candidate fix: explicit
   parent-side channel abort in `subint_proc`'s
   hard-kill teardown. Reproducer:
   `test_subint_non_checkpointing_child`.

Test-side cross-links:
- `tests/discovery/test_registrar.py`:
  `test_stale_entry_is_deleted` → `trio.run(main)`
  wrapped in `dump_on_hang(seconds=20,
  path=<per-method-tmp>)`; long inline comment
  summarizes `strace` evidence + root-cause chain
  and points at both docs.
- `tests/test_subint_cancellation.py`:
  `test_subint_non_checkpointing_child` docstring
  extended with "KNOWN ISSUE (Ctrl-C-able hang)"
  section pointing at the class-2 doc + noting
  the class-1 doc is NOT what this test hits.

## Files changed

- `ai/conc-anal/subint_sigint_starvation_issue.md`
  — new, 205 LOC
- `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
  — new, 161 LOC
- `tests/discovery/test_registrar.py` — +52/-1
  (arm `dump_on_hang`, inline-comment cross-link)
- `tests/test_subint_cancellation.py` — +26
  (docstring "KNOWN ISSUE" block)

## Human edits

Substantive collab — prose was jointly iterated:

- User framed the two-doc split, set the
  classification criteria (Ctrl-C-able vs not),
  and provided the `strace` evidence.
- User decided to keep `test_stale_entry_is_deleted`
  un-`skip`ped (my initial suggestion was
  `pytestmark.skipif(spawn_backend=='subint')`).
- User chose the candidate fix ordering for
  class 2 and marked "explicit parent-side channel
  abort" as the surgical preferred fix.
- User picked the file naming convention
  (`subint_<hang-shape>_issue.md`) over my initial
  `hang_class_{1,2}.md`.
- Assistant drafted the prose, aggregated prior-
  session root-cause findings from Phase B.2/B.3
  bringup, and wrote the test-side cross-linking
  comments.

No further mechanical edits expected before
commit; user may still rewrap via
`scripts/rewrap.py` if preferred.

@@ -0,0 +1,198 @@
---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
diff_cmd: git diff HEAD~1..HEAD
---

Collab between `goodboy` (user) and `claude` (this
assistant) spanning multiple test-run iterations on
branch `subint_spawner_backend`. The user ran the
failing suites, captured `strace` evidence on the
hung pytest pids, and set the direction ("these are
two different hangs — write them up separately so
we don't re-confuse ourselves later"). The assistant
aggregated prior-session findings (Phase B.2/B.3
bringup) into two classification docs + test-side
cross-links. All prose was jointly iterated; the
user had final say on framing and decided which
candidate fix directions to list.

## Per-file generated content

### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`

Writes up the "abandoned-legacy-subint thread wedges
the parent trio loop" class. Key sections:

- **Symptom** — `test_stale_entry_is_deleted[subint]`
  hangs indefinitely AND is un-Ctrl-C-able.
- **Evidence** — annotated `strace` excerpt showing
  SIGINT delivered to pytest, C-level signal handler
  tries to write to the signal-wakeup-fd pipe, gets
  `write() = -1 EAGAIN (Resource temporarily
  unavailable)`. Pipe is full because main trio loop
  isn't iterating often enough to drain it.
- **Root-cause chain** — our hard-kill abandons the
  `daemon=True` driver OS thread after
  `_HARD_KILL_TIMEOUT`; the subint *inside* that
  thread is still running `trio.run()`;
  `_interpreters.destroy()` cannot force-stop a
  running subint (raises `InterpreterError`); legacy
  subints share the main GIL → abandoned subint
  starves main trio loop → wakeup-fd fills → SIGINT
  silently dropped.
- **Why it's structurally a CPython limit** — no
  public force-destroy primitive for a running
  subint; the only escape is per-interpreter GIL
  isolation, gated on msgspec PEP 684 adoption
  (jcrist/msgspec#563).
- **Current escape hatch** — harness-side SIGINT
  loop in the `daemon` fixture teardown that kills
  the bg registrar subproc, eventually unblocking
  a parent-side recv enough for the main loop to
  drain the wakeup pipe.

### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`

Writes up the *sibling* hang class — same subint
backend, distinct root cause:

- **TL;DR** — Ctrl-C-able, so NOT the SIGINT-
  starvation class; main trio loop iterates fine;
  ours to fix.
- **Symptom** — `test_subint_non_checkpointing_child`
  hangs past the expected `_HARD_KILL_TIMEOUT`
  budget even after the subint is torn down.
- **Diagnosis** — a parent-side trio task (likely
  a `chan.recv()` in `process_messages`) parks on
  an orphaned IPC channel; channel was torn down
  without emitting a clean EOF /
  `BrokenResourceError` to the waiting receiver.
- **Candidate fix directions** — listed in rough
  order of preference:
  1. Explicit parent-side channel abort in
     `subint_proc`'s hard-kill teardown (surgical;
     most likely).
  2. Audit `process_messages` to add a timeout or
     cancel-scope protection that catches the
     orphaned-recv state.
  3. Wrap subint IPC channel construction in a
     sentinel that can force-close from the parent
     side regardless of subint liveness.

### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)

> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`

Wraps the `trio.run(main)` call at the bottom of
`test_stale_entry_is_deleted` in
`dump_on_hang(seconds=20, path=<per-method-tmp>)`.
Adds a long inline comment that:
- Enumerates variant-by-variant status
  (`[trio]`/`[mp_*]` = clean; `[subint]` = hangs
  + un-Ctrl-C-able)
- Summarizes the `strace` evidence and root-cause
  chain inline (so a future reader hitting this
  test doesn't need to cross-ref the doc to
  understand the hang shape)
- Points at
  `ai/conc-anal/subint_sigint_starvation_issue.md`
  for full analysis
- Cross-links to the *sibling*
  `subint_cancel_delivery_hang_issue.md` so
  readers can tell the two classes apart
- Explains why it's kept un-`skip`ped: the dump
  file is useful if the hang ever returns after
  a refactor. pytest stderr capture would
  otherwise eat `faulthandler` output, hence the
  file path.
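
The `dump_on_hang` helper itself is not shown in this log. As a rough illustration of the idea only, a watchdog of this shape could be built on the stdlib `faulthandler` module; the name `dump_on_hang_sketch` and its body are assumptions made for this sketch, not the repo's implementation:

```python
import contextlib
import faulthandler
import time
from pathlib import Path


@contextlib.contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    # Hypothetical stand-in for the repo's `dump_on_hang` helper.
    # Writing to a real file sidesteps pytest's stderr capture.
    with open(path, "w") as dump_file:
        # Arm a watchdog: if the body is still running after
        # `seconds`, all thread tracebacks go to `dump_file`.
        faulthandler.dump_traceback_later(
            seconds, repeat=False, file=dump_file, exit=False
        )
        try:
            yield
        finally:
            # Disarm before the file closes so a late dump
            # cannot target a closed file descriptor.
            faulthandler.cancel_dump_traceback_later()


def run_with_watchdog(body, seconds: float, path: str) -> str:
    # Run `body` under the watchdog and return whatever was dumped.
    with dump_on_hang_sketch(seconds, path):
        body()
    return Path(path).read_text()
```

A body that finishes inside the budget leaves the dump file empty; one that overruns leaves a traceback for every thread, which is what a triager would inspect after a regression.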

### `tests/test_subint_cancellation.py` (modified, +26 LOC)

> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`

Extends the docstring of
`test_subint_non_checkpointing_child` with a
"KNOWN ISSUE (Ctrl-C-able hang)" block:
- Describes the current hang: parent-side orphaned
  IPC recv after hard-kill; distinct from the
  SIGINT-starvation sibling class.
- Cites `strace` distinguishing signal: wakeup-fd
  `write() = 1` (not `EAGAIN`) — i.e. main loop
  iterating.
- Points at
  `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
  for full analysis + candidate fix directions.
- Clarifies that the *other* sibling doc
  (SIGINT-starvation) is NOT what this test hits.

## Non-code output

### Classification reasoning (why two docs, not one)

The user and I converged on the two-doc split after
running the suites and noticing two *qualitatively
different* hang symptoms:

1. `test_stale_entry_is_deleted[subint]` — pytest
   process un-Ctrl-C-able. Ctrl-C at the terminal
   does nothing. Must kill-9 from another shell.
2. `test_subint_non_checkpointing_child` — pytest
   process Ctrl-C-able. One Ctrl-C at the prompt
   unblocks cleanly and the test reports a hang
   via pytest-timeout.

From the user: "These cannot be the same bug.
Different fix paths. Write them up separately or
we'll keep conflating them."

`strace` on the `[subint]` hang gave the decisive
signal for the first class:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
```

fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN`
on a `write()` of 1 byte to a pipe means the pipe
buffer is full → reader side (main Python thread
inside `trio.run()`) isn't consuming. That's the
GIL-hostage signature.
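
The pipe-full behaviour behind that `EAGAIN` can be reproduced in isolation. The sketch below uses a plain `os.pipe()` rather than Python's real signal-wakeup fd, and the function name is invented for illustration; it only shows that a non-blocking write to an undrained pipe eventually fails with `EAGAIN`, and that writes succeed again once the reader drains:

```python
import errno
import os


def fill_then_drain_pipe(chunk: bytes = b"\x02" * 1024):
    """Return (bytes_written_before_EAGAIN, errno_seen, bytes_written_after_drain)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # the wakeup fd is non-blocking too
    written = 0
    seen_errno = None
    try:
        # Nobody reads, so the kernel buffer eventually fills up.
        while seen_errno is None:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError as exc:
                seen_errno = exc.errno  # EAGAIN: pipe is full
        # Play the role of a healthy event loop: drain everything.
        remaining = written
        while remaining:
            remaining -= len(os.read(read_fd, 65536))
        # With the pipe empty, a 1-byte write goes through again.
        after_drain = os.write(write_fd, b"\x02")
        return written, seen_errno, after_drain
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The first two return values correspond to the class-1 trace (full pipe, `EAGAIN`); the last corresponds to a `write(...) = 1` line, where the reader is keeping up.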

The second class's `strace` showed `write(5, "\2",
1) = 1` — clean drain — so the main trio loop was
iterating and the hang had to be on the application
side of things, not the kernel-↔-Python signal
boundary.
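
For triage, the two `strace` signatures can be told apart mechanically. The helper below is a hypothetical convenience written for this note (its name, regex and return labels are all assumptions, and it is not part of the repo); it only inspects the return value of the wakeup-fd write:

```python
import re

# Matches e.g.: write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
# The fd number varies between runs, so it is not hard-coded.
_WAKEUP_WRITE = re.compile(r'write\(\d+, "\\2", 1\)\s*=\s*(-?\d+)(?:\s+(\w+))?')


def classify_wakeup_write(strace_line: str) -> str:
    """Label one strace line as 'sigint-starvation', 'loop-draining' or 'unknown'."""
    match = _WAKEUP_WRITE.search(strace_line)
    if match is None:
        return "unknown"
    ret, err = int(match.group(1)), match.group(2)
    if ret == -1 and err == "EAGAIN":
        # Wakeup pipe is full: the main loop is not consuming.
        return "sigint-starvation"
    if ret == 1:
        # Byte accepted: the main loop is iterating, look elsewhere.
        return "loop-draining"
    return "unknown"
```

A `loop-draining` result only rules out the starvation class; it does not by itself prove the orphaned-recv diagnosis.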

### Why the candidate fix for class 2 is "explicit parent-side channel abort"

The second hang class has the trio loop alive. A
parked `chan.recv()` that will never get bytes is
fundamentally a tractor-side resource-lifetime bug
— the IPC channel was torn down (subint destroyed)
but no one explicitly raised
`BrokenResourceError` at the parent-side receiver.
The `subint_proc` hard-kill path is the natural
place to add that notification, because it already
knows the subint is unreachable at that point.

Alternative fix paths (blanket timeouts on
`process_messages`, sentinel-wrapped channels) are
less surgical and risk masking unrelated bugs —
hence the preference ordering in the doc.
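
The shape of that preferred fix can be shown with a stdlib-only analogy. This sketch deliberately swaps trio tasks and tractor channels for a thread and a `socket.socketpair()`, so every name in it is an assumption; it demonstrates only that a receiver parked on a silent peer is released once the parent side explicitly shuts its own end down:

```python
import socket
import threading


def abort_unparks_receiver(park_for: float = 0.2, wait: float = 2.0) -> dict:
    parent_end, child_end = socket.socketpair()
    outcome = {}

    def receiver():
        # Parks here: the peer never sends and never closes,
        # mimicking a recv on an orphaned channel.
        outcome["data"] = parent_end.recv(1024)

    task = threading.Thread(target=receiver, daemon=True)
    task.start()
    task.join(park_for)
    outcome["parked_before_abort"] = task.is_alive()

    # Explicit parent-side abort: shutting down our own end
    # makes the blocked recv() return EOF (b"").
    parent_end.shutdown(socket.SHUT_RDWR)
    task.join(wait)
    outcome["parked_after_abort"] = task.is_alive()

    parent_end.close()
    child_end.close()
    return outcome
```

The abort needs no cooperation from the peer, which is the property the hard-kill path relies on when the subint is already unreachable.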

### Why we're not just patching the code now

The user explicitly deferred the fix to a later
commit: "Document both classes now, land the fix
for class 2 separately so the diff reviews clean."
This matches the incremental-commits preference
from memory.