---
model: claude-opus-4-7[1m]
service: claude
timestamp: 2026-04-20T19:27:39Z
git_ref: HEAD (pre-commit; will land on branch `subint_spawner_backend`)
diff_cmd: git diff HEAD~1..HEAD
---

Collab between `goodboy` (user) and `claude` (this
assistant) spanning multiple test-run iterations on
branch `subint_spawner_backend`. The user ran the
failing suites, captured `strace` evidence on the
hung pytest pids, and set the direction ("these are
two different hangs; write them up separately so
we don't re-confuse ourselves later"). The assistant
aggregated prior-session findings (Phase B.2/B.3
bringup) into two classification docs + test-side
cross-links. All prose was jointly iterated; the
user had final say on framing and decided which
candidate fix directions to list.

## Per-file generated content

### `ai/conc-anal/subint_sigint_starvation_issue.md` (new, 205 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_sigint_starvation_issue.md`

Writes up the "abandoned-legacy-subint thread wedges
the parent trio loop" class. Key sections:

- **Symptom**: `test_stale_entry_is_deleted[subint]`
  hangs indefinitely AND is un-Ctrl-C-able.
- **Evidence**: annotated `strace` excerpt showing
  SIGINT delivered to pytest, C-level signal handler
  tries to write to the signal-wakeup-fd pipe, gets
  `write() = -1 EAGAIN (Resource temporarily
  unavailable)`. Pipe is full because main trio loop
  isn't iterating often enough to drain it.
- **Root-cause chain**: our hard-kill abandons the
  `daemon=True` driver OS thread after
  `_HARD_KILL_TIMEOUT`; the subint *inside* that
  thread is still running `trio.run()`;
  `_interpreters.destroy()` cannot force-stop a
  running subint (raises `InterpreterError`); legacy
  subints share the main GIL → abandoned subint
  starves main trio loop → wakeup-fd fills → SIGINT
  silently dropped.
- **Why it's structurally a CPython limit**: no
  public force-destroy primitive for a running
  subint; the only escape is per-interpreter GIL
  isolation, gated on msgspec PEP 684 adoption
  (jcrist/msgspec#563).
- **Current escape hatch**: harness-side SIGINT
  loop in the `daemon` fixture teardown that kills
  the bg registrar subproc, eventually unblocking
  a parent-side recv enough for the main loop to
  drain the wakeup pipe.
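
The `EAGAIN` in the **Evidence** bullet can be reproduced in isolation. The following is a minimal sketch, not taken from the tractor code base: it uses a plain non-blocking `os.pipe()` as a stand-in for the signal-wakeup fd, and the helper names `fill_until_eagain` and `drain` are hypothetical. With no reader consuming, one-byte writes eventually fail with `EAGAIN`; once the read side is drained, the same write succeeds again.

```python
import errno
import os


def fill_until_eagain(write_fd: int) -> tuple[int, int]:
    # Write one byte at a time (value 2 is SIGINT's signal number) until
    # the non-blocking pipe refuses more. Returns (bytes_written, errno).
    written = 0
    while True:
        try:
            written += os.write(write_fd, b"\x02")
        except BlockingIOError as exc:
            return written, exc.errno


def drain(read_fd: int) -> int:
    # Read everything currently buffered and return the byte count.
    total = 0
    while True:
        try:
            chunk = os.read(read_fd, 65536)
        except BlockingIOError:
            return total
        if not chunk:
            return total
        total += len(chunk)


read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)
os.set_blocking(write_fd, False)

# Nobody reads: this models a main loop that never gets to iterate.
written, err = fill_until_eagain(write_fd)

# The reader finally runs: the pipe empties and writes work again.
drained = drain(read_fd)
after_drain = os.write(write_fd, b"\x02")

os.close(read_fd)
os.close(write_fd)
print(f"filled after {written} bytes, errno={errno.errorcode[err]}")
```

The pipe capacity (and therefore the value of `written`) is platform-dependent, so the sketch does not assume a specific number.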

### `ai/conc-anal/subint_cancel_delivery_hang_issue.md` (new, 161 LOC)

> `git diff HEAD~1..HEAD -- ai/conc-anal/subint_cancel_delivery_hang_issue.md`

Writes up the *sibling* hang class (same subint
backend, distinct root cause):

- **TL;DR**: Ctrl-C-able, so NOT the
  SIGINT-starvation class; main trio loop iterates
  fine; ours to fix.
- **Symptom**: `test_subint_non_checkpointing_child`
  hangs past the expected `_HARD_KILL_TIMEOUT`
  budget even after the subint is torn down.
- **Diagnosis**: a parent-side trio task (likely
  a `chan.recv()` in `process_messages`) parks on
  an orphaned IPC channel; channel was torn down
  without emitting a clean EOF /
  `BrokenResourceError` to the waiting receiver.
- **Candidate fix directions**: listed in rough
  order of preference:
  1. Explicit parent-side channel abort in
     `subint_proc`'s hard-kill teardown (surgical;
     most likely).
  2. Audit `process_messages` to add a timeout or
     cancel-scope protection that catches the
     orphaned-recv state.
  3. Wrap subint IPC channel construction in a
     sentinel that can force-close from the parent
     side regardless of subint liveness.

### `tests/discovery/test_registrar.py` (modified, +52/-1 LOC)

> `git diff HEAD~1..HEAD -- tests/discovery/test_registrar.py`

Wraps the `trio.run(main)` call at the bottom of
`test_stale_entry_is_deleted` in
`dump_on_hang(seconds=20, path=<per-method-tmp>)`.
Adds a long inline comment that:

- Enumerates variant-by-variant status
  (`[trio]`/`[mp_*]` = clean; `[subint]` = hangs +
  un-Ctrl-C-able)
- Summarizes the `strace` evidence and root-cause
  chain inline (so a future reader hitting this
  test doesn't need to cross-ref the doc to
  understand the hang shape)
- Points at
  `ai/conc-anal/subint_sigint_starvation_issue.md`
  for full analysis
- Cross-links to the *sibling*
  `subint_cancel_delivery_hang_issue.md` so
  readers can tell the two classes apart
- Explains why it's kept un-`skip`ped: the dump
  file is useful if the hang ever returns after
  a refactor. pytest stderr capture would
  otherwise eat `faulthandler` output, hence the
  file path.
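
For readers unfamiliar with the dump-to-file pattern, here is a minimal sketch of how a helper with this shape could be built on the stdlib `faulthandler` module. It is an assumption-laden stand-in, not the project's actual `dump_on_hang`: the name `dump_on_hang_sketch` and its body are hypothetical, and only the `seconds=` / `path=` call shape comes from the text above.

```python
import contextlib
import faulthandler
import pathlib
import tempfile
import time


@contextlib.contextmanager
def dump_on_hang_sketch(seconds: float, path: str):
    # Hypothetical stand-in, NOT the repo's real `dump_on_hang`.
    # The file must stay open until the watchdog is cancelled, because
    # faulthandler writes to its file descriptor from a watchdog thread.
    with open(path, "w") as dump_file:
        faulthandler.dump_traceback_later(seconds, file=dump_file, exit=False)
        try:
            yield path
        finally:
            # Disarm the watchdog if the body finished (or raised) in time.
            faulthandler.cancel_dump_traceback_later()


# Usage: the body "hangs" longer than the budget, so a traceback of all
# threads lands in the file instead of on (possibly captured) stderr.
dump_path = str(pathlib.Path(tempfile.mkdtemp()) / "hang_dump.txt")
with dump_on_hang_sketch(seconds=0.2, path=dump_path):
    time.sleep(0.6)

dump_text = pathlib.Path(dump_path).read_text()
print(dump_text)
```

One design caveat of this approach: `faulthandler` keeps a single watchdog per process, so a later `dump_traceback_later()` call replaces an earlier one rather than stacking with it.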

### `tests/test_subint_cancellation.py` (modified, +26 LOC)

> `git diff HEAD~1..HEAD -- tests/test_subint_cancellation.py`

Extends the docstring of
`test_subint_non_checkpointing_child` with a
"KNOWN ISSUE (Ctrl-C-able hang)" block:

- Describes the current hang: parent-side orphaned
  IPC recv after hard-kill; distinct from the
  SIGINT-starvation sibling class.
- Cites `strace` distinguishing signal: wakeup-fd
  `write() = 1` (not `EAGAIN`), i.e. main loop
  iterating.
- Points at
  `ai/conc-anal/subint_cancel_delivery_hang_issue.md`
  for full analysis + candidate fix directions.
- Clarifies that the *other* sibling doc
  (SIGINT-starvation) is NOT what this test hits.

## Non-code output

### Classification reasoning (why two docs, not one)

The user and I converged on the two-doc split after
running the suites and noticing two *qualitatively
different* hang symptoms:

1. `test_stale_entry_is_deleted[subint]`: pytest
   process un-Ctrl-C-able. Ctrl-C at the terminal
   does nothing. Must `kill -9` from another shell.
2. `test_subint_non_checkpointing_child`: pytest
   process Ctrl-C-able. One Ctrl-C at the prompt
   unblocks cleanly and the test reports a hang
   via pytest-timeout.

From the user: "These cannot be the same bug.
Different fix paths. Write them up separately or
we'll keep conflating them."

`strace` on the `[subint]` hang gave the decisive
signal for the first class:

```
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
```

fd 5 is Python's signal-wakeup-fd pipe. `EAGAIN`
on a `write()` of 1 byte to a pipe means the pipe
buffer is full → reader side (main Python thread
inside `trio.run()`) isn't consuming. That's the
GIL-hostage signature.

The second class's `strace` showed `write(5, "\2",
1) = 1`, i.e. the pipe had room because it was
being drained, so the main trio loop was iterating
and the hang had to be on the application side,
not at the kernel↔Python signal boundary.
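
The two `strace` signatures above are mechanical enough to check in code. The following is a small, hypothetical triage helper (the function name, regex, and return labels are illustrative, not part of the repo) that classifies a single `strace` line by the result of the wakeup-fd write. The default `wakeup_fd=5` is taken from the excerpt; the real fd number can differ per process.

```python
import re

# Matches the one-byte wakeup-fd write that follows a SIGINT, e.g.
#   write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)
#   write(5, "\2", 1) = 1
WAKEUP_WRITE = re.compile(
    r'write\((?P<fd>\d+), "\\2", 1\)\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<err>[A-Z]+))?'
)


def classify_wakeup_write(strace_line: str, wakeup_fd: int = 5) -> str:
    # Labels are hypothetical shorthand for the two hang classes.
    match = WAKEUP_WRITE.search(strace_line)
    if match is None or int(match["fd"]) != wakeup_fd:
        return "unrelated"
    if match["ret"] == "-1" and match["err"] == "EAGAIN":
        # Pipe full: the main loop is not draining it (class 1).
        return "sigint-starvation"
    if match["ret"] == "1":
        # Write succeeded: the main loop is iterating (class 2).
        return "loop-iterating"
    return "unknown"


# Raw strings keep the backslash in "\2" exactly as strace prints it.
starved = classify_wakeup_write(
    r'write(5, "\2", 1) = -1 EAGAIN (Resource temporarily unavailable)'
)
iterating = classify_wakeup_write(r'write(5, "\2", 1) = 1')
other = classify_wakeup_write("--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---")
print(starved, iterating, other)
```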

### Why the candidate fix for class 2 is "explicit parent-side channel abort"

The second hang class has the trio loop alive. A
parked `chan.recv()` that will never get bytes is
fundamentally a tractor-side resource-lifetime bug:
the IPC channel was torn down (subint destroyed)
but no one explicitly raised
`BrokenResourceError` at the parent-side receiver.
The `subint_proc` hard-kill path is the natural
place to add that notification, because it already
knows the subint is unreachable at that point.

Alternative fix paths (blanket timeouts on
`process_messages`, sentinel-wrapped channels) are
less surgical and risk masking unrelated bugs,
hence the preference ordering in the doc.
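
The shape of the preferred fix (the teardown path explicitly waking the parked receiver) can be illustrated with a stdlib-only analogy. This sketch deliberately swaps in a `socket.socketpair()` and a thread for tractor's IPC channel and trio task, so it shows the idea rather than the actual `subint_proc` change; all names in it are hypothetical. The peer end is left open and silent, so no EOF would ever arrive on its own, and the parent side shuts down its own end to unpark the receiver. Here the receiver sees a clean EOF (empty bytes); in the trio setting described above the equivalent outcome would be an error such as `BrokenResourceError`.

```python
import socket
import threading

# parent_end plays the parent-side channel; child_end plays the unreachable
# subint's end, which never sends and is never closed during the "hang".
parent_end, child_end = socket.socketpair()
parent_end.settimeout(5.0)  # safety net so the sketch itself cannot hang
result = {}


def parked_receiver() -> None:
    # Stand-in for a parent-side task blocked in a channel receive.
    try:
        result["data"] = parent_end.recv(1)
    except OSError as exc:
        result["error"] = exc


receiver = threading.Thread(target=parked_receiver)
receiver.start()

# "Hard-kill teardown": we already know the peer is unreachable, so abort
# the parent side explicitly instead of waiting for bytes that never come.
parent_end.shutdown(socket.SHUT_RDWR)
receiver.join(timeout=5.0)
receiver_done = not receiver.is_alive()

parent_end.close()
child_end.close()
print(result, receiver_done)
```

The behaviour relied on here (a local `shutdown()` waking a blocked `recv()` with EOF) holds for Unix-domain stream sockets on Linux; treat other platforms as unverified.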

### Why we're not just patching the code now

The user explicitly deferred the fix to a later
commit: "Document both classes now, land the fix
for class 2 separately so the diff reviews clean."
This matches the incremental-commits preference
from memory.