Refine `subint_forkserver` cancel-cascade diag
Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.
Ruled out:

- **capture-pipe fill** (`-s` flag changes test):
  retested explicitly — `test_nested_multierrors`
  hangs identically with and without `-s`. The
  earlier observation was likely a competing
  pytest process I had running in another session
  holding registry state.
- **stuck peer-chan recv that cancel can't
  break**: pivot from the prior pass. With
  `handle_stream_from_peer` instrumented at ENTER
  / `except trio.Cancelled:` / finally: 40
  ENTERs, ZERO `trio.Cancelled` hits. Cancel never
  reaches those tasks at all — the recvs are
  fine, nothing is telling them to stop.
Actual deadlock shape: multi-level mutual wait:

    root blocks on spawner.wait()
      spawner blocks on grandchild.wait()
        grandchild blocks on errorer.wait()
          errorer Actor.cancel() ran, but proc never exits
`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs: it only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.
Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
shielded loop's recv is stuck in a way cancel
still can't break
2. `async_main`'s post-cancel unwind has other
tasks in `root_tn` awaiting something that
never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
`_child_target` never returns
Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
confirm whether `child_target()` returns /
`os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
equivalent level to spot divergence
No code changes — diagnosis-only.
(this commit msg was generated in part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
branch: `subint_forkserver_backend` · parent: `458a35cf09` · commit: `ab86f7613d`
Either way, the sync-close hypothesis is **ruled out**. Reverted the experiment, restored the skip-mark on the test.

### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior

Tested explicitly: both with and without `-s`, the test hangs identically. So the capture-pipe-fill hypothesis is **ruled out** for this test.

The earlier `test_context_stream_semantics.py` `-s` observation was most likely caused by a competing pytest run in my session (confirmed via process list — my leftover pytest was alive at that time and could have been holding state on the default registry port).
## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER

**New diagnostic run** instrumented `handle_stream_from_peer` at ENTER / `except trio.Cancelled:` / finally, plus `Actor.cancel()` just before `self._parent_chan_cs.cancel()`. Result:

- **40 `handle_stream_from_peer` ENTERs**.
- **0 `except trio.Cancelled:` hits** — cancel never fires on any peer-handler.
- **35 finally hits** — those handlers exit via peer-initiated EOF (normal return), NOT cancel.
- **5 handlers never reach finally** — stuck forever.
- **`Actor.cancel()` fired in 12 PIDs** — but the PIDs with peer handlers that DIDN'T fire `Actor.cancel` are exactly **root + 2 direct spawners**. These 3 actors have peer handlers (for their own subactors) that stay stuck because **`Actor.cancel()` at these levels never runs**.
### The actual deadlock shape

`Actor.cancel()` lives in `open_root_actor.__aexit__` / `async_main` teardown. That only runs when the enclosing `async with tractor.open_nursery()` exits. The nursery's `__aexit__` calls the backend `*_proc` spawn target's teardown, which does `soft_kill() → _ForkedProc.wait()` on its child PID. That wait is trio-cancellable via pidfd now (good) — but nothing CANCELS it because the outer scope only cancels when `Actor.cancel()` runs, which only runs when the nursery completes, which waits on the child.

It's a **multi-level mutual wait**:

```
root blocks on spawner.wait()
  spawner blocks on grandchild.wait()
    grandchild blocks on errorer.wait()
      errorer Actor.cancel() ran, but process
      may not have fully exited yet
      (something in root_tn holding on?)
```

Each level waits for the level below. The bottom level (errorer) reaches `Actor.cancel()`, but its process may not fully exit — meaning its pidfd doesn't go readable, meaning the grandchild's waitpid doesn't return, meaning the grandchild's nursery doesn't unwind, etc. all the way up.
### Refined question

**Why does an errorer process not exit after its `Actor.cancel()` completes?**

Possibilities:

1. `_parent_chan_cs.cancel()` fires (shielded parent-chan loop unshielded), but the task is stuck INSIDE the shielded loop's recv in a way that cancel still can't break.
2. After `Actor.cancel()` returns, `async_main` still has other tasks in `root_tn` waiting for something that never arrives (e.g. outbound IPC reply delivery).
3. The `os._exit(rc)` in `_worker` (at `_subint_forkserver.py`) doesn't run because `_child_target` never returns.
Next-session candidate probes (in priority order):

1. **Instrument `_worker`'s fork-child branch** to confirm whether `child_target()` returns (and thus `os._exit(rc)` is reached) for errorer PIDs. If yes → process should die; if no → trace back into `_actor_child_main` / `_trio_main` / `async_main` to find the stuck spot.
2. **Instrument `async_main`'s final unwind** to see which await in the teardown doesn't complete.
3. **Compare under `trio_proc` backend** at the same `_worker`-equivalent level to see where the flows diverge.
### Rule-out: NOT a stuck peer-chan recv

Earlier hypothesis was that the 5 stuck peer-chan loops were blocked on a socket recv that cancel couldn't interrupt. This pass revealed the real cause: cancel **never reaches those tasks** because their owning actor's `Actor.cancel()` never runs. The recvs are fine — they're just parked because nothing is telling them to stop.
## Stopgap (landed)