Bound peer-clear wait in `async_main` finally
Fifth diagnostic pass pinpointed the hang to `async_main`'s finally
block — every stuck actor reaches `FINALLY ENTER` but never
`RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler is stuck: the
`_no_more_peers` Event is set only when `server._peers` empties, and
stuck handlers keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` plus a warning log on
timeout that records the still-connected peer count. 3s is enough for
any graceful cancel-ack round-trip; beyond that we're in bug territory
and need to proceed with local teardown so the parent's
`_ForkedProc.wait()` can unblock. Defense-in-depth regardless of the
underlying bug — a local finally shouldn't block on remote cooperation
forever.

Verified: with this fix, ALL 15 actors reach `async_main: RETURNING`
(up from 10/15 before). The test still hangs past 45s though — there's
at least one MORE unbounded wait downstream of `async_main`. Candidates
are enumerated in the doc update (`open_root_actor` finally /
`actor.cancel()` internals / `trio.run` bg tasks / `_serve_ipc_eps`
finally). The skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates `subint_forkserver_test_cancellation_leak_issue.md` with
the new pinpoint plus a summary of the 6-item investigation win list:

1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck socket recv,
   capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded wait in the teardown
   cascade above `async_main`

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
subint_forkserver_backend
parent 4d0555435b
commit e312a68d8a
@@ -635,6 +635,93 @@ asymmetric. Not purely depth-determined. Some race
 condition in nursery teardown when multiple
 siblings error simultaneously.
+
+## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
+
+Further DIAGDEBUG at every milestone in `async_main`
+(runtime UP / EXITED service_tn / EXITED root_tn /
+FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
+ENTER/RETURNED per-pidfd. Result:
+
+**Every stuck actor reaches `async_main: FINALLY
+ENTER` but NOT `async_main: RETURNING`.**
+
+That isolates the hang to a specific await in
+`async_main`'s finally block at
+`tractor/runtime/_runtime.py:1837+`. The suspect:
+
+```python
+# Ensure all peers (actors connected to us as clients) are finished
+if (ipc_server := actor.ipc_server) and ipc_server.has_peers(check_chans=True):
+    ...
+    await ipc_server.wait_for_no_more_peers()  # ← UNBOUNDED, blocks forever
+```
+
+`_no_more_peers` is an `Event` set only when
+`server._peers` empties (see
+`ipc/_server.py:526-530`). If ANY peer-handler is
+stuck (the 5 unclosed loops from the earlier pass),
+it keeps its channel in `server._peers`, so the
+event never fires, so the wait hangs.
+
+### Applied fix (partial, landed as defense-in-depth)
+
+`tractor/runtime/_runtime.py:1981` — the
+`wait_for_no_more_peers()` call is now wrapped in
+`trio.move_on_after(3.0)` plus a warning log when the
+timeout fires. Commented with the full rationale.
+
+**Verified:** with this fix, ALL 15 actors reach
+`async_main: RETURNING` cleanly (up from 10/15
+reaching the end before).
+
+**Unfortunately:** the test still hangs past 45s
+total — meaning there's YET ANOTHER unbounded wait
+downstream of `async_main`. The bounded
+`wait_for_no_more_peers` unblocks one level, but
+the cascade has another level above it.
+
+### Candidates for the remaining hang
+
+1. `open_root_actor`'s own finally / post-`async_main`
+   flow in `_root.py` — specifically
+   `await actor.cancel(None)`, which has its own
+   internal waits.
+2. `trio.run()` itself doesn't return even after
+   the root task completes because trio's nursery
+   still has background tasks running.
+3. Maybe `_serve_ipc_eps`'s finally has an await
+   that blocks when peers aren't clearing.
+
+### Current stance
+
+- Defensive `wait_for_no_more_peers` bound landed
+  (good hygiene regardless); it reveals a real
+  deadlock-avoidance gap in tractor's cleanup.
+- Test still hangs → skip-mark restored on
+  `test_nested_multierrors[subint_forkserver]`.
+- The full chain of unbounded waits needs another
+  session of drilling, probably at the
+  `open_root_actor` / `actor.cancel` level.
+
+### Summary of this investigation's wins
+
+1. **FD hygiene fix** (`_close_inherited_fds`) —
+   correct, closed the orphan-SIGINT sibling issue.
+2. **pidfd-based `_ForkedProc.wait`** — cancellable,
+   matches the trio_proc pattern.
+3. **`_parent_chan_cs` wiring** — `Actor.cancel()`
+   now breaks the shielded parent-chan
+   `process_messages` loop.
+4. **`wait_for_no_more_peers` bounded** — prevents
+   the actor-level finally hang.
+5. **Ruled-out hypotheses:** tree-kill missing
+   (wrong), stuck socket recv (wrong),
+   capture-pipe fill (wrong).
+6. **Pinpointed remaining unknown:** at least one
+   more unbounded wait in the teardown cascade
+   above `async_main`. Concrete candidates are
+   enumerated above.
+
 ## Stopgap (landed)
 
 `test_nested_multierrors` skip-marked under
@@ -1973,7 +1973,25 @@ async def async_main(
             f' {pformat(ipc_server._peers)}'
         )
         log.runtime(teardown_report)
-        await ipc_server.wait_for_no_more_peers()
+        # NOTE: bound the peer-clear wait — otherwise if any
+        # peer-channel handler is stuck (e.g. never got its
+        # cancel propagated due to a runtime bug), this wait
+        # blocks forever and deadlocks the whole actor-tree
+        # teardown cascade. 3s is enough for any graceful
+        # cancel-ack round-trip; beyond that we're in bug
+        # territory and need to proceed with local teardown
+        # so the parent's `_ForkedProc.wait()` can unblock.
+        # See `ai/conc-anal/
+        # subint_forkserver_test_cancellation_leak_issue.md`
+        # for the full diagnosis.
+        with trio.move_on_after(3.0) as _peers_cs:
+            await ipc_server.wait_for_no_more_peers()
+        if _peers_cs.cancelled_caught:
+            teardown_report += (
+                '-> TIMED OUT waiting for peers to clear '
+                f'({len(ipc_server._peers)} still connected)\n'
+            )
+            log.warning(teardown_report)
         teardown_report += (
             '-]> all peer channels are complete.\n'