From 41813ac9a00fbb16297ff503ac01407c3f74a54c Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 22:34:49 -0400
Subject: [PATCH] Bound peer-clear wait in `async_main` finally
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fifth diagnostic pass pinpointed the hang to `async_main`'s finally
block: every stuck actor reaches `FINALLY ENTER` but never
`RETURNING`.

Specifically `await ipc_server.wait_for_no_more_peers()` never
returns when a peer-channel handler is stuck: the `_no_more_peers`
Event is set only when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a warning-log on
timeout that records the still-connected peer count. 3s is enough
for any graceful cancel-ack round-trip; beyond that we're in bug
territory and need to proceed with local teardown so the parent's
`_ForkedProc.wait()` can unblock. Defense-in-depth regardless of
the underlying bug: a local finally shouldn't block on remote
cooperation forever.

Verified: with this fix, ALL 15 actors reach `async_main: RETURNING`
(up from 10/15 before).

Test still hangs past 45s though: there's at least one MORE
unbounded wait downstream of `async_main`. Candidates enumerated in
the doc update (`open_root_actor` finally / `actor.cancel()`
internals / trio.run bg tasks / `_serve_ipc_eps` finally).
Skip-mark stays on `test_nested_multierrors[subint_forkserver]`.

Also updates `subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item investigation win
list:

1. FD hygiene fix (`_close_inherited_fds`): orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait`: cancellable
3. `_parent_chan_cs` wiring: shielded parent-chan loop now
   breakable
4. `wait_for_no_more_peers` bound: THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck socket recv,
   capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded wait in the
   teardown cascade above `async_main`

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e312a68d8a0bacef1c9006bf1f6cd2ae9919f683)
(factored: dropped subint_forkserver conc-anal doc update)
---
 tractor/runtime/_runtime.py | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/tractor/runtime/_runtime.py b/tractor/runtime/_runtime.py
index 12b2473e..9dcca501 100644
--- a/tractor/runtime/_runtime.py
+++ b/tractor/runtime/_runtime.py
@@ -1973,7 +1973,25 @@ async def async_main(
                 f' {pformat(ipc_server._peers)}'
             )
             log.runtime(teardown_report)
-            await ipc_server.wait_for_no_more_peers()
+            # NOTE: bound the peer-clear wait — otherwise if any
+            # peer-channel handler is stuck (e.g. never got its
+            # cancel propagated due to a runtime bug), this wait
+            # blocks forever and deadlocks the whole actor-tree
+            # teardown cascade. 3s is enough for any graceful
+            # cancel-ack round-trip; beyond that we're in bug
+            # territory and need to proceed with local teardown
+            # so the parent's `_ForkedProc.wait()` can unblock.
+            # See `ai/conc-anal/
+            # subint_forkserver_test_cancellation_leak_issue.md`
+            # for the full diagnosis.
+            with trio.move_on_after(3.0) as _peers_cs:
+                await ipc_server.wait_for_no_more_peers()
+            if _peers_cs.cancelled_caught:
+                teardown_report += (
+                    f'-> TIMED OUT waiting for peers to clear '
+                    f'({len(ipc_server._peers)} still connected)\n'
+                )
+                log.warning(teardown_report)
         teardown_report += (
             '-]> all peer channels are complete.\n'
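
The bounded-wait pattern this patch applies can be sketched in isolation. The sketch below is not tractor's code: it swaps in the stdlib `asyncio.wait_for` for `trio.move_on_after` so that it runs without third-party dependencies, and the names `wait_for_peers_bounded`, `demo`, and the peer keys are hypothetical. It only illustrates the idea that a local teardown step waits on an event for a fixed time, records how many peers are still registered on timeout, and then proceeds.

```python
import asyncio


async def wait_for_peers_bounded(
    no_more_peers: asyncio.Event,
    peers: dict,
    timeout: float = 3.0,
) -> str:
    """Wait for `no_more_peers` to be set, giving up after `timeout` seconds.

    Returns a report string; it is empty when the peers cleared in time.
    (Hypothetical helper; stands in for the `move_on_after` block.)
    """
    report = ''
    try:
        # Bounded wait: never block local teardown on remote cooperation.
        await asyncio.wait_for(no_more_peers.wait(), timeout)
    except asyncio.TimeoutError:
        report += (
            f'-> TIMED OUT waiting for peers to clear '
            f'({len(peers)} still connected)\n'
        )
    return report


def demo(stuck: bool, timeout: float = 0.05) -> str:
    """Run the bounded wait once, with or without stuck peers."""

    async def main() -> str:
        event = asyncio.Event()
        # Two fake peers that never deregister when `stuck` is true.
        peers = {'peer-a': object(), 'peer-b': object()} if stuck else {}
        if not stuck:
            event.set()  # peers already cleared
        return await wait_for_peers_bounded(event, peers, timeout)

    return asyncio.run(main())
```

With `stuck=True` the call returns after roughly `timeout` seconds carrying the warning text. With `stuck=False` it returns immediately with an empty report. In trio the same decision is read from the cancel scope's `cancelled_caught` attribute instead of an exception handler.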