From e312a68d8a0bacef1c9006bf1f6cd2ae9919f683 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 22:34:49 -0400
Subject: [PATCH] Bound peer-clear wait in `async_main` finally
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fifth diagnostic pass pinpointed the hang to `async_main`'s finally
block — every stuck actor reaches `FINALLY ENTER` but never
`RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler is stuck: the
`_no_more_peers` Event is set only when `server._peers` empties, and
stuck handlers keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout
that records the still-connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're in bug territory and
need to proceed with local teardown so the parent's
`_ForkedProc.wait()` can unblock. Defense-in-depth regardless of the
underlying bug — a local finally shouldn't block on remote cooperation
forever.

Verified: with this fix, ALL 15 actors reach `async_main: RETURNING`
(up from 10/15 before). The test still hangs past 45s though — there's
at least one MORE unbounded wait downstream of `async_main`.
Candidates are enumerated in the doc update (`open_root_actor` finally
/ `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps`
finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates `subint_forkserver_test_cancellation_leak_issue.md` with
the new pinpoint + a summary of the 6-item investigation win list:

1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck socket recv,
   capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded wait in the
   teardown cascade above `async_main`

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 87 +++++++++++++++++++
 tractor/runtime/_runtime.py                   | 20 ++++-
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index 0a1c4f5b..a91093b0 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -635,6 +635,93 @@ asymmetric.
 Not purely depth-determined. Some race condition
 in nursery teardown when multiple siblings error
 simultaneously.
 
+## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
+
+Further DIAGDEBUG at every milestone in `async_main`
+(runtime UP / EXITED service_tn / EXITED root_tn /
+FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
+ENTER/RETURNED per-pidfd. Result:
+
+**Every stuck actor reaches `async_main: FINALLY
+ENTER` but NOT `async_main: RETURNING`.**
+
+That isolates the hang to a specific await in
+`async_main`'s finally block at
+`tractor/runtime/_runtime.py:1837+`. The suspect:
+
+```python
+# Ensure all peers (actors connected to us as clients) are finished
+if (
+    (ipc_server := actor.ipc_server)
+    and ipc_server.has_peers(check_chans=True)
+):
+    ...
+    await ipc_server.wait_for_no_more_peers()  # ← UNBOUNDED, blocks forever
+```
+
+`_no_more_peers` is an `Event` set only when
+`server._peers` empties (see
+`ipc/_server.py:526-530`). If ANY peer-handler is
+stuck (the 5 unclosed loops from the earlier pass),
+it keeps its channel in `server._peers`, so the
+event never fires, so the wait hangs.
+
+### Applied fix (partial, landed as defense-in-depth)
+
+`tractor/runtime/_runtime.py:1981` — the
+`wait_for_no_more_peers()` call is now wrapped in
+`trio.move_on_after(3.0)` + a warning log when the
+timeout fires. Commented with the full rationale.
+
+**Verified:** with this fix, ALL 15 actors reach
+`async_main: RETURNING` cleanly (up from 10/15
+reaching the end before).
+
+**Unfortunately:** the test still hangs past 45s
+total — meaning there's YET ANOTHER unbounded wait
+downstream of `async_main`. The bounded
+`wait_for_no_more_peers` unblocks one level, but
+the cascade has another level above it.
+
+### Candidates for the remaining hang
+
+1. `open_root_actor`'s own finally /
+   post-`async_main` flow in `_root.py` —
+   specifically `await actor.cancel(None)` which
+   has its own internal waits.
+2. `trio.run()` itself doesn't return even after
+   the root task completes because trio's nursery
+   still has background tasks running.
+3. Maybe `_serve_ipc_eps`'s finally has an await
+   that blocks when peers aren't clearing.
+
+### Current stance
+
+- The defensive `wait_for_no_more_peers` bound
+  landed (good hygiene regardless); it reveals a
+  real deadlock-avoidance gap in tractor's cleanup.
+- Test still hangs → skip-mark restored on
+  `test_nested_multierrors[subint_forkserver]`.
+- The full chain of unbounded waits needs another
+  session of drilling, probably at the
+  `open_root_actor` / `actor.cancel` level.
+
+### Summary of this investigation's wins
+
+1. **FD hygiene fix** (`_close_inherited_fds`) —
+   correct, closed the orphan-SIGINT sibling issue.
+2. **pidfd-based `_ForkedProc.wait`** —
+   cancellable, matches the trio_proc pattern.
+3. **`_parent_chan_cs` wiring** — `Actor.cancel()`
+   now breaks the shielded parent-chan
+   `process_messages` loop.
+4. **`wait_for_no_more_peers` bounded** — prevents
+   the actor-level finally hang.
+5. **Ruled-out hypotheses:** tree-kill missing
+   (wrong), stuck socket recv (wrong),
+   capture-pipe fill (wrong).
+6. **Pinpointed remaining unknown:** at least one
+   more unbounded wait in the teardown cascade
+   above `async_main`. Concrete candidates
+   enumerated above.
+
 ## Stopgap (landed)
 
 `test_nested_multierrors` skip-marked under
diff --git a/tractor/runtime/_runtime.py b/tractor/runtime/_runtime.py
index 12b2473e..9dcca501 100644
--- a/tractor/runtime/_runtime.py
+++ b/tractor/runtime/_runtime.py
@@ -1973,7 +1973,25 @@ async def async_main(
             f'  {pformat(ipc_server._peers)}'
         )
         log.runtime(teardown_report)
-        await ipc_server.wait_for_no_more_peers()
+        # NOTE: bound the peer-clear wait — otherwise if any
+        # peer-channel handler is stuck (e.g. never got its
+        # cancel propagated due to a runtime bug), this wait
+        # blocks forever and deadlocks the whole actor-tree
+        # teardown cascade. 3s is enough for any graceful
+        # cancel-ack round-trip; beyond that we're in bug
+        # territory and need to proceed with local teardown
+        # so the parent's `_ForkedProc.wait()` can unblock.
+        # See `ai/conc-anal/
+        # subint_forkserver_test_cancellation_leak_issue.md`
+        # for the full diagnosis.
+        with trio.move_on_after(3.0) as _peers_cs:
+            await ipc_server.wait_for_no_more_peers()
+        if _peers_cs.cancelled_caught:
+            teardown_report += (
+                '-> TIMED OUT waiting for peers to clear '
+                f'({len(ipc_server._peers)} still connected)\n'
+            )
+            log.warning(teardown_report)
 
         teardown_report += (
             '-]> all peer channels are complete.\n'