From e312a68d8a0bacef1c9006bf1f6cd2ae9919f683 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 22:34:49 -0400
Subject: [PATCH] Bound peer-clear wait in `async_main` finally
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fifth diagnostic pass pinpointed the hang to `async_main`'s finally
block — every stuck actor reaches `FINALLY ENTER` but never
`RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler is stuck: the
`_no_more_peers` Event is set only when `server._peers` empties, and
stuck handlers keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout
that records the still-connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're in bug territory and
need to proceed with local teardown so the parent's
`_ForkedProc.wait()` can unblock. Defense-in-depth regardless of the
underlying bug — a local finally shouldn't block on remote cooperation
forever.

Verified: with this fix, ALL 15 actors reach `async_main: RETURNING`
(up from 10/15 before). The test still hangs past 45s though — there's
at least one MORE unbounded wait downstream of `async_main`.
Candidates are enumerated in the doc update (`open_root_actor` finally
/ `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps`
finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates `subint_forkserver_test_cancellation_leak_issue.md` with
the new pinpoint + a summary of the 6-item investigation win list:

1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck socket recv,
   capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded wait in the
   teardown cascade above `async_main`

(this commit msg was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 87 +++++++++++++++++++
 tractor/runtime/_runtime.py                   | 20 ++++-
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index 0a1c4f5b..a91093b0 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -635,6 +635,93 @@ asymmetric.
 Not purely depth-determined. Some race condition
 in nursery teardown when multiple siblings error
 simultaneously.
 
+## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
+
+Further DIAGDEBUG at every milestone in `async_main`
+(runtime UP / EXITED service_tn / EXITED root_tn /
+FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
+ENTER/RETURNED per-pidfd. Result:
+
+**Every stuck actor reaches `async_main: FINALLY
+ENTER` but NOT `async_main: RETURNING`.**
+
+That isolates the hang to a specific await in
+`async_main`'s finally block at
+`tractor/runtime/_runtime.py:1837+`. The suspect:
+
+```python
+# Ensure all peers (actors connected to us as clients) are finished
+if (
+    (ipc_server := actor.ipc_server)
+    and ipc_server.has_peers(check_chans=True)
+):
+    ...
+    await ipc_server.wait_for_no_more_peers()  # ← UNBOUNDED, blocks forever
+```
+
+`_no_more_peers` is an `Event` set only when
+`server._peers` empties (see
+`ipc/_server.py:526-530`). If ANY peer-handler is
+stuck (the 5 unclosed loops from the earlier pass),
+it keeps its channel in `server._peers`, so the
+event never fires, so the wait hangs.
+
+### Applied fix (partial, landed as defense-in-depth)
+
+`tractor/runtime/_runtime.py:1981` — the
+`wait_for_no_more_peers()` call is now wrapped in
+`trio.move_on_after(3.0)` + a warning log when the
+timeout fires. Commented with the full rationale.
+
+**Verified:** with this fix, ALL 15 actors reach
+`async_main: RETURNING` cleanly (up from 10/15
+reaching the end before).
+
+**Unfortunately:** the test still hangs past 45s
+total — meaning there's YET ANOTHER unbounded wait
+downstream of `async_main`. The bounded
+`wait_for_no_more_peers` unblocks one level, but
+the cascade has another level above it.
+
+### Candidates for the remaining hang
+
+1. `open_root_actor`'s own finally /
+   post-`async_main` flow in `_root.py` —
+   specifically `await actor.cancel(None)` which
+   has its own internal waits.
+2. `trio.run()` itself doesn't return even after
+   the root task completes because trio's nursery
+   still has background tasks running.
+3. Maybe `_serve_ipc_eps`'s finally has an await
+   that blocks when peers aren't clearing.
+
+### Current stance
+
+- The defensive `wait_for_no_more_peers` bound
+  landed (good hygiene regardless); it reveals a
+  real deadlock-avoidance gap in tractor's cleanup.
+- Test still hangs → skip-mark restored on
+  `test_nested_multierrors[subint_forkserver]`.
+- The full chain of unbounded waits needs another
+  session of drilling, probably at the
+  `open_root_actor` / `actor.cancel` level.
+
+### Summary of this investigation's wins
+
+1. **FD hygiene fix** (`_close_inherited_fds`) —
+   correct, closed the orphan-SIGINT sibling issue.
+2. **pidfd-based `_ForkedProc.wait`** —
+   cancellable, matches the trio_proc pattern.
+3. **`_parent_chan_cs` wiring** — `Actor.cancel()`
+   now breaks the shielded parent-chan
+   `process_messages` loop.
+4. **`wait_for_no_more_peers` bounded** — prevents
+   the actor-level finally hang.
+5. **Ruled-out hypotheses:** tree-kill missing
+   (wrong), stuck socket recv (wrong),
+   capture-pipe fill (wrong).
+6. **Pinpointed remaining unknown:** at least one
+   more unbounded wait in the teardown cascade
+   above `async_main`. Concrete candidates
+   enumerated above.
+
 ## Stopgap (landed)
 
 `test_nested_multierrors` skip-marked under
diff --git a/tractor/runtime/_runtime.py b/tractor/runtime/_runtime.py
index 12b2473e..9dcca501 100644
--- a/tractor/runtime/_runtime.py
+++ b/tractor/runtime/_runtime.py
@@ -1973,7 +1973,25 @@ async def async_main(
             f'  {pformat(ipc_server._peers)}'
         )
         log.runtime(teardown_report)
-        await ipc_server.wait_for_no_more_peers()
+        # NOTE: bound the peer-clear wait — otherwise if any
+        # peer-channel handler is stuck (e.g. never got its
+        # cancel propagated due to a runtime bug), this wait
+        # blocks forever and deadlocks the whole actor-tree
+        # teardown cascade. 3s is enough for any graceful
+        # cancel-ack round-trip; beyond that we're in bug
+        # territory and need to proceed with local teardown
+        # so the parent's `_ForkedProc.wait()` can unblock.
+        # See `ai/conc-anal/
+        # subint_forkserver_test_cancellation_leak_issue.md`
+        # for the full diagnosis.
+        with trio.move_on_after(3.0) as _peers_cs:
+            await ipc_server.wait_for_no_more_peers()
+        if _peers_cs.cancelled_caught:
+            teardown_report += (
+                '-> TIMED OUT waiting for peers to clear '
+                f'({len(ipc_server._peers)} still connected)\n'
+            )
+            log.warning(teardown_report)
 
         teardown_report += (
             '-]> all peer channels are complete.\n'