tractor

History

Gud Boi e312a68d8a Bound peer-clear wait in `async_main` finally Fifth diagnostic pass pinpointed the hang to `async_main`'s finally block — every stuck actor reaches `FINALLY ENTER` but never `RETURNING`. Specifically `await ipc_server.wait_for_no_more_ peers()` never returns when a peer-channel handler is stuck: the `_no_more_peers` Event is set only when `server._peers` empties, and stuck handlers keep their channels registered. Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout that records the still- connected peer count. 3s is enough for any graceful cancel-ack round-trip; beyond that we're in bug territory and need to proceed with local teardown so the parent's `_ForkedProc.wait()` can unblock. Defensive-in-depth regardless of the underlying bug — a local finally shouldn't block on remote cooperation forever. Verified: with this fix, ALL 15 actors reach `async_main: RETURNING` (up from 10/15 before). Test still hangs past 45s though — there's at least one MORE unbounded wait downstream of `async_main`. Candidates enumerated in the doc update (`open_root_actor` finally / `actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps` finally). Skip-mark stays on `test_nested_multierrors[subint_forkserver]`. Also updates `subint_forkserver_test_cancellation_leak_issue.md` with the new pinpoint + summary of the 6-item investigation win list: 1. FD hygiene fix (`_close_inherited_fds`) — orphan-SIGINT closed 2. pidfd-based `_ForkedProc.wait` — cancellable 3. `_parent_chan_cs` wiring — shielded parent-chan loop now breakable 4. `wait_for_no_more_peers` bound — THIS commit 5. Ruled-out hypotheses: tree-kill missing, stuck socket recv, capture-pipe fill (all wrong) 6. Remaining unknown: at least one more unbounded wait in the teardown cascade above `async_main` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 22:34:49 -04:00
..
conc-anal	Bound peer-clear wait in `async_main` finally	2026-04-23 22:34:49 -04:00
prompt-io	Add CPython-level `subint_fork` workaround smoketest	2026-04-23 18:48:34 -04:00

Gud Boi e312a68d8a Bound peer-clear wait in `async_main` finally

Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

2026-04-23 22:34:49 -04:00

conc-anal

Bound peer-clear wait in `async_main` finally

2026-04-23 22:34:49 -04:00

prompt-io

Add CPython-level `subint_fork` workaround smoketest

2026-04-23 18:48:34 -04:00