Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested-cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass:
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
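The counting behind those numbers can be sketched as a simple log-pairing pass (stdlib-only; the `DIAGDEBUG ENTER/EXIT loop=<id>` line shape here is a hypothetical stand-in for the real instrumentation format):

```python
from collections import Counter

def find_stuck(log_lines: list[str]) -> Counter:
    '''
    Pair ENTER/EXIT events per loop id; any id left with more
    ENTERs than EXITs is a loop that never returned.
    '''
    open_loops: Counter = Counter()
    for line in log_lines:
        _tag, event, loop = line.split()
        loop_id = loop.removeprefix('loop=')
        open_loops[loop_id] += 1 if event == 'ENTER' else -1
    # unary `+` drops ids whose count fell back to zero (clean exits)
    return +open_loops

lines = [
    'DIAGDEBUG ENTER loop=parent-chan',
    'DIAGDEBUG EXIT loop=parent-chan',
    'DIAGDEBUG ENTER loop=peer-chan',
]
# only the unmatched peer-chan ENTER survives
assert dict(find_stuck(lines)) == {'peer-chan': 1}
```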
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded and they're
inside the `service_tn` scope, so a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
branch: `subint_forkserver_backend` · parent: `8ac3dfeb85` · commit: `506617c695`
@@ -306,13 +306,103 @@ root's `open_nursery` receives the
 `BaseExceptionGroup` containing the `AssertionError`
 from the errorer and unwinds cleanly.
 
+## Update — 2026-04-23: partial fix landed, deeper layer surfaced
+
+Three improvements landed as separate commits in the
+`subint_forkserver_backend` branch (see `git log`):
+
+1. **`_close_inherited_fds()` in fork-child prelude**
+   (`tractor/spawn/_subint_forkserver.py`). POSIX
+   close-fds-equivalent enumeration via
+   `/proc/self/fd` (or `RLIMIT_NOFILE` fallback), keep
+   only stdio. This is fix-direction (1) from the list
+   above — went with the blunt form rather than the
+   targeted enum-via-`actor.ipc_server` form, turns
+   out the aggressive close is safe because every
+   inheritable resource the fresh child needs
+   (IPC-channel socket, etc.) is opened AFTER the
+   fork anyway.
+2. **`_ForkedProc.wait()` via `os.pidfd_open()` +
+   `trio.lowlevel.wait_readable()`** — matches the
+   `trio.Process.wait` / `mp.Process.sentinel` pattern
+   used by `trio_proc` and `proc_waiter`. Gives us
+   fully trio-cancellable child-wait (prior impl
+   blocked a cache thread on a sync `os.waitpid` that
+   was NOT trio-cancellable due to
+   `abandon_on_cancel=False`).
+3. **`_parent_chan_cs` wiring** in
+   `tractor/runtime/_runtime.py`: capture the shielded
+   `loop_cs` for the parent-channel `process_messages`
+   task in `async_main`; explicitly cancel it in
+   `Actor.cancel()` teardown. This breaks the shield
+   during teardown so the parent-chan loop exits when
+   cancel is issued, instead of parking on a parent-
+   socket EOF that might never arrive under fork
+   semantics.
+
+**Concrete wins from (1):** the sibling
+`subint_forkserver_orphan_sigint_hang_issue.md` class
+is **now fixed** — `test_orphaned_subactor_sigint_cleanup_DRAFT`
+went from strict-xfail to pass. The xfail mark was
+removed; the test remains as a regression guard.
+
+**test_nested_multierrors STILL hangs** though.
+
+### Updated diagnosis (narrowed)
+
+DIAGDEBUG instrumentation of `process_messages` ENTER/
+EXIT pairs + `_parent_chan_cs.cancel()` call sites
+showed (captured during a 20s-timeout repro):
+
+- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck.
+- **All 40 `shield=True` ENTERs matched EXIT** — every
+  shielded parent-chan loop exits cleanly. The
+  `_parent_chan_cs` wiring works as intended.
+- **The 5 stuck loops are all `shield=False`** — peer-
+  channel handlers (inbound connections handled by
+  `handle_stream_from_peer` in stream_handler_tn).
+- After our `_parent_chan_cs.cancel()` fires, NEW
+  shielded process_messages loops start (on the
+  session reg_addr port — probably discovery-layer
+  reconnection attempts). These don't block teardown
+  (they all exit) but indicate the cancel cascade has
+  more moving parts than expected.
+
+### Remaining unknown
+
+Why don't the 5 peer-channel loops exit when
+`service_tn.cancel_scope.cancel()` fires? They're in
+`stream_handler_tn` which IS `service_tn` in the
+current configuration (`open_ipc_server(parent_tn=
+service_tn, stream_handler_tn=service_tn)`). A
+standard nursery-scope-cancel should propagate through
+them — no shield, no special handler. Something
+specific to the fork-spawned configuration keeps them
+alive.
+
+Candidate follow-up experiments:
+
+- Dump the trio task tree at the hang point (via
+  `stackscope` or direct trio introspection) to see
+  what each stuck loop is awaiting. `chan.__anext__`
+  on a socket recv? An inner lock? A shielded sub-task?
+- Compare peer-channel handler lifecycle under
+  `trio_proc` vs `subint_forkserver` with equivalent
+  logging to spot the divergence.
+- Investigate whether the peer handler is caught in
+  the `except trio.Cancelled:` path at
+  `tractor/ipc/_server.py:448` that re-raises — but
+  re-raise means it should still exit. Unless
+  something higher up swallows it.
+
 ## Stopgap (landed)
 
-Until the fix lands, `test_nested_multierrors` +
-related multi-level-spawn tests can be skip-marked
-under `subint_forkserver` via
+`test_nested_multierrors` skip-marked under
+`subint_forkserver` via
 `@pytest.mark.skipon_spawn_backend('subint_forkserver',
-reason='...')`. Cross-ref this doc.
+reason='...')`, cross-referenced to this doc. Mark
+should be dropped once the peer-channel-loop exit
+issue is fixed.
 
 ## References
 
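The fork-child FD scrub from fix (1) can be sketched stdlib-only (hypothetical helper names; the real `_close_inherited_fds()` likely differs in detail):

```python
import os
import resource

STDIO = frozenset({0, 1, 2})

def inherited_fds(keep: frozenset = STDIO) -> list:
    '''
    Enumerate this process's open fds via `/proc/self/fd`, falling
    back to a 0..RLIMIT_NOFILE scan where procfs is unavailable.
    Note: the listing transiently includes the fd used to read the
    directory itself; harmless for a close-everything pass.
    '''
    try:
        fds = [int(name) for name in os.listdir('/proc/self/fd')]
    except FileNotFoundError:
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        fds = list(range(soft))
    return sorted(fd for fd in fds if fd not in keep)

def close_inherited_fds() -> None:
    # fork-child prelude: keep only stdio, close everything else
    for fd in inherited_fds():
        try:
            os.close(fd)
        except OSError:
            pass  # already closed / never open (fallback path)
```

The "blunt" shape is safe here, per the doc, only because everything the child needs is reopened after the fork.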
@@ -452,6 +452,19 @@ async def spawn_and_error(
         await nursery.run_in_actor(*args, **kwargs)
 
 
+@pytest.mark.skipon_spawn_backend(
+    'subint_forkserver',
+    reason=(
+        'Multi-level fork-spawn cancel cascade hang — '
+        'peer-channel `process_messages` loops do not '
+        'exit on `service_tn.cancel_scope.cancel()`. '
+        'See `ai/conc-anal/'
+        'subint_forkserver_test_cancellation_leak_issue.md` '
+        'for the full diagnosis + candidate fix directions. '
+        'Drop this mark once the peer-chan-loop exit issue '
+        'is closed.'
+    ),
+)
 @pytest.mark.timeout(
     10,
     method='thread',