Compare commits
62 Commits
a4d6318ca7
...
4106ba73ea
| Author | SHA1 | Date |
|---|---|---|
|
|
4106ba73ea | |
|
|
eceed29d4a | |
|
|
e312a68d8a | |
|
|
4d0555435b | |
|
|
ab86f7613d | |
|
|
458a35cf09 | |
|
|
7cd47ef7fb | |
|
|
76d12060aa | |
|
|
506617c695 | |
|
|
8ac3dfeb85 | |
|
|
c20b05e181 | |
|
|
9993db0193 | |
|
|
35da808905 | |
|
|
70d58c4bd2 | |
|
|
1af2121057 | |
|
|
e3f4f5a387 | |
|
|
d093c31979 | |
|
|
1e357dcf08 | |
|
|
e31eb8d7c9 | |
|
|
8bcbe730bf | |
|
|
5e85f184e0 | |
|
|
f5f37b69e6 | |
|
|
a72deef709 | |
|
|
dcd5c1ff40 | |
|
|
76605d5609 | |
|
|
7804a9feac | |
|
|
63ab7c986b | |
|
|
26914fde75 | |
|
|
cf2e71d87f | |
|
|
25e400d526 | |
|
|
82332fbceb | |
|
|
de4f470b6c | |
|
|
0f48ed2eb9 | |
|
|
eee79a0357 | |
|
|
4b2a0886c3 | |
|
|
3b26b59dad | |
|
|
f3cea714bc | |
|
|
985ea76de5 | |
|
|
5998774535 | |
|
|
a6cbac954d | |
|
|
189f4e3ffc | |
|
|
a65fded4c6 | |
|
|
4a3254583b | |
|
|
2ed5e6a6e8 | |
|
|
34d9d482e4 | |
|
|
09466a1e9d | |
|
|
99541feec7 | |
|
|
c041518bdb | |
|
|
31cbd11a5b | |
|
|
8a8d01e076 | |
|
|
03bf2b931e | |
|
|
b8f243e98d | |
|
|
d2ea8aa2de | |
|
|
d318f1f8f4 | |
|
|
64ddc42ad8 | |
|
|
b524ee4633 | |
|
|
b1a0753a3f | |
|
|
ba86d482e3 | |
|
|
d3d6f646f9 | |
|
|
9cf3d588e7 | |
|
|
e75e29b1dc | |
|
|
a7b1ee34ef |
|
|
@ -229,3 +229,69 @@ Unlike asyncio, trio allows checkpoints in
|
||||||
that does `await` can itself be cancelled (e.g.
|
that does `await` can itself be cancelled (e.g.
|
||||||
by nursery shutdown). Watch for cleanup code that
|
by nursery shutdown). Watch for cleanup code that
|
||||||
assumes it will run to completion.
|
assumes it will run to completion.
|
||||||
|
|
||||||
|
### Unbounded waits in cleanup paths
|
||||||
|
|
||||||
|
Any `await <event>.wait()` in a teardown path is
|
||||||
|
a latent deadlock unless the event's setter is
|
||||||
|
GUARANTEED to fire. If the setter depends on
|
||||||
|
external state (peer disconnects, child process
|
||||||
|
exit, subsequent task completion) that itself
|
||||||
|
depends on the current task's progress, you have
|
||||||
|
a mutual wait.
|
||||||
|
|
||||||
|
Rule: **bound every `await X.wait()` in cleanup
|
||||||
|
paths with `trio.move_on_after()`** unless you
|
||||||
|
can prove the setter is unconditionally reachable
|
||||||
|
from the state at the await site. Concrete recent
|
||||||
|
example: `ipc_server.wait_for_no_more_peers()` in
|
||||||
|
`async_main`'s finally (see
|
||||||
|
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
|
||||||
|
"probe iteration 3") — it was unbounded, and when
|
||||||
|
one peer-handler was stuck the wait-for-no-more-
|
||||||
|
peers event never fired, deadlocking the whole
|
||||||
|
actor-tree teardown cascade.
|
||||||
|
|
||||||
|
### The capture-pipe-fill hang pattern (grep this first)
|
||||||
|
|
||||||
|
When investigating any hang in the test suite
|
||||||
|
**especially under fork-based backends**, first
|
||||||
|
check whether the hang reproduces under `pytest
|
||||||
|
-s` (`--capture=no`). If `-s` makes it go away
|
||||||
|
you're not looking at a trio concurrency bug —
|
||||||
|
you're looking at a Linux pipe-buffer fill.
|
||||||
|
|
||||||
|
Mechanism: pytest replaces fds 1,2 with pipe
|
||||||
|
write-ends. Fork-child subactors inherit those
|
||||||
|
fds. High-volume error-log tracebacks (cancel
|
||||||
|
cascade spew) fill the 64KB pipe buffer. Child
|
||||||
|
`write()` blocks. Child can't exit. Parent's
|
||||||
|
`waitpid`/pidfd wait blocks. Deadlock cascades up
|
||||||
|
the tree.
|
||||||
|
|
||||||
|
Pre-existing guards in `tests/conftest.py` encode
|
||||||
|
this knowledge — grep these BEFORE blaming
|
||||||
|
concurrency:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# tests/conftest.py:258
|
||||||
|
if loglevel in ('trace', 'debug'):
|
||||||
|
# XXX: too much logging will lock up the subproc (smh)
|
||||||
|
loglevel: str = 'info'
|
||||||
|
|
||||||
|
# tests/conftest.py:316
|
||||||
|
# can lock up on the `_io.BufferedReader` and hang..
|
||||||
|
stderr: str = proc.stderr.read().decode()
|
||||||
|
```
|
||||||
|
|
||||||
|
Full post-mortem +
|
||||||
|
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
|
||||||
|
for the canonical reproduction. Cost several
|
||||||
|
investigation sessions before catching it —
|
||||||
|
because the capture-pipe symptom was masked by
|
||||||
|
deeper cascade-deadlocks. Once the cascades were
|
||||||
|
fixed, the tree tore down enough to generate
|
||||||
|
pipe-filling log volume → capture-pipe finally
|
||||||
|
surfaced. Grep-note for future-self: **if a
|
||||||
|
multi-subproc tractor test hangs, `pytest -s`
|
||||||
|
first, conc-anal second.**
|
||||||
|
|
|
||||||
|
|
@ -451,3 +451,73 @@ by your changes — note them and move on.
|
||||||
**Rule of thumb**: if a test fails with `TooSlowError`,
|
**Rule of thumb**: if a test fails with `TooSlowError`,
|
||||||
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
|
`trio.TooSlowError`, or `pexpect.TIMEOUT` and you didn't
|
||||||
touch the relevant code path, it's flaky — skip it.
|
touch the relevant code path, it's flaky — skip it.
|
||||||
|
|
||||||
|
## 9. The pytest-capture hang pattern (CHECK THIS FIRST)
|
||||||
|
|
||||||
|
**Symptom:** a tractor test hangs indefinitely under
|
||||||
|
default `pytest` but passes instantly when you add
|
||||||
|
`-s` (`--capture=no`).
|
||||||
|
|
||||||
|
**Cause:** tractor subactors (especially under fork-
|
||||||
|
based backends) inherit pytest's stdout/stderr
|
||||||
|
capture pipes via fds 1,2. Under high-volume error
|
||||||
|
logging (e.g. multi-level cancel cascade, nested
|
||||||
|
`run_in_actor` failures, anything triggering
|
||||||
|
`RemoteActorError` + `ExceptionGroup` traceback
|
||||||
|
spew), the **64KB Linux pipe buffer fills** faster
|
||||||
|
than pytest drains it. Subactor writes block → can't
|
||||||
|
finish exit → parent's `waitpid`/pidfd wait blocks →
|
||||||
|
deadlock cascades up the tree.
|
||||||
|
|
||||||
|
**Pre-existing guards in the tractor harness** that
|
||||||
|
encode this same knowledge — grep these FIRST
|
||||||
|
before spelunking:
|
||||||
|
|
||||||
|
- `tests/conftest.py:258-260` (in the `daemon`
|
||||||
|
fixture): `# XXX: too much logging will lock up
|
||||||
|
the subproc (smh)` — downgrades `trace`/`debug`
|
||||||
|
loglevel to `info` to prevent the hang.
|
||||||
|
- `tests/conftest.py:316`: `# can lock up on the
|
||||||
|
_io.BufferedReader and hang..` — noted on the
|
||||||
|
`proc.stderr.read()` post-SIGINT.
|
||||||
|
|
||||||
|
**Debug recipe (in priority order):**
|
||||||
|
|
||||||
|
1. **Try `-s` first.** If the hang disappears with
|
||||||
|
`pytest -s`, you've confirmed it's capture-pipe
|
||||||
|
fill. Skip spelunking.
|
||||||
|
2. **Lower the loglevel.** Default `--ll=error` on
|
||||||
|
this project; if you've bumped it to `debug` /
|
||||||
|
`info`, try dropping back. Each log level
|
||||||
|
multiplies pipe-pressure under fault cascades.
|
||||||
|
3. **If you MUST use default capture + high log
|
||||||
|
volume**, redirect subactor stdout/stderr in the
|
||||||
|
child prelude (e.g.
|
||||||
|
`tractor.spawn._subint_forkserver._child_target`
|
||||||
|
post-`_close_inherited_fds`) to `/dev/null` or a
|
||||||
|
file.
|
||||||
|
|
||||||
|
**Signature tells you it's THIS bug (vs. a real
|
||||||
|
code hang):**
|
||||||
|
|
||||||
|
- Multi-actor test under fork-based backend
|
||||||
|
(`subint_forkserver`, eventually `trio_proc` too
|
||||||
|
under enough log volume).
|
||||||
|
- Multiple `RemoteActorError` / `ExceptionGroup`
|
||||||
|
tracebacks in the error path.
|
||||||
|
- Test passes with `-s` in the 5-10s range, hangs
|
||||||
|
past pytest-timeout (usually 30+ s) without `-s`.
|
||||||
|
- Subactor processes visible via `pgrep -af
|
||||||
|
subint-forkserv` or similar after the hang —
|
||||||
|
they're alive but blocked on `write()` to an
|
||||||
|
inherited stdout fd.
|
||||||
|
|
||||||
|
**Historical reference:** this deadlock cost a
|
||||||
|
multi-session investigation (4 genuine cascade
|
||||||
|
fixes landed along the way) that only surfaced the
|
||||||
|
capture-pipe issue AFTER the deeper fixes let the
|
||||||
|
tree actually tear down enough to produce pipe-
|
||||||
|
filling log volume. Full post-mortem in
|
||||||
|
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
|
||||||
|
Lesson codified here so future-me grep-finds the
|
||||||
|
workaround before digging.
|
||||||
|
|
|
||||||
|
|
@ -395,6 +395,432 @@ Candidate follow-up experiments:
|
||||||
re-raise means it should still exit. Unless
|
re-raise means it should still exit. Unless
|
||||||
something higher up swallows it.
|
something higher up swallows it.
|
||||||
|
|
||||||
|
### Attempted fix (DID NOT work) — hypothesis (3)
|
||||||
|
|
||||||
|
Tried: in `_serve_ipc_eps` finally, after closing
|
||||||
|
listeners, also iterate `server._peers` and
|
||||||
|
sync-close each peer channel's underlying stream
|
||||||
|
socket fd:
|
||||||
|
|
||||||
|
```python
|
||||||
|
for _uid, _chans in list(server._peers.items()):
|
||||||
|
for _chan in _chans:
|
||||||
|
try:
|
||||||
|
_stream = _chan._transport.stream if _chan._transport else None
|
||||||
|
if _stream is not None:
|
||||||
|
_stream.socket.close() # sync fd close
|
||||||
|
except (AttributeError, OSError):
|
||||||
|
pass
|
||||||
|
```
|
||||||
|
|
||||||
|
Theory: closing the socket fd from outside the stuck
|
||||||
|
recv task would make the recv see EBADF /
|
||||||
|
ClosedResourceError and unblock.
|
||||||
|
|
||||||
|
Result: `test_nested_multierrors[subint_forkserver]`
|
||||||
|
still hangs identically. Either:
|
||||||
|
- The sync `socket.close()` doesn't propagate into
|
||||||
|
trio's in-flight `recv_some()` the way I expected
|
||||||
|
(trio may hold an internal reference that keeps the
|
||||||
|
fd open even after an external close), or
|
||||||
|
- The stuck recv isn't even the root blocker and the
|
||||||
|
peer handlers never reach the finally for some
|
||||||
|
reason I haven't understood yet.
|
||||||
|
|
||||||
|
Either way, the sync-close hypothesis is **ruled
|
||||||
|
out**. Reverted the experiment, restored the skip-
|
||||||
|
mark on the test.
|
||||||
|
|
||||||
|
### Aside: `-s` flag does NOT change `test_nested_multierrors` behavior
|
||||||
|
|
||||||
|
Tested explicitly: both with and without `-s`, the
|
||||||
|
test hangs identically. So the capture-pipe-fill
|
||||||
|
hypothesis is **ruled out** for this test.
|
||||||
|
|
||||||
|
The earlier `test_context_stream_semantics.py` `-s`
|
||||||
|
observation was most likely caused by a competing
|
||||||
|
pytest run in my session (confirmed via process list
|
||||||
|
— my leftover pytest was alive at that time and
|
||||||
|
could have been holding state on the default
|
||||||
|
registry port).
|
||||||
|
|
||||||
|
## Update — 2026-04-23 (late): cancel delivery ruled in, nursery-wait ruled BLOCKER
|
||||||
|
|
||||||
|
**New diagnostic run** instrumented
|
||||||
|
`handle_stream_from_peer` at ENTER / `except
|
||||||
|
trio.Cancelled:` / finally, plus `Actor.cancel()`
|
||||||
|
just before `self._parent_chan_cs.cancel()`. Result:
|
||||||
|
|
||||||
|
- **40 `handle_stream_from_peer` ENTERs**.
|
||||||
|
- **0 `except trio.Cancelled:` hits** — cancel
|
||||||
|
never fires on any peer-handler.
|
||||||
|
- **35 finally hits** — those handlers exit via
|
||||||
|
peer-initiated EOF (normal return), NOT cancel.
|
||||||
|
- **5 handlers never reach finally** — stuck forever.
|
||||||
|
- **`Actor.cancel()` fired in 12 PIDs** — but the
|
||||||
|
PIDs with peer handlers that DIDN'T fire
|
||||||
|
Actor.cancel are exactly **root + 2 direct
|
||||||
|
spawners**. These 3 actors have peer handlers
|
||||||
|
(for their own subactors) that stay stuck because
|
||||||
|
**`Actor.cancel()` at these levels never runs**.
|
||||||
|
|
||||||
|
### The actual deadlock shape
|
||||||
|
|
||||||
|
`Actor.cancel()` lives in
|
||||||
|
`open_root_actor.__aexit__` / `async_main` teardown.
|
||||||
|
That only runs when the enclosing `async with
|
||||||
|
tractor.open_nursery()` exits. The nursery's
|
||||||
|
`__aexit__` calls the backend `*_proc` spawn target's
|
||||||
|
teardown, which does `soft_kill() →
|
||||||
|
_ForkedProc.wait()` on its child PID. That wait is
|
||||||
|
trio-cancellable via pidfd now (good) — but nothing
|
||||||
|
CANCELS it because the outer scope only cancels when
|
||||||
|
`Actor.cancel()` runs, which only runs when the
|
||||||
|
nursery completes, which waits on the child.
|
||||||
|
|
||||||
|
It's a **multi-level mutual wait**:
|
||||||
|
|
||||||
|
```
|
||||||
|
root blocks on spawner.wait()
|
||||||
|
spawner blocks on grandchild.wait()
|
||||||
|
grandchild blocks on errorer.wait()
|
||||||
|
errorer Actor.cancel() ran, but process
|
||||||
|
may not have fully exited yet
|
||||||
|
(something in root_tn holding on?)
|
||||||
|
```
|
||||||
|
|
||||||
|
Each level waits for the level below. The bottom
|
||||||
|
level (errorer) reaches Actor.cancel(), but its
|
||||||
|
process may not fully exit — meaning its pidfd
|
||||||
|
doesn't go readable, meaning the grandchild's
|
||||||
|
waitpid doesn't return, meaning the grandchild's
|
||||||
|
nursery doesn't unwind, etc. all the way up.
|
||||||
|
|
||||||
|
### Refined question
|
||||||
|
|
||||||
|
**Why does an errorer process not exit after its
|
||||||
|
`Actor.cancel()` completes?**
|
||||||
|
|
||||||
|
Possibilities:
|
||||||
|
1. `_parent_chan_cs.cancel()` fires (shielded
|
||||||
|
parent-chan loop unshielded), but the task is
|
||||||
|
stuck INSIDE the shielded loop's recv in a way
|
||||||
|
that cancel still can't break.
|
||||||
|
2. After `Actor.cancel()` returns, `async_main`
|
||||||
|
still has other tasks in `root_tn` waiting for
|
||||||
|
something that never arrives (e.g. outbound
|
||||||
|
IPC reply delivery).
|
||||||
|
3. The `os._exit(rc)` in `_worker` (at
|
||||||
|
`_subint_forkserver.py`) doesn't run because
|
||||||
|
`_child_target` never returns.
|
||||||
|
|
||||||
|
Next-session candidate probes (in priority order):
|
||||||
|
|
||||||
|
1. **Instrument `_worker`'s fork-child branch** to
|
||||||
|
confirm whether `child_target()` returns (and
|
||||||
|
thus `os._exit(rc)` is reached) for errorer
|
||||||
|
PIDs. If yes → process should die; if no →
|
||||||
|
trace back into `_actor_child_main` /
|
||||||
|
`_trio_main` / `async_main` to find the stuck
|
||||||
|
spot.
|
||||||
|
2. **Instrument `async_main`'s final unwind** to
|
||||||
|
see which await in the teardown doesn't
|
||||||
|
complete.
|
||||||
|
3. **Compare under `trio_proc` backend** at the
|
||||||
|
same `_worker`-equivalent level to see where
|
||||||
|
the flows diverge.
|
||||||
|
|
||||||
|
### Rule-out: NOT a stuck peer-chan recv
|
||||||
|
|
||||||
|
Earlier hypothesis was that the 5 stuck peer-chan
|
||||||
|
loops were blocked on a socket recv that cancel
|
||||||
|
couldn't interrupt. This pass revealed the real
|
||||||
|
cause: cancel **never reaches those tasks** because
|
||||||
|
their owning actor's `Actor.cancel()` never runs.
|
||||||
|
The recvs are fine — they're just parked because
|
||||||
|
nothing is telling them to stop.
|
||||||
|
|
||||||
|
## Update — 2026-04-23 (very late): leaves exit, middle actors stuck in `trio.run`
|
||||||
|
|
||||||
|
Yet another instrumentation pass — this time
|
||||||
|
printing at:
|
||||||
|
|
||||||
|
- `_worker` child branch: `pre child_target()` /
|
||||||
|
`child_target RETURNED rc=N` / `about to
|
||||||
|
os._exit(rc)`
|
||||||
|
- `_trio_main`: `about to trio.run` /
|
||||||
|
`trio.run RETURNED NORMALLY` / `FINALLY`
|
||||||
|
|
||||||
|
**Fresh-run results** (`test_nested_multierrors[
|
||||||
|
subint_forkserver]`, depth=1/breadth=2, 1 root + 14
|
||||||
|
forked = 15 actors total):
|
||||||
|
|
||||||
|
- **9 processes completed the full flow** —
|
||||||
|
`trio.run RETURNED NORMALLY` → `child_target
|
||||||
|
RETURNED rc=0` → `about to os._exit(0)`. These
|
||||||
|
are the LEAVES of the tree (errorer actors) plus
|
||||||
|
their direct parents (depth-0 spawners). They
|
||||||
|
actually exit their processes.
|
||||||
|
- **5 processes are stuck INSIDE `trio.run(trio_main)`**
|
||||||
|
— they hit "about to trio.run" but NEVER see
|
||||||
|
"trio.run RETURNED NORMALLY". These are root +
|
||||||
|
top-level spawners + one intermediate.
|
||||||
|
|
||||||
|
**What this means:** `async_main` itself is the
|
||||||
|
deadlock holder, not the peer-channel loops.
|
||||||
|
Specifically, the outer `async with root_tn:` in
|
||||||
|
`async_main` never exits for the 5 stuck actors.
|
||||||
|
Their `trio.run` never returns → `_trio_main`
|
||||||
|
catch/finally never runs → `_worker` never reaches
|
||||||
|
`os._exit(rc)` → the PROCESS never dies → its
|
||||||
|
parent's `_ForkedProc.wait()` blocks → parent's
|
||||||
|
nursery hangs → parent's `async_main` hangs → ...
|
||||||
|
|
||||||
|
### The new precise question
|
||||||
|
|
||||||
|
**What task in the 5 stuck actors' `async_main`
|
||||||
|
never completes?** Candidates:
|
||||||
|
|
||||||
|
1. The shielded parent-chan `process_messages`
|
||||||
|
task in `root_tn` — but we explicitly cancel it
|
||||||
|
via `_parent_chan_cs.cancel()` in `Actor.cancel()`.
|
||||||
|
However, `Actor.cancel()` only runs during
|
||||||
|
`open_root_actor.__aexit__`, which itself runs
|
||||||
|
only after `async_main`'s outer unwind — which
|
||||||
|
doesn't happen. So the shield isn't broken.
|
||||||
|
|
||||||
|
2. `await actor_nursery._join_procs.wait()` or
|
||||||
|
similar in the inline backend `*_proc` flow.
|
||||||
|
|
||||||
|
3. `_ForkedProc.wait()` on a grandchild that
|
||||||
|
actually DID exit — but the pidfd_open watch
|
||||||
|
didn't fire for some reason (race between
|
||||||
|
pidfd_open and the child exiting?).
|
||||||
|
|
||||||
|
The most specific next probe: **add DIAG around
|
||||||
|
`_ForkedProc.wait()` enter/exit** to see whether
|
||||||
|
the pidfd-based wait returns for every grandchild
|
||||||
|
exit. If a stuck parent's `_ForkedProc.wait()`
|
||||||
|
NEVER returns despite its child exiting, the
|
||||||
|
pidfd mechanism has a race bug under nested
|
||||||
|
forkserver.
|
||||||
|
|
||||||
|
Alternative probe: instrument `async_main`'s outer
|
||||||
|
nursery exits to find which nursery's `__aexit__`
|
||||||
|
is stuck, drilling down from `trio.run` to the
|
||||||
|
specific `async with` that never completes.
|
||||||
|
|
||||||
|
### Cascade summary (updated tree view)
|
||||||
|
|
||||||
|
```
|
||||||
|
ROOT (pytest) STUCK in trio.run
|
||||||
|
├── top_0 (spawner, d=1) STUCK in trio.run
|
||||||
|
│ ├── spawner_0_d1_0 (d=0) exited (os._exit 0)
|
||||||
|
│ │ ├── errorer_0_0 exited (os._exit 0)
|
||||||
|
│ │ └── errorer_0_1 exited (os._exit 0)
|
||||||
|
│ └── spawner_0_d1_1 (d=0) exited (os._exit 0)
|
||||||
|
│ ├── errorer_0_2 exited (os._exit 0)
|
||||||
|
│ └── errorer_0_3 exited (os._exit 0)
|
||||||
|
└── top_1 (spawner, d=1) STUCK in trio.run
|
||||||
|
├── spawner_1_d1_0 (d=0) STUCK in trio.run (sibling race?)
|
||||||
|
│ ├── errorer_1_0 exited
|
||||||
|
│ └── errorer_1_1 exited
|
||||||
|
└── spawner_1_d1_1 (d=0) STUCK in trio.run
|
||||||
|
├── errorer_1_2 exited
|
||||||
|
└── errorer_1_3 exited
|
||||||
|
```
|
||||||
|
|
||||||
|
Grandchildren (d=0 spawners) exit OR stick —
|
||||||
|
asymmetric. Not purely depth-determined. Some race
|
||||||
|
condition in nursery teardown when multiple
|
||||||
|
siblings error simultaneously.
|
||||||
|
|
||||||
|
## Update — 2026-04-23 (late, probe iteration 3): hang pinpointed to `wait_for_no_more_peers()`
|
||||||
|
|
||||||
|
Further DIAGDEBUG at every milestone in `async_main`
|
||||||
|
(runtime UP / EXITED service_tn / EXITED root_tn /
|
||||||
|
FINALLY ENTER / RETURNING) plus `_ForkedProc.wait`
|
||||||
|
ENTER/RETURNED per-pidfd. Result:
|
||||||
|
|
||||||
|
**Every stuck actor reaches `async_main: FINALLY
|
||||||
|
ENTER` but NOT `async_main: RETURNING`.**
|
||||||
|
|
||||||
|
That isolates the hang to a specific await in
|
||||||
|
`async_main`'s finally block at
|
||||||
|
`tractor/runtime/_runtime.py:1837+`. The suspect:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Ensure all peers (actors connected to us as clients) are finished
|
||||||
|
if ipc_server := actor.ipc_server and ipc_server.has_peers(check_chans=True):
|
||||||
|
...
|
||||||
|
await ipc_server.wait_for_no_more_peers() # ← UNBOUNDED, blocks forever
|
||||||
|
```
|
||||||
|
|
||||||
|
`_no_more_peers` is an `Event` set only when
|
||||||
|
`server._peers` empties (see
|
||||||
|
`ipc/_server.py:526-530`). If ANY peer-handler is
|
||||||
|
stuck (the 5 unclosed loops from the earlier pass),
|
||||||
|
it keeps its channel in `server._peers`, so the
|
||||||
|
event never fires, so the wait hangs.
|
||||||
|
|
||||||
|
### Applied fix (partial, landed as defensive-in-depth)
|
||||||
|
|
||||||
|
`tractor/runtime/_runtime.py:1981` —
|
||||||
|
`wait_for_no_more_peers()` call now wrapped in
|
||||||
|
`trio.move_on_after(3.0)` + a warning log when the
|
||||||
|
timeout fires. Commented with the full rationale.
|
||||||
|
|
||||||
|
**Verified:** with this fix, ALL 15 actors reach
|
||||||
|
`async_main: RETURNING` cleanly (up from 10/15
|
||||||
|
reaching end before).
|
||||||
|
|
||||||
|
**Unfortunately:** the test still hangs past 45s
|
||||||
|
total — meaning there's YET ANOTHER unbounded wait
|
||||||
|
downstream of `async_main`. The bounded
|
||||||
|
`wait_for_no_more_peers` unblocks one level, but
|
||||||
|
the cascade has another level above it.
|
||||||
|
|
||||||
|
### Candidates for the remaining hang
|
||||||
|
|
||||||
|
1. `open_root_actor`'s own finally / post-
|
||||||
|
`async_main` flow in `_root.py` — specifically
|
||||||
|
`await actor.cancel(None)` which has its own
|
||||||
|
internal waits.
|
||||||
|
2. The `trio.run()` itself doesn't return even
|
||||||
|
after the root task completes because trio's
|
||||||
|
nursery still has background tasks running.
|
||||||
|
3. Maybe `_serve_ipc_eps`'s finally has an await
|
||||||
|
that blocks when peers aren't clearing.
|
||||||
|
|
||||||
|
### Current stance
|
||||||
|
|
||||||
|
- Defensive `wait_for_no_more_peers` bound landed
|
||||||
|
(good hygiene regardless). Revealing a real
|
||||||
|
deadlock-avoidance gap in tractor's cleanup.
|
||||||
|
- Test still hangs → skip-mark restored on
|
||||||
|
`test_nested_multierrors[subint_forkserver]`.
|
||||||
|
- The full chain of unbounded waits needs another
|
||||||
|
session of drilling, probably at
|
||||||
|
`open_root_actor` / `actor.cancel` level.
|
||||||
|
|
||||||
|
### Summary of this investigation's wins
|
||||||
|
|
||||||
|
1. **FD hygiene fix** (`_close_inherited_fds`) —
|
||||||
|
correct, closed orphan-SIGINT sibling issue.
|
||||||
|
2. **pidfd-based `_ForkedProc.wait`** — cancellable,
|
||||||
|
matches trio_proc pattern.
|
||||||
|
3. **`_parent_chan_cs` wiring** —
|
||||||
|
`Actor.cancel()` now breaks the shielded parent-
|
||||||
|
chan `process_messages` loop.
|
||||||
|
4. **`wait_for_no_more_peers` bounded** —
|
||||||
|
prevents the actor-level finally hang.
|
||||||
|
5. **Ruled-out hypotheses:** tree-kill missing
|
||||||
|
(wrong), stuck socket recv (wrong).
|
||||||
|
6. **Pinpointed remaining unknown:** at least one
|
||||||
|
more unbounded wait in the teardown cascade
|
||||||
|
above `async_main`. Concrete candidates
|
||||||
|
enumerated above.
|
||||||
|
|
||||||
|
## Update — 2026-04-23 (VERY late): pytest capture pipe IS the final gate
|
||||||
|
|
||||||
|
After landing fixes 1-4 and instrumenting every
|
||||||
|
layer down to `tractor_test`'s `trio.run(_main)`:
|
||||||
|
|
||||||
|
**Empirical result: with `pytest -s` the test PASSES
|
||||||
|
in 6.20s.** Without `-s` (default `--capture=fd`) it
|
||||||
|
hangs forever.
|
||||||
|
|
||||||
|
DIAG timeline for the root pytest PID (with `-s`
|
||||||
|
implied from later verification):
|
||||||
|
|
||||||
|
```
|
||||||
|
tractor_test: about to trio.run(_main)
|
||||||
|
open_root_actor: async_main task started, yielding to test body
|
||||||
|
_main: about to await wrapped test fn
|
||||||
|
_main: wrapped RETURNED cleanly ← test body completed!
|
||||||
|
open_root_actor: about to actor.cancel(None)
|
||||||
|
Actor.cancel ENTER req_chan=False
|
||||||
|
Actor.cancel RETURN
|
||||||
|
open_root_actor: actor.cancel RETURNED
|
||||||
|
open_root_actor: outer FINALLY
|
||||||
|
open_root_actor: finally END (returning from ctxmgr)
|
||||||
|
tractor_test: trio.run FINALLY (returned or raised) ← trio.run fully returned!
|
||||||
|
```
|
||||||
|
|
||||||
|
`trio.run()` fully returns. The test body itself
|
||||||
|
completes successfully (pytest.raises absorbed the
|
||||||
|
expected `BaseExceptionGroup`). What blocks is
|
||||||
|
**pytest's own stdout/stderr capture** — under
|
||||||
|
`--capture=fd` default, pytest replaces the parent
|
||||||
|
process's fd 1,2 with pipe write-ends it's reading
|
||||||
|
from. Fork children inherit those pipe fds
|
||||||
|
(because `_close_inherited_fds` correctly preserves
|
||||||
|
stdio). High-volume subactor error-log tracebacks
|
||||||
|
(7+ actors each logging multiple
|
||||||
|
`RemoteActorError`/`ExceptionGroup` tracebacks on
|
||||||
|
the error-propagation cascade) fill the 64KB Linux
|
||||||
|
pipe buffer. Subactor writes block. Subactor can't
|
||||||
|
progress. Process doesn't exit. Parent's
|
||||||
|
`_ForkedProc.wait` (now pidfd-based and
|
||||||
|
cancellable, but nothing's cancelling here since
|
||||||
|
the test body already completed) keeps the pipe
|
||||||
|
reader alive... but pytest isn't draining its end
|
||||||
|
fast enough because test-teardown/fixture-cleanup
|
||||||
|
is in progress.
|
||||||
|
|
||||||
|
**Actually** the exact mechanism is slightly
|
||||||
|
different: pytest's capture fixture MIGHT be
|
||||||
|
actively reading, but faster-than-writer subactors
|
||||||
|
overflow its internal buffer. Or pytest might be
|
||||||
|
blocked itself on the finalization step.
|
||||||
|
|
||||||
|
Either way, `-s` conclusively fixes it.
|
||||||
|
|
||||||
|
### Why I ruled this out earlier (and shouldn't have)
|
||||||
|
|
||||||
|
Earlier in this investigation I tested
|
||||||
|
`test_nested_multierrors` with/without `-s` and
|
||||||
|
both hung. That's because AT THAT TIME, fixes 1-4
|
||||||
|
weren't all in place yet. The test was hanging at
|
||||||
|
multiple deeper levels long before reaching the
|
||||||
|
"generate lots of error-log output" phase. Once
|
||||||
|
the cascade actually tore down cleanly, enough
|
||||||
|
output was produced to hit the capture-pipe limit.
|
||||||
|
|
||||||
|
**Classic order-of-operations mistake in
|
||||||
|
debugging:** ruling something out too early based
|
||||||
|
on a test that was actually failing for a
|
||||||
|
different reason.
|
||||||
|
|
||||||
|
### Fix direction (next session)
|
||||||
|
|
||||||
|
Redirect subactor stdout/stderr to `/dev/null` (or
|
||||||
|
a session-scoped log file) in the fork-child
|
||||||
|
prelude, right after `_close_inherited_fds()`. This
|
||||||
|
severs the inherited pytest-capture pipes and lets
|
||||||
|
subactor output flow elsewhere. Under normal
|
||||||
|
production use (non-pytest), stdout/stderr would
|
||||||
|
be the TTY — we'd want to keep that. So the
|
||||||
|
redirect should be conditional or opt-in via the
|
||||||
|
`child_sigint`/proc_kwargs flag family.
|
||||||
|
|
||||||
|
Alternative: document as a gotcha and recommend
|
||||||
|
`pytest -s` for any tests using the
|
||||||
|
`subint_forkserver` backend with multi-level actor
|
||||||
|
trees. Simpler, user-visible, no code change.
|
||||||
|
|
||||||
|
### Current state
|
||||||
|
|
||||||
|
- Skip-mark on `test_nested_multierrors[subint_forkserver]`
|
||||||
|
restored with reason pointing here.
|
||||||
|
- Test confirmed passing with `-s` after all 4
|
||||||
|
cascade fixes applied.
|
||||||
|
- The 4 cascade fixes are NOT wasted — they're
|
||||||
|
correct hardening regardless of the capture-pipe
|
||||||
|
issue, AND without them we'd never reach the
|
||||||
|
"actually produces enough output to fill the
|
||||||
|
pipe" state.
|
||||||
|
|
||||||
## Stopgap (landed)
|
## Stopgap (landed)
|
||||||
|
|
||||||
`test_nested_multierrors` skip-marked under
|
`test_nested_multierrors` skip-marked under
|
||||||
|
|
|
||||||
|
|
@ -24,7 +24,7 @@ Part of this work should include,
|
||||||
is happening under the hood with how cpython implements subints.
|
is happening under the hood with how cpython implements subints.
|
||||||
|
|
||||||
* default configuration should encourage state isolation as with
|
* default configuration should encourage state isolation as with
|
||||||
with subprocs, but explicit public escape hatches to enable rigorously
|
subprocs, but explicit public escape hatches to enable rigorously
|
||||||
managed shm channels for high performance apps.
|
managed shm channels for high performance apps.
|
||||||
|
|
||||||
- all tests should be (able to be) parameterized to use the new
|
- all tests should be (able to be) parameterized to use the new
|
||||||
|
|
|
||||||
|
|
@ -446,15 +446,23 @@ def _process_alive(pid: int) -> bool:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
# NOTE: was previously `@pytest.mark.xfail(strict=True, ...)`
|
# Regressed back to xfail: previously passed after the
|
||||||
# for the orphan-SIGINT hang documented in
|
# fork-child FD-hygiene fix in `_close_inherited_fds()`,
|
||||||
# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
# but the recent `wait_for_no_more_peers(move_on_after=3.0)`
|
||||||
# — now passes after the fork-child FD-hygiene fix in
|
# bound in `async_main`'s teardown added up to 3s to the
|
||||||
# `tractor.spawn._subint_forkserver._close_inherited_fds()`:
|
# orphan subactor's exit timeline, pushing it past the
|
||||||
# closing all inherited FDs (including the parent's IPC
|
# test's 10s poll window. Real fix requires making the
|
||||||
# listener + trio-epoll + wakeup-pipe FDs) lets the child's
|
# bounded wait faster when the actor is orphaned, or
|
||||||
# trio event loop respond cleanly to external SIGINT.
|
# increasing the test's poll window. See tracker doc
|
||||||
# Leaving the test in place as a regression guard.
|
# `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`.
|
||||||
|
@pytest.mark.xfail(
|
||||||
|
strict=True,
|
||||||
|
reason=(
|
||||||
|
'Regressed to xfail after `wait_for_no_more_peers` '
|
||||||
|
'bound added ~3s teardown latency. Needs either '
|
||||||
|
'faster orphan-side teardown or 15s test poll window.'
|
||||||
|
),
|
||||||
|
)
|
||||||
@pytest.mark.timeout(
|
@pytest.mark.timeout(
|
||||||
30,
|
30,
|
||||||
method='thread',
|
method='thread',
|
||||||
|
|
|
||||||
|
|
@ -455,14 +455,16 @@ async def spawn_and_error(
|
||||||
@pytest.mark.skipon_spawn_backend(
|
@pytest.mark.skipon_spawn_backend(
|
||||||
'subint_forkserver',
|
'subint_forkserver',
|
||||||
reason=(
|
reason=(
|
||||||
'Multi-level fork-spawn cancel cascade hang — '
|
'Passes cleanly with `pytest -s` (no stdout capture) '
|
||||||
'peer-channel `process_messages` loops do not '
|
'but hangs under default `--capture=fd` due to '
|
||||||
'exit on `service_tn.cancel_scope.cancel()`. '
|
'pytest-capture-pipe buffer fill from high-volume '
|
||||||
'See `ai/conc-anal/'
|
'subactor error-log traceback output inherited via fds '
|
||||||
|
'1,2 in fork children. Fix direction: redirect subactor '
|
||||||
|
'stdout/stderr to `/dev/null` in `_child_target` / '
|
||||||
|
'`_actor_child_main` so forkserver children don\'t hold '
|
||||||
|
'pytest\'s capture pipe open. See `ai/conc-anal/'
|
||||||
'subint_forkserver_test_cancellation_leak_issue.md` '
|
'subint_forkserver_test_cancellation_leak_issue.md` '
|
||||||
'for the full diagnosis + candidate fix directions. '
|
'"Update — pytest capture pipe is the final gate".'
|
||||||
'Drop this mark once the peer-chan-loop exit issue '
|
|
||||||
'is closed.'
|
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
@pytest.mark.timeout(
|
@pytest.mark.timeout(
|
||||||
|
|
|
||||||
|
|
@ -1973,7 +1973,25 @@ async def async_main(
|
||||||
f' {pformat(ipc_server._peers)}'
|
f' {pformat(ipc_server._peers)}'
|
||||||
)
|
)
|
||||||
log.runtime(teardown_report)
|
log.runtime(teardown_report)
|
||||||
await ipc_server.wait_for_no_more_peers()
|
# NOTE: bound the peer-clear wait — otherwise if any
|
||||||
|
# peer-channel handler is stuck (e.g. never got its
|
||||||
|
# cancel propagated due to a runtime bug), this wait
|
||||||
|
# blocks forever and deadlocks the whole actor-tree
|
||||||
|
# teardown cascade. 3s is enough for any graceful
|
||||||
|
# cancel-ack round-trip; beyond that we're in bug
|
||||||
|
# territory and need to proceed with local teardown
|
||||||
|
# so the parent's `_ForkedProc.wait()` can unblock.
|
||||||
|
# See `ai/conc-anal/
|
||||||
|
# subint_forkserver_test_cancellation_leak_issue.md`
|
||||||
|
# for the full diagnosis.
|
||||||
|
with trio.move_on_after(3.0) as _peers_cs:
|
||||||
|
await ipc_server.wait_for_no_more_peers()
|
||||||
|
if _peers_cs.cancelled_caught:
|
||||||
|
teardown_report += (
|
||||||
|
f'-> TIMED OUT waiting for peers to clear '
|
||||||
|
f'({len(ipc_server._peers)} still connected)\n'
|
||||||
|
)
|
||||||
|
log.warning(teardown_report)
|
||||||
|
|
||||||
teardown_report += (
|
teardown_report += (
|
||||||
'-]> all peer channels are complete.\n'
|
'-]> all peer channels are complete.\n'
|
||||||
|
|
|
||||||
|
|
@ -252,9 +252,13 @@ def _close_inherited_fds(
|
||||||
os.close(fd)
|
os.close(fd)
|
||||||
closed += 1
|
closed += 1
|
||||||
except OSError:
|
except OSError:
|
||||||
# fd was already closed (race with listdir) or
|
# fd was already closed (race with listdir) or otherwise
|
||||||
# otherwise unclosable — either is fine.
|
# unclosable — either is fine.
|
||||||
pass
|
log.exception(
|
||||||
|
f'Failed to close inherited fd in child ??\n'
|
||||||
|
f'{fd!r}\n'
|
||||||
|
)
|
||||||
|
|
||||||
return closed
|
return closed
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -401,11 +405,17 @@ def fork_from_worker_thread(
|
||||||
try:
|
try:
|
||||||
os.close(rfd)
|
os.close(rfd)
|
||||||
except OSError:
|
except OSError:
|
||||||
pass
|
log.exception(
|
||||||
|
f'Failed to close PID-pipe read-fd in parent ??\n'
|
||||||
|
f'{rfd!r}\n'
|
||||||
|
)
|
||||||
try:
|
try:
|
||||||
os.close(wfd)
|
os.close(wfd)
|
||||||
except OSError:
|
except OSError:
|
||||||
pass
|
log.exception(
|
||||||
|
f'Failed to close PID-pipe write-fd in parent ??\n'
|
||||||
|
f'{wfd!r}\n'
|
||||||
|
)
|
||||||
raise RuntimeError(
|
raise RuntimeError(
|
||||||
f'subint-forkserver worker thread '
|
f'subint-forkserver worker thread '
|
||||||
f'{thread_name!r} did not return within '
|
f'{thread_name!r} did not return within '
|
||||||
|
|
@ -475,6 +485,13 @@ def run_subint_in_worker_thread(
|
||||||
_interpreters.exec(interp_id, bootstrap)
|
_interpreters.exec(interp_id, bootstrap)
|
||||||
except BaseException as e:
|
except BaseException as e:
|
||||||
err = e
|
err = e
|
||||||
|
log.exception(
|
||||||
|
f'Failed to .exec() in subint ??\n'
|
||||||
|
f'_interpreters.exec(\n'
|
||||||
|
f' interp_id={interp_id!r},\n'
|
||||||
|
f' bootstrap={bootstrap!r},\n'
|
||||||
|
f') => {err!r}\n'
|
||||||
|
)
|
||||||
|
|
||||||
worker: threading.Thread = threading.Thread(
|
worker: threading.Thread = threading.Thread(
|
||||||
target=_drive,
|
target=_drive,
|
||||||
|
|
|
||||||
|
|
@ -246,7 +246,11 @@ async def trio_proc(
|
||||||
await proc.wait()
|
await proc.wait()
|
||||||
|
|
||||||
await debug.maybe_wait_for_debugger(
|
await debug.maybe_wait_for_debugger(
|
||||||
child_in_debug=get_runtime_vars().get(
|
# NOTE: use the child's `_runtime_vars`
|
||||||
|
# (the fn-arg dict shipped via `SpawnSpec`)
|
||||||
|
# — NOT `get_runtime_vars()` which returns
|
||||||
|
# the *parent's* live runtime state.
|
||||||
|
child_in_debug=_runtime_vars.get(
|
||||||
'_debug_mode', False
|
'_debug_mode', False
|
||||||
),
|
),
|
||||||
header_msg=(
|
header_msg=(
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue