From b71705bdcd480adacd7ef4aa9528a613fd16e2b3 Mon Sep 17 00:00:00 2001
From: goodboy
Date: Thu, 23 Apr 2026 15:21:41 -0400
Subject: [PATCH] Refine `subint_forkserver` nested-cancel hang diagnosis
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to be
   a self-inflicted cleanup bug: `pkill`-ing a bg pytest
   task (SIGTERM/SIGKILL, no SIGINT) skipped the SC
   graceful cancel cascade entirely. Codified the real
   fix — SIGINT-first ladder w/ bounded wait before
   SIGKILL — in e5e2afb5 (`run-tests` SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]` hangs
   indefinitely** — the actual backend bug, and it's a
   deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers are
  in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never reaches
  its targets
- tree-structure evidence: asymmetric depth across two
  identical `run_in_actor` calls — child 1 (3 threads)
  spawns both its grandchildren; child 2 (1 thread)
  never completes its first nursery `run_in_actor`.
  Smells like a race on fork-inherited state landing
  differently per spawn ordering
- new hypothesis: `os.fork()` from a subactor inherits
  the ROOT parent's IPC listener FDs transitively.
  Grandchildren end up with three overlapping FD sets
  (own + direct-parent + root), so IPC routing becomes
  ambiguous.
  Predicts bug scales with fork depth — matches
  reality: single-level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation (each
  subactor has its own OS process+GIL), child-side KBI
  absorption (`_trio_main` only catches KBI at
  `trio.run()` callsite, reached only on trio-loop
  exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit, (3)
  targeted FD cleanup via `actor.ipc_server` handle,
  (4) `os.posix_spawn` w/ `file_actions`. Vote: (3) —
  surgical, doesn't break the "no exec" design of
  `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=2,
  depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` +
  multi-level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until fix
  lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.
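For reference, the SIGINT-first ladder reads roughly like the
following POSIX-sh helper (illustrative sketch only; the
canonical wording lives in the `run-tests` SKILL and the
feedback doc, and `graceful_kill` is a made-up name, not an
existing script):

```shell
# Sketch of the SIGINT-first ladder: graceful cancel, bounded
# wait, SIGKILL only as a last resort.
graceful_kill() {
    pid=$1
    timeout=${2:-10}
    # 1) graceful: SIGINT so the SC cancel cascade can run
    kill -INT "$pid" 2>/dev/null
    # 2) bounded wait (NB: `kill -0` also sees zombies, so this
    #    loop can spin to the deadline even after a clean exit)
    i=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$i" -ge "$timeout" ]; then
            # 3) deadline hit: hard-kill fallback
            kill -KILL "$pid" 2>/dev/null
            break
        fi
        sleep 1
        i=$((i + 1))
    done
    # reap so the pid can't linger as a zombie (when it's ours)
    wait "$pid" 2>/dev/null
}
```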
(this patch was generated in some part by
[`claude-code`][claude-code-gh])

[claude-code-gh]: https://github.com/anthropics/claude-code
---
 ...forkserver_test_cancellation_leak_issue.md | 436 ++++++++++++------
 1 file changed, 302 insertions(+), 134 deletions(-)

diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
index 987b2b52..4ca722cc 100644
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -1,165 +1,333 @@
-# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
+# `subint_forkserver` backend: `test_cancellation.py` multi-level cancel cascade hang
 
 Follow-up tracker: surfaced while wiring the new
 `subint_forkserver` spawn backend into the full tractor
-test matrix (step 2 of the post-backend-lands plan;
-see also
-`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
+test matrix (step 2 of the post-backend-lands plan).
+See also
+`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
+— sibling tracker for a different forkserver-teardown
+class which probably shares the same fundamental root
+cause (fork-FD-inheritance across nested spawns).
 
 ## TL;DR
 
-Running `tests/test_cancellation.py` under
-`--spawn-backend=subint_forkserver` reproducibly leaks
-**exactly 5 `subint-forkserv` comm-named child processes**
-after the pytest session exits. Both previously-run
-sessions produced the same 5-process signature — not a
-flake. Each leaked process holds a `LISTEN` on the
-default registry TCP addr (`127.0.0.1:1616`), which
-poisons any subsequent tractor test session that
-defaults to that addr.
+`tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]`
+hangs indefinitely under our new backend.
+The hang is
+**inside the graceful IPC cancel cascade** — every actor
+in the multi-level tree parks in `epoll_wait` waiting
+for IPC messages that never arrive. Not a hard-kill /
+tree-reap issue (we don't reach the hard-kill fallback
+path at all).
 
-## Stopgap (not the real fix)
+Working hypothesis (unverified): **`os.fork()` from a
+subactor inherits the root parent's IPC listener socket
+FDs**. When a first-level subactor forkserver-spawns a
+grandchild, that grandchild inherits both its direct
+spawner's FDs AND the root's FDs — IPC message routing
+becomes ambiguous (or silently sends to the wrong
+channel), so the cancel cascade can't reach its target.
 
-Multiple tests in `test_cancellation.py` were calling
-`tractor.open_nursery()` **without** passing
-`registry_addrs=[reg_addr]`, i.e. falling back on the
-default `:1616`. The commit accompanying this doc wires
-the `reg_addr` fixture through those tests so each run
-gets a session-unique port — leaked zombies can no
-longer poison **other** tests (they hold their own
-unique port instead).
+## Corrected diagnosis vs. earlier draft
 
-Tests touched (in `tests/test_cancellation.py`):
+An earlier version of this doc claimed the root cause
+was **"forkserver teardown doesn't tree-kill
+descendants"** (SIGKILL only reaches the direct child,
+grandchildren survive and hold TCP `:1616`). That
+diagnosis was **wrong**, caused by conflating two
+observations:
 
-- `test_cancel_infinite_streamer`
-- `test_some_cancels_all`
-- `test_nested_multierrors`
-- `test_cancel_via_SIGINT`
-- `test_cancel_via_SIGINT_other_task`
+1. *5-zombie leak holding :1616* — happened in my own
+   workflow when I aborted a bg pytest task with
+   `pkill` (SIGTERM/SIGKILL, not SIGINT). The abrupt
+   kill skipped the graceful `ActorNursery.__aexit__`
+   cancel cascade entirely, orphaning descendants to
+   init.
+   **This was my cleanup bug, not a forkserver
+   teardown bug.** Codified the fix (SIGINT-first +
+   bounded wait before SIGKILL) in
+   `feedback_sc_graceful_cancel_first.md` +
+   `.claude/skills/run-tests/SKILL.md`.
+2. *`test_nested_multierrors` hangs indefinitely* —
+   the real, separate, forkserver-specific bug
+   captured by this doc.
 
-This is a **suite-hygiene fix** — it doesn't close the
-actual leak; it just stops the leak from blast-radiusing.
-Zombie descendants still accumulate per run.
+The two symptoms are unrelated. The tree-kill / setpgrp
+fix direction proposed earlier would not help (1) (SC-
+graceful-cleanup is the right answer there) and would
+not help (2) (the hang is in the cancel cascade, not
+in the hard-kill fallback).
 
-## The real bug (unfixed)
+## Symptom
 
-`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
-(plain `os.kill(SIGKILL)` to the direct child pid) +
-`proc.wait()` — does **not** reap grandchildren or
-deeper descendants. When a cancellation test causes a
-multi-level actor tree to tear down, the direct child
-dies but its own children survive and get reparented to
-init (PID 1), where they stay running with their
-inherited FDs (including the registry listen socket).
+Reproducer (py3.14, clean env):
 
-**Symptom on repro:**
+```sh
+# preflight: ensure clean env
+ss -tlnp 2>/dev/null | grep ':1616' && echo 'FOUL — cleanup first!' || echo 'clean'
-
-```
-$ ss -tlnp 2>/dev/null | grep ':1616'
-LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
-    users:(("subint-forkserv",pid=211595,fd=17),
-           ("subint-forkserv",pid=211585,fd=17),
-           ("subint-forkserv",pid=211583,fd=17),
-           ("subint-forkserv",pid=211576,fd=17),
-           ("subint-forkserv",pid=211572,fd=17))
-
-$ for p in 211572 211576 211583 211585 211595; do
-    cat /proc/$p/cmdline | tr '\0' ' '; echo; done
 ./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
-    tests/test_cancellation.py --timeout=30 --timeout-method=signal \
-    --tb=no -q --no-header
-... (x5, all same cmdline — inherited from fork)
+    'tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]' \
+    --timeout=30 --timeout-method=thread --tb=short -v
 ```
-All 5 share the pytest cmdline because `os.fork()`
-without `exec()` preserves the parent's argv. Their
-comm-name (`subint-forkserv`) is the `thread_name` we
-pass to the fork-worker thread in
-`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
+Expected: `pytest-timeout` fires at 30s with a thread-
+dump banner, but the process itself **remains alive
+after timeout** and doesn't unwedge on subsequent
+SIGINT. Requires SIGKILL to reap.
 
-## Why 5?
+## Evidence (tree structure at hang point)
 
-Not confirmed; guess is 5 = the parametrize cardinality
-of one of the leaky tests (e.g. `test_some_cancels_all`
-has 5 parametrize cases). Each param-case spawns a
-nested tree; each leaks exactly one descendant. Worth
-verifying by running each parametrize-case individually
-and counting leaked procs per case.
+All 5 processes are kernel-level `S` (sleeping) in
+`do_epoll_wait` (trio's event loop waiting on I/O):
 
-## Ruled out
-
-- **`:1616` collision from a different repo** (e.g.
-  piker): `/proc/$pid/cmdline` + `cwd` both resolve to
-  the tractor repo's `py314/` venv for all 5. These are
-  definitively spawned by our test run.
-- **Parent-side `_ForkedProc.wait()` regressed**: the
-  direct child's teardown completes cleanly (exit-code
-  captured, `waitpid` returns); the 5 survivors are
-  deeper-descendants whose parent-side shim has no
-  handle on them. So the bug isn't in
-  `_ForkedProc.wait()` — it's in the lack of tree-
-  level descendant enumeration + reaping during nursery
-  teardown.
-
-## Likely fix directions
-
-1. **Process-group-scoped spawn + tree kill.** Put each
-   forkserver-spawned subactor into its own process
-   group (`os.setpgrp()` in the fork child), then on
-   teardown `os.killpg(pgid, SIGKILL)` to reap the
-   whole tree atomically. Simplest, most surgical.
-2. **Subreaper registration.** Use
-   `PR_SET_CHILD_SUBREAPER` on the tractor root so
-   orphaned grandchildren reparent to the root rather
-   than init — then we can `waitpid` them from the
-   parent-side nursery teardown. More invasive.
-3. **Explicit descendant enumeration at teardown.**
-   In `subint_forkserver_proc`'s finally block, walk
-   `/proc//task/*/children` before issuing SIGKILL
-   to build a descendant-pid set; then kill + reap all
-   of them. Fragile (Linux-only, proc-fs-scan race).
-
-Vote: **(1)** — clean, POSIX-standard, aligns with how
-`subprocess.Popen` (and by extension `trio.lowlevel.
-open_process`) handle tree-kill semantics on
-kwargs-supplied `start_new_session=True`.
-
-## Reproducer
-
-```sh
-# before: ensure clean env
-ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
-
-# run the leaky tests
-./py314/bin/python -m pytest \
-    --spawn-backend=subint_forkserver \
-    tests/test_cancellation.py \
-    --timeout=30 --timeout-method=signal --tb=no -q --no-header
-
-# observe: 5 leaked children now holding :1616
-ss -tlnp 2>/dev/null | grep ':1616'
 ```
+PID     PPID    THREADS  NAME             ROLE
+333986  1       2        subint-forkserv  pytest main (the test body)
+333993  333986  3        subint-forkserv  "child 1" spawner subactor
+ 334003 333993  1        subint-forkserv  grandchild errorer under child-1
+ 334014 333993  1        subint-forkserv  grandchild errorer under child-1
+333999  333986  1        subint-forkserv  "child 2" spawner subactor (NO grandchildren!)
 ```
-Expected output: `subint-forkserv` processes listed as
-listeners on `:1616`. Cleanup:
-
-```sh
-pkill -9 -f \
-    "$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
 ```
+### Asymmetric tree depth
+
+The test's `spawn_and_error(breadth=2, depth=3)` should
+have BOTH direct children spawning 2 grandchildren
+each, going 3 levels deep. Reality:
+
+- Child 1 (333993, 3 threads) DID spawn its two
+  grandchildren as expected — fully booted trio
+  runtime.
+- Child 2 (333999, 1 thread) did NOT spawn any
+  grandchildren — clearly never completed its
+  nursery's first `run_in_actor`. Its 1-thread state
+  suggests the runtime never fully booted (no trio
+  worker threads for `waitpid`/IPC).
+
+This asymmetry is the key clue: the two direct
+children started identically but diverged. Probably a
+race around fork-inherited state (listener FDs,
+subactor-nursery channel state) that happens to land
+differently depending on spawn ordering.
+
+### Parent-side state
+
+Thread-dump of pytest main (333986) at the hang:
+
+- Main trio thread — parked in
+  `trio._core._io_epoll.get_events` (epoll_wait on
+  its event loop). Waiting for IPC from children.
+- Two trio-cache worker threads — each parked in
+  `outcome.capture(sync_fn)` calling
+  `os.waitpid(child_pid, 0)`. These are our
+  `_ForkedProc.wait()` off-loads. They're waiting for
+  the direct children to exit — but children are
+  stuck in their own epoll_wait waiting for IPC from
+  the parent.
+
+**It's a deadlock, not a leak:** the parent is
+correctly running `soft_kill(proc, _ForkedProc.wait,
+portal)` (graceful IPC cancel via
+`Portal.cancel_actor()`), but the children never
+acknowledge the cancel message (or the message never
+reaches them through the tangled post-fork IPC).
+
+## What's NOT the cause (ruled out)
+
+- **`_ForkedProc.kill()` only SIGKILLs direct pid /
+  missing tree-kill**: doesn't apply — we never reach
+  the hard-kill path. The deadlock is in the graceful
+  cancel cascade.
+- **Port `:1616` contention**: ruled out after the
+  `reg_addr` fixture-wiring fix; each test session
+  gets a unique port now.
+- **GIL starvation / SIGINT pipe filling** (class-A,
+  `subint_sigint_starvation_issue.md`): doesn't apply
+  — each subactor is its own OS process with its own
+  GIL (not legacy-config subint).
+- **Child-side `_trio_main` absorbing KBI**: grep
+  confirmed; `_trio_main` only catches KBI at the
+  `trio.run()` callsite, which is reached only if the
+  trio loop exits normally. The children here never
+  exit trio.run() — they're wedged inside.
+
+## Hypothesis: FD inheritance across nested forks
+
+`subint_forkserver_proc` calls
+`fork_from_worker_thread()` which ultimately does
+`os.fork()` from a dedicated worker thread. Standard
+Linux/POSIX fork semantics: **the child inherits ALL
+open FDs from the parent**, including listener
+sockets, epoll fds, trio wakeup pipes, and the
+parent's IPC channel sockets.
+
+At root-actor fork-spawn time, the root's IPC server
+listener FDs are open in the parent. Those get
+inherited by child 1. Child 1 then forkserver-spawns
+its OWN subactor (grandchild). The grandchild
+inherits FDs from child 1 — but child 1's address
+space still contains **the root's IPC listener FDs
+too** (inherited at first fork). So the grandchild
+has THREE sets of FDs:
+
+1. Its own (created after becoming a subactor).
+2. Its direct parent child-1's.
+3. The ROOT's (grandparent's) — inherited transitively.
+
+IPC message routing may be ambiguous in this tangled
+state. Or a listener socket that the root thinks it
+owns is actually open in multiple processes, and
+messages sent to it go to an arbitrary one. That
+would exactly match the observed "graceful cancel
+never propagates".
+
+This hypothesis predicts the bug **scales with fork
+depth**: single-level forkserver spawn
+(`test_subint_forkserver_spawn_basic`) works
+perfectly, but any test that spawns a second level
+deadlocks. Matches observations so far.
+
+## Fix directions (to validate)
+
+### 1. `close_fds=True` equivalent in `fork_from_worker_thread()`
+
+`subprocess.Popen` / `trio.lowlevel.open_process` have
+`close_fds=True` by default on POSIX — they
+enumerate open FDs in the child post-fork and close
+everything except stdio + any explicitly-passed FDs.
+Our raw `os.fork()` doesn't. Adding the equivalent to
+our `_worker` prelude would isolate each fork
+generation's FD set.
+
+Implementation sketch in
+`tractor.spawn._subint_forkserver.fork_from_worker_thread._worker`:
+
+```python
+def _worker() -> None:
+    pid: int = os.fork()
+    if pid == 0:
+        # CHILD: close inherited FDs except stdio + the
+        # pid-pipe we just opened.
+        keep: set[int] = {0, 1, 2, rfd, wfd}
+        import resource
+        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
+        # blunt range scan; or enumerate /proc/self/fd
+        for fd in range(3, soft):
+            if fd not in keep:
+                try:
+                    os.close(fd)
+                except OSError:
+                    pass
+        # ... then child_target() as before
+```
+
+Problem: overly aggressive — closes FDs the
+grandchild might legitimately need (e.g. its parent's
+IPC channel for the spawn-spec handshake, if we rely
+on that). Needs thought about which FDs are
+"inheritable and safe" vs. "inherited by accident".
+
+### 2. Cloexec on tractor's own FDs
+
+Set `FD_CLOEXEC` on tractor-created sockets (listener
+sockets, IPC channel sockets, pipes). This flag
+causes automatic close on `execve`, but since we
+`fork()` without `exec()`, this alone doesn't help.
+BUT — combined with a child-side explicit close-
+non-cloexec loop, it gives us a way to mark "my
+private FDs" vs. "safe to inherit". Most robust, but
+requires tractor-wide audit.
+
+### 3. Explicit FD cleanup in `_ForkedProc`/`_child_target`
+
+Have `subint_forkserver_proc`'s `_child_target`
+closure explicitly close the parent-side IPC listener
+FDs before calling `_actor_child_main`. Requires
+being able to enumerate "the parent's listener FDs
+that the child shouldn't keep" — plausible via
+`Actor.ipc_server`'s socket objects.
+
+### 4. Use `os.posix_spawn` with explicit `file_actions`
+
+Instead of raw `os.fork()`, use `os.posix_spawn()`
+which supports explicit file-action specifications
+(close this FD, dup2 that FD). Cleaner semantics, but
+probably incompatible with our "no exec" requirement
+(subint_forkserver is a fork-without-exec design).
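Whichever direction wins, the premise they all rest on is plain
POSIX behavior and easy to sanity-check: a listener opened
pre-fork stays open in the `os.fork()` child regardless of
cloexec/inheritable flags (those only matter across `exec()`),
until something closes it explicitly. A stdlib-only sketch
(standalone demo, not tractor code):

```python
import os
import socket

# Demo: a listening socket created pre-fork is still open in the
# os.fork() child, and a targeted os.close() of exactly that FD
# (the core move of fix direction 3) cleanly removes it.
lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.bind(('127.0.0.1', 0))
lsock.listen()
fd: int = lsock.fileno()

pid: int = os.fork()
if pid == 0:
    # CHILD: the inherited FD is alive..
    os.fstat(fd)  # would raise OSError were it closed
    os.close(fd)  # ..targeted cleanup of just this listener FD
    try:
        os.fstat(fd)
        os._exit(1)  # still open?! demo failed
    except OSError:
        os._exit(0)  # gone, as intended

# PARENT: reap + verify the child saw both states.
_, status = os.waitpid(pid, 0)
lsock.close()
assert os.waitstatus_to_exitcode(status) == 0
print('inherited listener visible in child; targeted close worked')
```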
+
+**Likely correct answer: (3) — targeted FD cleanup
+via `actor.ipc_server` handle.** (1) is too blunt,
+(2) is too wide-ranging, (4) changes the spawn
+mechanism.
+
+## Reproducer (standalone, no pytest)
+
+```python
+# save as /tmp/forkserver_nested_hang_repro.py (py3.14+)
+import trio, tractor
+
+async def assert_err():
+    assert 0
+
+async def spawn_and_error(breadth: int = 2, depth: int = 1):
+    async with tractor.open_nursery() as n:
+        for i in range(breadth):
+            if depth > 0:
+                await n.run_in_actor(
+                    spawn_and_error,
+                    breadth=breadth,
+                    depth=depth - 1,
+                    name=f'spawner_{i}_{depth}',
+                )
+            else:
+                await n.run_in_actor(
+                    assert_err,
+                    name=f'errorer_{i}',
+                )
+
+async def _main():
+    # NB: `trio.fail_after()` must run *inside* the trio
+    # runtime, so it wraps the nursery here rather than
+    # the `trio.run()` call below.
+    with trio.fail_after(20):
+        async with tractor.open_nursery() as n:
+            for i in range(2):
+                await n.run_in_actor(
+                    spawn_and_error,
+                    name=f'top_{i}',
+                    breadth=2,
+                    depth=1,
+                )
+
+if __name__ == '__main__':
+    from tractor.spawn._spawn import try_set_start_method
+    try_set_start_method('subint_forkserver')
+    trio.run(_main)
+```
+
+Expected (current): wedges inside the
+`trio.fail_after(20)` scope in `_main` — children
+never ack the error-propagation cancel cascade.
+Pattern: top 2 direct children, 4 grandchildren, 1
+errorer deadlocks while trying to unwind through its
+parent chain.
+
+After fix: completes without `trio.TooSlowError`; the
+root's `open_nursery` receives the
+`BaseExceptionGroup` containing the `AssertionError`
+from the errorer and unwinds cleanly.
+
+## Stopgap (landed)
+
+Until the fix lands, `test_nested_multierrors` +
+related multi-level-spawn tests can be skip-marked
+under `subint_forkserver` via
+`@pytest.mark.skipon_spawn_backend('subint_forkserver',
+reason='...')`. Cross-ref this doc.
+
 ## References
 
-- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
-  — the current teardown shim; PID-scoped, not tree-
-  scoped.
+- `tractor/spawn/_subint_forkserver.py::fork_from_worker_thread`
+  — the primitive whose post-fork FD hygiene is
+  probably the culprit.
 - `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
+  — the backend function that orchestrates the
+  graceful cancel path hitting this bug.
+- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
+  — the `trio.Process`-compatible shim; NOT the
+  failing component (confirmed via thread-dump).
+- `tests/test_cancellation.py::test_nested_multierrors`
+  — the test that surfaced the hang.
 - `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
-  — sibling tracker for a different forkserver-teardown
-  class (orphaned child doesn't respond to SIGINT); may
-  share root cause with this one once the fix lands.
+  — sibling hang class; probably same underlying
+  fork-FD-inheritance root cause.
 - tractor issue #379 — subint backend tracking.