Scrub inherited FDs in fork-child prelude

Implements fix-direction (1)/blunt-close-all-FDs from b71705bd
(`subint_forkserver` nested-cancel hang diag), targeting the
multi-level cancel-cascade deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via the
`actor.ipc_server` handle as the cleanest approach, but going blunt
is actually the right call: after `os.fork()`, the child immediately
enters `_actor_child_main()` which opens its OWN IPC sockets /
wakeup-fd / epoll-fd / etc., so none of the parent's FDs are needed.
Closing everything except stdio is safe AND defends against future
listener/IPC additions to the parent inheriting silently into
children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int` helper. Linux
  fast-path enumerates `/proc/self/fd`; POSIX fallback uses
  `RLIMIT_NOFILE` range. Matches the stdlib
  `subprocess._posixsubprocess.close_fds` strategy. Returns
  close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s post-fork child
  prelude: runs immediately after the pid-pipe `os.close(rfd/wfd)`,
  before the user `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells out the
  FD-inheritance-cascade mechanism and why the close-all approach is
  safe for our spawn shape

Validation pending: re-run
`test_nested_multierrors[subint_forkserver]` to confirm the deadlock
is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Refine `subint_forkserver` nested-cancel hang diagnosis
2026-04-23 15:30:39 -04:00 · 2026-04-23 15:21:41 -04:00
2 changed files with 370 additions and 135 deletions
--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@@ -1,165 +1,333 @@
-# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
+# `subint_forkserver` backend: `test_cancellation.py` multi-level cancel cascade hang
 Follow-up tracker: surfaced while wiring the new
 `subint_forkserver` spawn backend into the full tractor
-test matrix (step 2 of the post-backend-lands plan;
+test matrix (step 2 of the post-backend-lands plan).
-see also
+See also
-`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
+`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
 — sibling tracker for a different forkserver-teardown
 class which probably shares the same fundamental root
 cause (fork-FD-inheritance across nested spawns).
 ## TL;DR
-Running `tests/test_cancellation.py` under
+`tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]`
-`--spawn-backend=subint_forkserver` reproducibly leaks
+hangs indefinitely under our new backend. The hang is
-**exactly 5 `subint-forkserv` comm-named child processes**
+**inside the graceful IPC cancel cascade** — every actor
-after the pytest session exits. Both previously-run
+in the multi-level tree parks in `epoll_wait` waiting
-sessions produced the same 5-process signature — not a
+for IPC messages that never arrive. Not a hard-kill /
-flake. Each leaked process holds a `LISTEN` on the
+tree-reap issue (we don't reach the hard-kill fallback
-default registry TCP addr (`127.0.0.1:1616`), which
+path at all).
 poisons any subsequent tractor test session that
 defaults to that addr.
-## Stopgap (not the real fix)
+Working hypothesis (unverified): **`os.fork()` from a
 subactor inherits the root parent's IPC listener socket
 FDs**. When a first-level subactor forkserver-spawns a
 grandchild, that grandchild inherits both its direct
 spawner's FDs AND the root's FDs — IPC message routing
 becomes ambiguous (or silently sends to the wrong
 channel), so the cancel cascade can't reach its target.
-Multiple tests in `test_cancellation.py` were calling
+## Corrected diagnosis vs. earlier draft
 `tractor.open_nursery()` **without** passing
 `registry_addrs=[reg_addr]`, i.e. falling back on the
 default `:1616`. The commit accompanying this doc wires
 the `reg_addr` fixture through those tests so each run
 gets a session-unique port — leaked zombies can no
 longer poison **other** tests (they hold their own
 unique port instead).
-Tests touched (in `tests/test_cancellation.py`):
+An earlier version of this doc claimed the root cause
 was **"forkserver teardown doesn't tree-kill
 descendants"** (SIGKILL only reaches the direct child,
 grandchildren survive and hold TCP `:1616`). That
 diagnosis was **wrong**, caused by conflating two
 observations:
- `test_cancel_infinite_streamer`
+1. *5-zombie leak holding :1616* — happened in my own
- `test_some_cancels_all`
+   workflow when I aborted a bg pytest task with
- `test_nested_multierrors`
+   `pkill` (SIGTERM/SIGKILL, not SIGINT). The abrupt
- `test_cancel_via_SIGINT`
+   kill skipped the graceful `ActorNursery.__aexit__`
- `test_cancel_via_SIGINT_other_task`
+   cancel cascade entirely, orphaning descendants to
   init. **This was my cleanup bug, not a forkserver
   teardown bug.** Codified the fix (SIGINT-first +
   bounded wait before SIGKILL) in
   `feedback_sc_graceful_cancel_first.md` +
   `.claude/skills/run-tests/SKILL.md`.
 2. *`test_nested_multierrors` hangs indefinitely* —
   the real, separate, forkserver-specific bug
   captured by this doc.
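The SIGINT-first + bounded-wait discipline from item 1 can be sketched roughly as below. This is a minimal illustration only: `stop_gracefully()` and `demo()` are hypothetical names, the sleeping child is a stand-in for a backgrounded pytest run, and the real codified version lives in the files referenced above.

```python
import signal
import subprocess
import sys


def stop_gracefully(
    proc: subprocess.Popen,
    grace: float = 5.0,
) -> int:
    '''
    Send SIGINT first, wait up to `grace` seconds, and only then
    escalate to SIGKILL. Returns the child's exit status.

    '''
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # graceful window expired: hard-kill as a last resort
        proc.kill()
        return proc.wait()


def demo() -> int:
    # hypothetical stand-in for a long-running bg pytest task
    proc = subprocess.Popen(
        [sys.executable, '-c', 'import time; time.sleep(60)'],
    )
    return stop_gracefully(proc, grace=2.0)
```

The point of the ordering is that SIGINT gives the target a chance to run its own teardown (for a tractor tree, the cancel cascade) before anything is force-killed; the sketch signals only the direct process.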
-This is a **suite-hygiene fix** — it doesn't close the
+The two symptoms are unrelated. The tree-kill / setpgrp
-actual leak; it just stops the leak from blast-radiusing.
+fix direction proposed earlier would not help (1) (SC-
-Zombie descendants still accumulate per run.
+graceful-cleanup is the right answer there) and would
 not help (2) (the hang is in the cancel cascade, not
 in the hard-kill fallback).
-## The real bug (unfixed)
+## Symptom
-`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
+Reproducer (py3.14, clean env):
 (plain `os.kill(SIGKILL)` to the direct child pid) +
 `proc.wait()` — does **not** reap grandchildren or
 deeper descendants. When a cancellation test causes a
 multi-level actor tree to tear down, the direct child
 dies but its own children survive and get reparented to
 init (PID 1), where they stay running with their
 inherited FDs (including the registry listen socket).
-**Symptom on repro:**
+```sh
 # preflight: ensure clean env
 ss -tlnp 2>/dev/null | grep ':1616' && echo 'FOUL — cleanup first!' || echo 'clean'
 ```
 $ ss -tlnp 2>/dev/null | grep ':1616'
 LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
    users:(("subint-forkserv",pid=211595,fd=17),
           ("subint-forkserv",pid=211585,fd=17),
           ("subint-forkserv",pid=211583,fd=17),
           ("subint-forkserv",pid=211576,fd=17),
           ("subint-forkserv",pid=211572,fd=17))
 $ for p in 211572 211576 211583 211585 211595; do
    cat /proc/$p/cmdline | tr '\0' ' '; echo; done
 ./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
-  tests/test_cancellation.py --timeout=30 --timeout-method=signal \
+  'tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]' \
-  --tb=no -q --no-header
+  --timeout=30 --timeout-method=thread --tb=short -v
 ... (x5, all same cmdline — inherited from fork)
 ```
-All 5 share the pytest cmdline because `os.fork()`
+Expected: `pytest-timeout` fires at 30s with a thread-
-without `exec()` preserves the parent's argv. Their
+dump banner, but the process itself **remains alive
-comm-name (`subint-forkserv`) is the `thread_name` we
+after timeout** and doesn't unwedge on subsequent
-pass to the fork-worker thread in
+SIGINT. Requires SIGKILL to reap.
 `tractor.spawn._subint_forkserver.fork_from_worker_thread`.
-## Why 5?
+## Evidence (tree structure at hang point)
-Not confirmed; guess is 5 = the parametrize cardinality
+All 5 processes are kernel-level `S` (sleeping) in
-of one of the leaky tests (e.g. `test_some_cancels_all`
+`do_epoll_wait` (trio's event loop waiting on I/O):
 has 5 parametrize cases). Each param-case spawns a
 nested tree; each leaks exactly one descendant. Worth
 verifying by running each parametrize-case individually
 and counting leaked procs per case.
-## Ruled out
+```
-
+PID     PPID    THREADS  NAME             ROLE
- **`:1616` collision from a different repo** (e.g.
+333986  1       2        subint-forkserv  pytest main (the test body)
-  piker): `/proc/$pid/cmdline` + `cwd` both resolve to
+333993  333986  3        subint-forkserv  "child 1" spawner subactor
-  the tractor repo's `py314/` venv for all 5. These are
+  334003 333993 1        subint-forkserv  grandchild errorer under child-1
-  definitively spawned by our test run.
+  334014 333993 1        subint-forkserv  grandchild errorer under child-1
- **Parent-side `_ForkedProc.wait()` regressed**: the
+333999  333986  1        subint-forkserv  "child 2" spawner subactor (NO grandchildren!)
  direct child's teardown completes cleanly (exit-code
  captured, `waitpid` returns); the 5 survivors are
  deeper-descendants whose parent-side shim has no
  handle on them. So the bug isn't in
  `_ForkedProc.wait()` — it's in the lack of tree-
  level descendant enumeration + reaping during nursery
  teardown.
 ## Likely fix directions
 1. **Process-group-scoped spawn + tree kill.** Put each
   forkserver-spawned subactor into its own process
   group (`os.setpgrp()` in the fork child), then on
   teardown `os.killpg(pgid, SIGKILL)` to reap the
   whole tree atomically. Simplest, most surgical.
 2. **Subreaper registration.** Use
   `PR_SET_CHILD_SUBREAPER` on the tractor root so
   orphaned grandchildren reparent to the root rather
   than init — then we can `waitpid` them from the
   parent-side nursery teardown. More invasive.
 3. **Explicit descendant enumeration at teardown.**
   In `subint_forkserver_proc`'s finally block, walk
   `/proc/<pid>/task/*/children` before issuing SIGKILL
   to build a descendant-pid set; then kill + reap all
   of them. Fragile (Linux-only, proc-fs-scan race).
 Vote: **(1)** — clean, POSIX-standard, aligns with how
 `subprocess.Popen` (and by extension `trio.lowlevel.
 open_process`) handle tree-kill semantics on
 kwargs-supplied `start_new_session=True`.
 ## Reproducer
 ```sh
 # before: ensure clean env
 ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
 # run the leaky tests
 ./py314/bin/python -m pytest \
  --spawn-backend=subint_forkserver \
  tests/test_cancellation.py \
  --timeout=30 --timeout-method=signal --tb=no -q --no-header
 # observe: 5 leaked children now holding :1616
 ss -tlnp 2>/dev/null | grep ':1616'
 ```
-Expected output: `subint-forkserv` processes listed as
+### Asymmetric tree depth
 listeners on `:1616`. Cleanup:
-```sh
+The test's `spawn_and_error(breadth=2, depth=3)` should
-pkill -9 -f \
+have BOTH direct children spawning 2 grandchildren
-  "$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
+each, going 3 levels deep. Reality:
 - Child 1 (333993, 3 threads) DID spawn its two
  grandchildren as expected — fully booted trio
  runtime.
 - Child 2 (333999, 1 thread) did NOT spawn any
  grandchildren — clearly never completed its
  nursery's first `run_in_actor`. Its 1-thread state
  suggests the runtime never fully booted (no trio
  worker threads for `waitpid`/IPC).
 This asymmetry is the key clue: the two direct
 children started identically but diverged. Probably a
 race around fork-inherited state (listener FDs,
 subactor-nursery channel state) that happens to land
 differently depending on spawn ordering.
 ### Parent-side state
 Thread-dump of pytest main (333986) at the hang:
 - Main trio thread — parked in
  `trio._core._io_epoll.get_events` (epoll_wait on
  its event loop). Waiting for IPC from children.
 - Two trio-cache worker threads — each parked in
  `outcome.capture(sync_fn)` calling
  `os.waitpid(child_pid, 0)`. These are our
  `_ForkedProc.wait()` off-loads. They're waiting for
  the direct children to exit — but children are
  stuck in their own epoll_wait waiting for IPC from
  the parent.
 **It's a deadlock, not a leak:** the parent is
 correctly running `soft_kill(proc, _ForkedProc.wait,
 portal)` (graceful IPC cancel via
 `Portal.cancel_actor()`), but the children never
 acknowledge the cancel message (or the message never
 reaches them through the tangled post-fork IPC).
 ## What's NOT the cause (ruled out)
 - **`_ForkedProc.kill()` only SIGKILLs direct pid /
  missing tree-kill**: doesn't apply — we never reach
  the hard-kill path. The deadlock is in the graceful
  cancel cascade.
 - **Port `:1616` contention**: ruled out after the
  `reg_addr` fixture-wiring fix; each test session
  gets a unique port now.
 - **GIL starvation / SIGINT pipe filling** (class-A,
  `subint_sigint_starvation_issue.md`): doesn't apply
  — each subactor is its own OS process with its own
  GIL (not legacy-config subint).
 - **Child-side `_trio_main` absorbing KBI**: grep
  confirmed; `_trio_main` only catches KBI at the
  `trio.run()` callsite, which is reached only if the
  trio loop exits normally. The children here never
  exit trio.run() — they're wedged inside.
 ## Hypothesis: FD inheritance across nested forks
 `subint_forkserver_proc` calls
 `fork_from_worker_thread()` which ultimately does
 `os.fork()` from a dedicated worker thread. Standard
 Linux/POSIX fork semantics: **the child inherits ALL
 open FDs from the parent**, including listener
 sockets, epoll fds, trio wakeup pipes, and the
 parent's IPC channel sockets.
 At root-actor fork-spawn time, the root's IPC server
 listener FDs are open in the parent. Those get
 inherited by child 1. Child 1 then forkserver-spawns
 its OWN subactor (grandchild). The grandchild
 inherits FDs from child 1 — but child 1's address
 space still contains **the root's IPC listener FDs
 too** (inherited at first fork). So the grandchild
 has THREE sets of FDs:
 1. Its own (created after becoming a subactor).
 2. Its direct parent child-1's.
 3. The ROOT's (grandparent's) — inherited transitively.
 IPC message routing may be ambiguous in this tangled
 state. Or a listener socket that the root thinks it
 owns is actually open in multiple processes, and
 messages sent to it go to an arbitrary one. That
 would exactly match the observed "graceful cancel
 never propagates".
 This hypothesis predicts the bug **scales with fork
 depth**: single-level forkserver spawn
 (`test_subint_forkserver_spawn_basic`) works
 perfectly, but any test that spawns a second level
 deadlocks. Matches observations so far.
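The transitive-inheritance part of the hypothesis is easy to check in isolation. The sketch below uses plain `os.fork()` and a throwaway TCP listener, with no tractor involved; `grandchild_sees_root_listener()` is a hypothetical name. It only shows that the FD survives two forks, not that this causes the hang.

```python
import os
import socket


def grandchild_sees_root_listener() -> bool:
    '''
    Fork twice (root -> child -> grandchild) and report whether
    the root's listener fd is still open in the grandchild.

    '''
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))  # ephemeral port
    listener.listen()
    lfd: int = listener.fileno()
    root_ino: int = os.fstat(lfd).st_ino

    rfd, wfd = os.pipe()  # result pipe back to the root
    child: int = os.fork()
    if child == 0:
        # "child 1": fork again without exec, like a nested spawner
        grandchild: int = os.fork()
        if grandchild == 0:
            try:
                # raises OSError if the fd is not open here
                same: bool = os.fstat(lfd).st_ino == root_ino
                os.write(wfd, b'1' if same else b'0')
            except OSError:
                os.write(wfd, b'0')
            os._exit(0)
        os.waitpid(grandchild, 0)
        os._exit(0)

    os.close(wfd)
    os.waitpid(child, 0)
    result: bytes = os.read(rfd, 1)
    os.close(rfd)
    listener.close()
    return result == b'1'
```

If this returns `True`, the grandchild holds an open reference to the root's listening socket even though it never created or asked for it.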
 ## Fix directions (to validate)
 ### 1. `close_fds=True` equivalent in `fork_from_worker_thread()`
 `subprocess.Popen` / `trio.lowlevel.open_process` have
 `close_fds=True` by default on POSIX — they
 enumerate open FDs in the child post-fork and close
 everything except stdio + any explicitly-passed FDs.
 Our raw `os.fork()` doesn't. Adding the equivalent to
 our `_worker` prelude would isolate each fork
 generation's FD set.
 Implementation sketch in
 `tractor.spawn._subint_forkserver.fork_from_worker_thread._worker`:
 ```python
 def _worker() -> None:
    pid: int = os.fork()
    if pid == 0:
        # CHILD: close inherited FDs except stdio + the
        # pid-pipe we just opened.
        keep: set[int] = {0, 1, 2, rfd, wfd}
        import resource
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        os.closerange(3, soft)  # blunt; or enumerate /proc/self/fd
        # ... then child_target() as before
 ```
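Note that the sketch above builds a `keep` set but the `os.closerange(3, soft)` call never consults it. One possible way to honor `keep` while still using `os.closerange()` is to close the gaps between the kept FDs. The following is an illustrative variant only; `close_all_except()` and `demo()` are hypothetical names, and the demo uses a small upper bound instead of the `RLIMIT_NOFILE` soft limit.

```python
import os


def close_all_except(keep: set[int], upper: int) -> None:
    '''
    Close every fd in `[0, upper)` that is not in `keep` by
    calling `os.closerange()` on the gaps between kept fds.

    '''
    start: int = 0
    for fd in sorted(keep):
        if fd >= upper:
            break
        os.closerange(start, fd)  # no-op when start >= fd
        start = fd + 1
    os.closerange(start, upper)


def demo() -> bytes:
    extra_r, extra_w = os.pipe()  # stand-in for an inherited FD
    rfd, wfd = os.pipe()          # result pipe the child keeps
    upper: int = max(extra_r, extra_w, rfd, wfd) + 1
    pid: int = os.fork()
    if pid == 0:
        close_all_except({0, 1, 2, wfd}, upper)
        try:
            os.fstat(extra_r)
            os.write(wfd, b'leaked')
        except OSError:
            os.write(wfd, b'closed')
        os._exit(0)

    os.close(wfd)
    os.waitpid(pid, 0)
    out: bytes = os.read(rfd, 16)
    for fd in (rfd, extra_r, extra_w):
        os.close(fd)
    return out
```

`os.closerange()` ignores errors on FDs that are not open, so the gaps can be closed without first enumerating what is actually open.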
 Problem: overly aggressive — closes FDs the
 grandchild might legitimately need (e.g. its parent's
 IPC channel for the spawn-spec handshake, if we rely
 on that). Needs thought about which FDs are
 "inheritable and safe" vs. "inherited by accident".
 ### 2. Cloexec on tractor's own FDs
 Set `FD_CLOEXEC` on tractor-created sockets (listener
 sockets, IPC channel sockets, pipes). This flag
 causes automatic close on `execve`, but since we
 `fork()` without `exec()`, this alone doesn't help.
 BUT — combined with a child-side explicit close-
 non-cloexec loop, it gives us a way to mark "my
 private FDs" vs. "safe to inherit". Most robust, but
 requires tractor-wide audit.
 ### 3. Explicit FD cleanup in `_ForkedProc`/`_child_target`
 Have `subint_forkserver_proc`'s `_child_target`
 closure explicitly close the parent-side IPC listener
 FDs before calling `_actor_child_main`. Requires
 being able to enumerate "the parent's listener FDs
 that the child shouldn't keep" — plausible via
 `Actor.ipc_server`'s socket objects.
 ### 4. Use `os.posix_spawn` with explicit `file_actions`
 Instead of raw `os.fork()`, use `os.posix_spawn()`
 which supports explicit file-action specifications
 (close this FD, dup2 that FD). Cleaner semantics, but
 probably incompatible with our "no exec" requirement
 (subint_forkserver is a fork-without-exec design).
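For reference, this is roughly what the `file_actions` interface looks like from Python. The sketch is not tractor code: `spawn_and_check()` and the `_CHECK` child program are hypothetical, and the example necessarily execs a fresh interpreter, which is exactly the property that conflicts with a fork-without-exec design.

```python
import os
import sys

# child program: exit 0 if the fd named in argv[1] is closed, else 1
_CHECK: str = (
    'import os, sys\n'
    'try:\n'
    '    os.fstat(int(sys.argv[1]))\n'
    'except OSError:\n'
    '    sys.exit(0)\n'
    'sys.exit(1)\n'
)


def spawn_and_check(close_in_child: bool) -> int:
    '''
    Spawn a fresh interpreter via `os.posix_spawn()` and return
    its exit code; optionally close one fd via `file_actions`.

    '''
    rfd, wfd = os.pipe()
    # Python fds are non-inheritable by default (PEP 446), so opt
    # in explicitly to make the inheritance visible across exec.
    os.set_inheritable(rfd, True)
    actions = (
        [(os.POSIX_SPAWN_CLOSE, rfd)]
        if close_in_child
        else None
    )
    pid: int = os.posix_spawn(
        sys.executable,
        [sys.executable, '-c', _CHECK, str(rfd)],
        os.environ,
        file_actions=actions,
    )
    _, status = os.waitpid(pid, 0)
    os.close(rfd)
    os.close(wfd)
    return os.waitstatus_to_exitcode(status)
```

With `close_in_child=True` the child should report the FD as closed (exit code 0); without the file action the inheritable FD stays open in the child (exit code 1).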
 **Likely correct answer: (3) — targeted FD cleanup
 via `actor.ipc_server` handle.** (1) is too blunt,
 (2) is too wide-ranging, (4) changes the spawn
 mechanism.
 ## Reproducer (standalone, no pytest)
 ```python
 # save as /tmp/forkserver_nested_hang_repro.py  (py3.14+)
 import trio, tractor
 async def assert_err():
    assert 0
 async def spawn_and_error(breadth: int = 2, depth: int = 1):
    async with tractor.open_nursery() as n:
        for i in range(breadth):
            if depth > 0:
                await n.run_in_actor(
                    spawn_and_error,
                    breadth=breadth,
                    depth=depth - 1,
                    name=f'spawner_{i}_{depth}',
                )
            else:
                await n.run_in_actor(
                    assert_err,
                    name=f'errorer_{i}',
                )
 async def _main():
    # the timeout must be entered INSIDE the trio run; a
    # cancel scope can't be entered outside of a trio task.
    with trio.fail_after(20):
        async with tractor.open_nursery() as n:
            for i in range(2):
                await n.run_in_actor(
                    spawn_and_error,
                    name=f'top_{i}',
                    breadth=2,
                    depth=1,
                )
 if __name__ == '__main__':
    from tractor.spawn._spawn import try_set_start_method
    try_set_start_method('subint_forkserver')
    trio.run(_main)
 ```
 Expected (current): hangs on `trio.fail_after(20)`
 — children never ack the error-propagation cancel
 cascade. Pattern: top 2 direct children, 4
 grandchildren, 1 errorer deadlocks while trying to
 unwind through its parent chain.
 After fix: `trio.TooSlowError`-free completion; the
 root's `open_nursery` receives the
 `BaseExceptionGroup` containing the `AssertionError`
 from the errorer and unwinds cleanly.
 ## Stopgap (landed)
 Until the fix lands, `test_nested_multierrors` +
 related multi-level-spawn tests can be skip-marked
 under `subint_forkserver` via
 `@pytest.mark.skipon_spawn_backend('subint_forkserver',
 reason='...')`. Cross-ref this doc.
 ## References
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
+- `tractor/spawn/_subint_forkserver.py::fork_from_worker_thread`
-  — the current teardown shim; PID-scoped, not tree-
+  — the primitive whose post-fork FD hygiene is
-  scoped.
+  probably the culprit.
 - `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
-  — the spawn backend whose `finally` block needs the
+  — the backend function that orchestrates the
-  tree-kill fix.
+  graceful cancel path hitting this bug.
- `tests/test_cancellation.py` — the surface where the
+- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
-  leak surfaces.
+  — the `trio.Process`-compatible shim; NOT the
  failing component (confirmed via thread-dump).
 - `tests/test_cancellation.py::test_nested_multierrors`
  — the test that surfaced the hang.
 - `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
-  — sibling tracker for a different forkserver-teardown
+  — sibling hang class; probably same underlying
-  class (orphaned child doesn't respond to SIGINT); may
+  fork-FD-inheritance root cause.
  share root cause with this one once the fix lands.
 - tractor issue #379 — subint backend tracking.
--- a/tractor/spawn/_subint_forkserver.py
+++ b/tractor/spawn/_subint_forkserver.py
@@ -195,6 +195,69 @@ except ImportError:
    _has_subints: bool = False
 def _close_inherited_fds(
    keep: frozenset[int] = frozenset({0, 1, 2}),
 ) -> int:
    '''
    Close every open file descriptor in the current process
    EXCEPT those in `keep` (default: stdio only).
    Intended as the first thing a post-`os.fork()` child runs
    after closing any communication pipes it knows about. This
    is the fork-child FD hygiene discipline that
    `subprocess.Popen(close_fds=True)` applies by default for
    its exec-based children, but which we have to implement
    ourselves because our `fork_from_worker_thread()` primitive
    deliberately does NOT exec.
    Why it matters
    --------------
    Without this, a forkserver-spawned subactor inherits the
    parent actor's IPC listener sockets, trio-epoll fd, trio
    wakeup-pipe, peer-channel sockets, etc. If that subactor
    then itself forkserver-spawns a grandchild, the grandchild
    inherits the FDs transitively from *both* its direct
    parent AND the root actor — IPC message routing becomes
    ambiguous and the cancel cascade deadlocks. See
    `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
    for the full diagnosis + the empirical repro.
    Fresh children will open their own IPC sockets via
    `_actor_child_main()`, so they don't need any of the
    parent's FDs.
    Returns the count of fds that were successfully closed —
    useful for sanity-check logging at callsites.
    '''
    # Enumerate open fds via `/proc/self/fd` on Linux (the fast +
    # precise path); fall back to `RLIMIT_NOFILE` range close on
    # other platforms. Matches stdlib
    # `subprocess._posixsubprocess.close_fds` strategy.
    try:
        fd_names: list[str] = os.listdir('/proc/self/fd')
        candidates: list[int] = [
            int(n) for n in fd_names if n.isdigit()
        ]
    except (FileNotFoundError, PermissionError):
        import resource
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        candidates = list(range(3, soft))
    closed: int = 0
    for fd in candidates:
        if fd in keep:
            continue
        try:
            os.close(fd)
            closed += 1
        except OSError:
            # fd was already closed (race with listdir) or
            # otherwise unclosable — either is fine.
            pass
    return closed
 def _format_child_exit(
    status: int,
 ) -> str:
@@ -302,9 +365,13 @@ def fork_from_worker_thread(
        pid: int = os.fork()
        if pid == 0:
            # CHILD: close the pid-pipe ends (we don't use
-            # them here), run the user callable if any, exit.
+            # them here), then scrub ALL other inherited FDs
            # so the child starts with a clean slate
            # (stdio-only). Critical for multi-level spawn
            # trees — see `_close_inherited_fds()` docstring.
            os.close(rfd)
            os.close(wfd)
            _close_inherited_fds()
            rc: int = 0
            if child_target is not None:
                try: