Use SIGINT-first ladder in `run-tests` cleanup

The previous cleanup recipe went straight to SIGTERM+SIGKILL, which hides bugs: tractor is structured concurrent — `_trio_main` catches SIGINT as an OS-cancel and cascades `Portal.cancel_actor` over IPC to every descendant. So a graceful SIGINT exercises the actual SC teardown path; if it hangs, that's a real bug to file (the forkserver `:1616` zombie was originally suspected to be one of these but turned out to be a teardown gap in `_ForkedProc.kill()` instead). Deats, - step 1: `pkill -INT` scoped to `$(pwd)/py*` — no sleep yet, just send the signal - step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to poll for exit. Loop breaks early on clean exit - step 3: `pkill -9` only if graceful timed out, w/ a logged escalation msg so it's obvious when SC teardown didn't complete - step 4: same SIGINT-first ladder for the rare `:1616`-holding zombie that doesn't match the cmdline pattern (find PID via `ss -tlnp`, then `kill -INT NNNN; sleep 1; kill -9 NNNN`) - steps 5-6: UDS-socket `rm -f` + re-verify unchanged Goal: surface real teardown bugs through the test- cleanup workflow instead of papering over them with `-9`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Wire `reg_addr` through leaky cancel tests
2026-04-23 14:46:16 -04:00 · 2026-04-23 14:37:48 -04:00 · 2026-04-23 13:58:42 -04:00
3 changed files with 207 additions and 13 deletions
--- a/.claude/skills/run-tests/SKILL.md
+++ b/.claude/skills/run-tests/SKILL.md
@ -249,22 +249,38 @@ ls -la /tmp/registry@*.sock 2>/dev/null \
  surface PIDs + cmdlines to the user, offer cleanup:

  ```sh
-  # 1. kill test zombies scoped to THIS repo's python only
-  #    (don't pkill by bare pattern — that'd nuke legit
-  #    long-running tractor apps like piker)
-  pkill -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
-  sleep 0.3
-  pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" 2>/dev/null
+  # 1. GRACEFUL FIRST (tractor is structured concurrent — it
+  #    catches SIGINT as an OS-cancel in `_trio_main` and
+  #    cascades Portal.cancel_actor via IPC to every descendant.
+  #    So always try SIGINT first with a bounded timeout; only
+  #    escalate to SIGKILL if graceful cleanup doesn't complete).
+  pkill -INT -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"

-  # 2. if a test zombie holds :1616 specifically and doesn't
+  # 2. bounded wait for graceful teardown (usually sub-second).
+  #    Loop until the processes exit, or timeout. Keep the
+  #    bound tight — hung/abrupt-killed descendants usually
+  #    hang forever, so don't wait more than a few seconds.
+  for i in $(seq 1 10); do
+    pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null || break
+    sleep 0.3
+  done
+
+  # 3. ESCALATE TO SIGKILL only if graceful didn't finish.
+  if pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null; then
+    echo 'graceful teardown timed out — escalating to SIGKILL'
+    pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
+  fi
+
+  # 4. if a test zombie holds :1616 specifically and doesn't
  #    match the above pattern, find its PID the hard way:
  ss -tlnp 2>/dev/null | grep ':1616'   # prints `users:(("<name>",pid=NNNN,...))`
-  # then: kill <NNNN>
+  # then (same SIGINT-first ladder):
+  # kill -INT <NNNN>; sleep 1; kill -9 <NNNN> 2>/dev/null

-  # 3. remove stale UDS sockets
+  # 5. remove stale UDS sockets
  rm -f /tmp/registry@*.sock

-  # 4. re-verify
+  # 6. re-verify
  ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 now free'
  ```

--- a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
+++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md
@ -0,0 +1,165 @@
+# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
+
+Follow-up tracker: surfaced while wiring the new
+`subint_forkserver` spawn backend into the full tractor
+test matrix (step 2 of the post-backend-lands plan;
+see also
+`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
+
+## TL;DR
+
+Running `tests/test_cancellation.py` under
+`--spawn-backend=subint_forkserver` reproducibly leaks
+**exactly 5 `subint-forkserv` comm-named child processes**
+after the pytest session exits. Both previously-run
+sessions produced the same 5-process signature — not a
+flake. Each leaked process holds a `LISTEN` on the
+default registry TCP addr (`127.0.0.1:1616`), which
+poisons any subsequent tractor test session that
+defaults to that addr.
+
+## Stopgap (not the real fix)
+
+Multiple tests in `test_cancellation.py` were calling
+`tractor.open_nursery()` **without** passing
+`registry_addrs=[reg_addr]`, i.e. falling back on the
+default `:1616`. The commit accompanying this doc wires
+the `reg_addr` fixture through those tests so each run
+gets a session-unique port — leaked zombies can no
+longer poison **other** tests (they hold their own
+unique port instead).
+
+Tests touched (in `tests/test_cancellation.py`):
+
+- `test_cancel_infinite_streamer`
+- `test_some_cancels_all`
+- `test_nested_multierrors`
+- `test_cancel_via_SIGINT`
+- `test_cancel_via_SIGINT_other_task`
+
+This is a **suite-hygiene fix** — it doesn't close the
+actual leak; it just stops the leak from blast-radiusing.
+Zombie descendants still accumulate per run.
+
+## The real bug (unfixed)
+
+`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
+(plain `os.kill(SIGKILL)` to the direct child pid) +
+`proc.wait()` — does **not** reap grandchildren or
+deeper descendants. When a cancellation test causes a
+multi-level actor tree to tear down, the direct child
+dies but its own children survive and get reparented to
+init (PID 1), where they stay running with their
+inherited FDs (including the registry listen socket).
+
+**Symptom on repro:**
+
+```
+$ ss -tlnp 2>/dev/null | grep ':1616'
+LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
+    users:(("subint-forkserv",pid=211595,fd=17),
+           ("subint-forkserv",pid=211585,fd=17),
+           ("subint-forkserv",pid=211583,fd=17),
+           ("subint-forkserv",pid=211576,fd=17),
+           ("subint-forkserv",pid=211572,fd=17))
+
+$ for p in 211572 211576 211583 211585 211595; do
+    cat /proc/$p/cmdline | tr '\0' ' '; echo; done
+./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
+  tests/test_cancellation.py --timeout=30 --timeout-method=signal \
+  --tb=no -q --no-header
+... (x5, all same cmdline — inherited from fork)
+```
+
+All 5 share the pytest cmdline because `os.fork()`
+without `exec()` preserves the parent's argv. Their
+comm-name (`subint-forkserv`) is the `thread_name` we
+pass to the fork-worker thread in
+`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
+
+## Why 5?
+
+Not confirmed; guess is 5 = the parametrize cardinality
+of one of the leaky tests (e.g. `test_some_cancels_all`
+has 5 parametrize cases). Each param-case spawns a
+nested tree; each leaks exactly one descendant. Worth
+verifying by running each parametrize-case individually
+and counting leaked procs per case.
+
+## Ruled out
+
+- **`:1616` collision from a different repo** (e.g.
+  piker): `/proc/$pid/cmdline` + `cwd` both resolve to
+  the tractor repo's `py314/` venv for all 5. These are
+  definitively spawned by our test run.
+- **Parent-side `_ForkedProc.wait()` regressed**: the
+  direct child's teardown completes cleanly (exit-code
+  captured, `waitpid` returns); the 5 survivors are
+  deeper-descendants whose parent-side shim has no
+  handle on them. So the bug isn't in
+  `_ForkedProc.wait()` — it's in the lack of tree-
+  level descendant enumeration + reaping during nursery
+  teardown.
+
+## Likely fix directions
+
+1. **Process-group-scoped spawn + tree kill.** Put each
+   forkserver-spawned subactor into its own process
+   group (`os.setpgrp()` in the fork child), then on
+   teardown `os.killpg(pgid, SIGKILL)` to reap the
+   whole tree atomically. Simplest, most surgical.
+2. **Subreaper registration.** Use
+   `PR_SET_CHILD_SUBREAPER` on the tractor root so
+   orphaned grandchildren reparent to the root rather
+   than init — then we can `waitpid` them from the
+   parent-side nursery teardown. More invasive.
+3. **Explicit descendant enumeration at teardown.**
+   In `subint_forkserver_proc`'s finally block, walk
+   `/proc/<pid>/task/*/children` before issuing SIGKILL
+   to build a descendant-pid set; then kill + reap all
+   of them. Fragile (Linux-only, proc-fs-scan race).
+
+Vote: **(1)** — clean, POSIX-standard, aligns with how
+`subprocess.Popen` (and by extension `trio.lowlevel.
+open_process`) handle tree-kill semantics on
+kwargs-supplied `start_new_session=True`.
+
+## Reproducer
+
+```sh
+# before: ensure clean env
+ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
+
+# run the leaky tests
+./py314/bin/python -m pytest \
+  --spawn-backend=subint_forkserver \
+  tests/test_cancellation.py \
+  --timeout=30 --timeout-method=signal --tb=no -q --no-header
+
+# observe: 5 leaked children now holding :1616
+ss -tlnp 2>/dev/null | grep ':1616'
+```
+
+Expected output: `subint-forkserv` processes listed as
+listeners on `:1616`. Cleanup:
+
+```sh
+pkill -9 -f \
+  "$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
+```
+
+## References
+
+- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
+  — the current teardown shim; PID-scoped, not tree-
+  scoped.
+- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
+  — the spawn backend whose `finally` block needs the
+  tree-kill fix.
+- `tests/test_cancellation.py` — the surface where the
+  leak surfaces.
+- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
+  — sibling tracker for a different forkserver-teardown
+  class (orphaned child doesn't respond to SIGINT); may
+  share root cause with this one once the fix lands.
+- tractor issue #379 — subint backend tracking.
--- a/tests/test_cancellation.py
+++ b/tests/test_cancellation.py
@ -275,7 +275,8 @@ async def stream_forever():
    timeout=6,
 )
 async def test_cancel_infinite_streamer(
-    start_method: str
+    reg_addr: tuple,
+    start_method: str,
 ):
    # stream for at most 1 seconds
    with (
@ -341,6 +342,7 @@ async def test_cancel_infinite_streamer(
 )
 async def test_some_cancels_all(
    num_actors_and_errs: tuple,
+    reg_addr: tuple,
    start_method: str,
    loglevel: str,
 ):
@ -450,8 +452,13 @@ async def spawn_and_error(
            await nursery.run_in_actor(*args, **kwargs)


+@pytest.mark.timeout(
+    10,
+    method='thread',
+)
@tractor_test
 async def test_nested_multierrors(
+    reg_addr: tuple,
    loglevel: str,
    start_method: str,
 ):
@ -541,6 +548,7 @@ async def test_nested_multierrors(

@no_windows
 def test_cancel_via_SIGINT(
+    reg_addr: tuple,
    loglevel: str,
    start_method: str,
 ):
@ -553,7 +561,9 @@ def test_cancel_via_SIGINT(

    async def main():
        with trio.fail_after(2):
-            async with tractor.open_nursery() as tn:
+            async with tractor.open_nursery(
+                registry_addrs=[reg_addr],
+            ) as tn:
                await tn.start_actor('sucka')
                if 'mp' in start_method:
                    time.sleep(0.1)
@ -566,6 +576,7 @@ def test_cancel_via_SIGINT(

@no_windows
 def test_cancel_via_SIGINT_other_task(
+    reg_addr: tuple,
    loglevel: str,
    start_method: str,
    spawn_backend: str,
@ -594,7 +605,9 @@ def test_cancel_via_SIGINT_other_task(
    async def spawn_and_sleep_forever(
        task_status=trio.TASK_STATUS_IGNORED
    ):
-        async with tractor.open_nursery() as tn:
+        async with tractor.open_nursery(
+            registry_addrs=[reg_addr],
+        ) as tn:
            for i in range(3):
                await tn.run_in_actor(
                    sleep_forever,