From e3f4f5a387811f10082c55828158cfd18e17b2eb Mon Sep 17 00:00:00 2001 From: goodboy Date: Thu, 23 Apr 2026 13:58:42 -0400 Subject: [PATCH] Add `subint_forkserver` test-cancellation leak doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New `ai/conc-anal/ subint_forkserver_test_cancellation_leak_issue.md` captures a descendant-leak surfaced while wiring `subint_forkserver` into the full test matrix: running `tests/test_cancellation.py` under `--spawn-backend=subint_forkserver` reproducibly leaks **exactly 5** `subint-forkserv` comm-named child processes that survive session exit, each holding a `LISTEN` on `:1616` (the tractor default registry addr) — and therefore poisons every subsequent test session that defaults to that addr. Deats, - TL;DR + ruled-out checks confirming the procs are ours (not piker / other tractor-embedding apps) — `/proc/$pid/cmdline` + cwd both resolve to this repo's `py314/` venv - root cause: `_ForkedProc.kill()` is PID-scoped (plain `os.kill(SIGKILL)` to the direct child), not tree-scoped — grandchildren spawned during a multi-level cancel test get reparented to init and inherit the registry listen socket - proposed fix directions ranked: (1) put each forkserver-spawned subactor in its own process- group (`os.setpgrp()` in fork-child) + tree-kill via `os.killpg(pgid, SIGKILL)` on teardown, (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit `/proc//task/*/children` walk. Vote: (1) — POSIX-standard, aligns w/ `start_new_session=True` semantics in `subprocess.Popen` / trio's `open_process` - inline reproducer + cleanup recipe scoped to `$(pwd)/py314/bin/python.*pytest.*spawn-backend= subint_forkserver` so cleanup doesn't false-flag unrelated tractor procs (consistent w/ `run-tests` skill's zombie-check guidance) Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky tests in `test_cancellation.py`) is incoming as a follow-up — that one stops the blast radius, but zombies still accumulate per-run until the real tree-kill fix lands. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code --- ...forkserver_test_cancellation_leak_issue.md | 165 ++++++++++++++++++ 1 file changed, 165 insertions(+) create mode 100644 ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md diff --git a/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md new file mode 100644 index 00000000..987b2b52 --- /dev/null +++ b/ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md @@ -0,0 +1,165 @@ +# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py` + +Follow-up tracker: surfaced while wiring the new +`subint_forkserver` spawn backend into the full tractor +test matrix (step 2 of the post-backend-lands plan; +see also +`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`). + +## TL;DR + +Running `tests/test_cancellation.py` under +`--spawn-backend=subint_forkserver` reproducibly leaks +**exactly 5 `subint-forkserv` comm-named child processes** +after the pytest session exits. Both previously-run +sessions produced the same 5-process signature — not a +flake. Each leaked process holds a `LISTEN` on the +default registry TCP addr (`127.0.0.1:1616`), which +poisons any subsequent tractor test session that +defaults to that addr. + +## Stopgap (not the real fix) + +Multiple tests in `test_cancellation.py` were calling +`tractor.open_nursery()` **without** passing +`registry_addrs=[reg_addr]`, i.e. falling back on the +default `:1616`. The commit accompanying this doc wires +the `reg_addr` fixture through those tests so each run +gets a session-unique port — leaked zombies can no +longer poison **other** tests (they hold their own +unique port instead). + +Tests touched (in `tests/test_cancellation.py`): + +- `test_cancel_infinite_streamer` +- `test_some_cancels_all` +- `test_nested_multierrors` +- `test_cancel_via_SIGINT` +- `test_cancel_via_SIGINT_other_task` + +This is a **suite-hygiene fix** — it doesn't close the +actual leak; it just stops the leak from blast-radiusing. +Zombie descendants still accumulate per run. + +## The real bug (unfixed) + +`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()` +(plain `os.kill(SIGKILL)` to the direct child pid) + +`proc.wait()` — does **not** reap grandchildren or +deeper descendants. When a cancellation test causes a +multi-level actor tree to tear down, the direct child +dies but its own children survive and get reparented to +init (PID 1), where they stay running with their +inherited FDs (including the registry listen socket). + +**Symptom on repro:** + +``` +$ ss -tlnp 2>/dev/null | grep ':1616' +LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \ + users:(("subint-forkserv",pid=211595,fd=17), + ("subint-forkserv",pid=211585,fd=17), + ("subint-forkserv",pid=211583,fd=17), + ("subint-forkserv",pid=211576,fd=17), + ("subint-forkserv",pid=211572,fd=17)) + +$ for p in 211572 211576 211583 211585 211595; do + cat /proc/$p/cmdline | tr '\0' ' '; echo; done +./py314/bin/python -m pytest --spawn-backend=subint_forkserver \ + tests/test_cancellation.py --timeout=30 --timeout-method=signal \ + --tb=no -q --no-header +... (x5, all same cmdline — inherited from fork) +``` + +All 5 share the pytest cmdline because `os.fork()` +without `exec()` preserves the parent's argv. Their +comm-name (`subint-forkserv`) is the `thread_name` we +pass to the fork-worker thread in +`tractor.spawn._subint_forkserver.fork_from_worker_thread`. + +## Why 5? + +Not confirmed; guess is 5 = the parametrize cardinality +of one of the leaky tests (e.g. `test_some_cancels_all` +has 5 parametrize cases). Each param-case spawns a +nested tree; each leaks exactly one descendant. Worth +verifying by running each parametrize-case individually +and counting leaked procs per case. + +## Ruled out + +- **`:1616` collision from a different repo** (e.g. + piker): `/proc/$pid/cmdline` + `cwd` both resolve to + the tractor repo's `py314/` venv for all 5. These are + definitively spawned by our test run. +- **Parent-side `_ForkedProc.wait()` regressed**: the + direct child's teardown completes cleanly (exit-code + captured, `waitpid` returns); the 5 survivors are + deeper-descendants whose parent-side shim has no + handle on them. So the bug isn't in + `_ForkedProc.wait()` — it's in the lack of tree- + level descendant enumeration + reaping during nursery + teardown. + +## Likely fix directions + +1. **Process-group-scoped spawn + tree kill.** Put each + forkserver-spawned subactor into its own process + group (`os.setpgrp()` in the fork child), then on + teardown `os.killpg(pgid, SIGKILL)` to reap the + whole tree atomically. Simplest, most surgical. +2. **Subreaper registration.** Use + `PR_SET_CHILD_SUBREAPER` on the tractor root so + orphaned grandchildren reparent to the root rather + than init — then we can `waitpid` them from the + parent-side nursery teardown. More invasive. +3. **Explicit descendant enumeration at teardown.** + In `subint_forkserver_proc`'s finally block, walk + `/proc//task/*/children` before issuing SIGKILL + to build a descendant-pid set; then kill + reap all + of them. Fragile (Linux-only, proc-fs-scan race). + +Vote: **(1)** — clean, POSIX-standard, aligns with how +`subprocess.Popen` (and by extension `trio.lowlevel. +open_process`) handle tree-kill semantics on +kwargs-supplied `start_new_session=True`. + +## Reproducer + +```sh +# before: ensure clean env +ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean' + +# run the leaky tests +./py314/bin/python -m pytest \ + --spawn-backend=subint_forkserver \ + tests/test_cancellation.py \ + --timeout=30 --timeout-method=signal --tb=no -q --no-header + +# observe: 5 leaked children now holding :1616 +ss -tlnp 2>/dev/null | grep ':1616' +``` + +Expected output: `subint-forkserv` processes listed as +listeners on `:1616`. Cleanup: + +```sh +pkill -9 -f \ + "$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver" +``` + +## References + +- `tractor/spawn/_subint_forkserver.py::_ForkedProc` + — the current teardown shim; PID-scoped, not tree- + scoped. +- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc` + — the spawn backend whose `finally` block needs the + tree-kill fix. +- `tests/test_cancellation.py` — the surface where the + leak surfaces. +- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` + — sibling tracker for a different forkserver-teardown + class (orphaned child doesn't respond to SIGINT); may + share root cause with this one once the fix lands. +- tractor issue #379 — subint backend tracking.