Compare commits

...

3 Commits

Author SHA1 Message Date
Gud Boi e5e2afb5f4 Use SIGINT-first ladder in `run-tests` cleanup
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).

Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
  sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
  `pgrep` to poll for exit. Loop breaks early on
  clean exit
- step 3: `pkill -9` only if graceful timed out, w/
  a logged escalation msg so it's obvious when SC
  teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
  `:1616`-holding zombie that doesn't match the
  cmdline pattern (find PID via `ss -tlnp`, then
  `kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
  unchanged

Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 14:46:16 -04:00
Gud Boi de6016763f Wire `reg_addr` through leaky cancel tests
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 14:37:48 -04:00
Gud Boi d0121960b9 Add `subint_forkserver` test-cancellation leak doc
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 13:58:42 -04:00
3 changed files with 207 additions and 13 deletions

View File

@ -249,22 +249,38 @@ ls -la /tmp/registry@*.sock 2>/dev/null \
surface PIDs + cmdlines to the user, offer cleanup:
```sh
# 1. kill test zombies scoped to THIS repo's python only
# (don't pkill by bare pattern — that'd nuke legit
# long-running tractor apps like piker)
pkill -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
sleep 0.3
pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" 2>/dev/null
# 1. GRACEFUL FIRST (tractor is structured concurrent — it
# catches SIGINT as an OS-cancel in `_trio_main` and
# cascades Portal.cancel_actor via IPC to every descendant.
# So always try SIGINT first with a bounded timeout; only
# escalate to SIGKILL if graceful cleanup doesn't complete).
pkill -INT -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
# 2. if a test zombie holds :1616 specifically and doesn't
# 2. bounded wait for graceful teardown (usually sub-second).
# Loop until the processes exit, or timeout. Keep the
# bound tight — hung/abrupt-killed descendants usually
# hang forever, so don't wait more than a few seconds.
for i in $(seq 1 10); do
pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null || break
sleep 0.3
done
# 3. ESCALATE TO SIGKILL only if graceful didn't finish.
if pgrep -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv" >/dev/null; then
echo 'graceful teardown timed out — escalating to SIGKILL'
pkill -9 -f "$(pwd)/py[0-9]*/bin/python.*_actor_child_main|subint-forkserv"
fi
# 4. if a test zombie holds :1616 specifically and doesn't
# match the above pattern, find its PID the hard way:
ss -tlnp 2>/dev/null | grep ':1616' # prints `users:(("<name>",pid=NNNN,...))`
# then: kill <NNNN>
# then (same SIGINT-first ladder):
# kill -INT <NNNN>; sleep 1; kill -9 <NNNN> 2>/dev/null
# 3. remove stale UDS sockets
# 5. remove stale UDS sockets
rm -f /tmp/registry@*.sock
# 4. re-verify
# 6. re-verify
ss -tlnp 2>/dev/null | grep ':1616' || echo 'TCP :1616 now free'
```

View File

@ -0,0 +1,165 @@
# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
Follow-up tracker: surfaced while wiring the new
`subint_forkserver` spawn backend into the full tractor
test matrix (step 2 of the post-backend-lands plan;
see also
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
## TL;DR
Running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly leaks
**exactly 5 `subint-forkserv` comm-named child processes**
after the pytest session exits. Both previously-run
sessions produced the same 5-process signature — not a
flake. Each leaked process holds a `LISTEN` on the
default registry TCP addr (`127.0.0.1:1616`), which
poisons any subsequent tractor test session that
defaults to that addr.
## Stopgap (not the real fix)
Multiple tests in `test_cancellation.py` were calling
`tractor.open_nursery()` **without** passing
`registry_addrs=[reg_addr]`, i.e. falling back on the
default `:1616`. The commit accompanying this doc wires
the `reg_addr` fixture through those tests so each run
gets a session-unique port — leaked zombies can no
longer poison **other** tests (they hold their own
unique port instead).
Tests touched (in `tests/test_cancellation.py`):
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
This is a **suite-hygiene fix** — it doesn't close the
actual leak; it just stops the leak from blast-radiusing.
Zombie descendants still accumulate per run.
## The real bug (unfixed)
`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
(plain `os.kill(SIGKILL)` to the direct child pid) +
`proc.wait()` — does **not** reap grandchildren or
deeper descendants. When a cancellation test causes a
multi-level actor tree to tear down, the direct child
dies but its own children survive and get reparented to
init (PID 1), where they stay running with their
inherited FDs (including the registry listen socket).
**Symptom on repro:**
```
$ ss -tlnp 2>/dev/null | grep ':1616'
LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
users:(("subint-forkserv",pid=211595,fd=17),
("subint-forkserv",pid=211585,fd=17),
("subint-forkserv",pid=211583,fd=17),
("subint-forkserv",pid=211576,fd=17),
("subint-forkserv",pid=211572,fd=17))
$ for p in 211572 211576 211583 211585 211595; do
cat /proc/$p/cmdline | tr '\0' ' '; echo; done
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
tests/test_cancellation.py --timeout=30 --timeout-method=signal \
--tb=no -q --no-header
... (x5, all same cmdline — inherited from fork)
```
All 5 share the pytest cmdline because `os.fork()`
without `exec()` preserves the parent's argv. Their
comm-name (`subint-forkserv`) is the `thread_name` we
pass to the fork-worker thread in
`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
## Why 5?
Not confirmed; guess is 5 = the parametrize cardinality
of one of the leaky tests (e.g. `test_some_cancels_all`
has 5 parametrize cases). Each param-case spawns a
nested tree; each leaks exactly one descendant. Worth
verifying by running each parametrize-case individually
and counting leaked procs per case.
## Ruled out
- **`:1616` collision from a different repo** (e.g.
piker): `/proc/$pid/cmdline` + `cwd` both resolve to
the tractor repo's `py314/` venv for all 5. These are
definitively spawned by our test run.
- **Parent-side `_ForkedProc.wait()` regressed**: the
direct child's teardown completes cleanly (exit-code
captured, `waitpid` returns); the 5 survivors are
deeper-descendants whose parent-side shim has no
handle on them. So the bug isn't in
`_ForkedProc.wait()` — it's in the lack of tree-
level descendant enumeration + reaping during nursery
teardown.
## Likely fix directions
1. **Process-group-scoped spawn + tree kill.** Put each
forkserver-spawned subactor into its own process
group (`os.setpgrp()` in the fork child), then on
teardown `os.killpg(pgid, SIGKILL)` to reap the
whole tree atomically. Simplest, most surgical.
2. **Subreaper registration.** Use
`PR_SET_CHILD_SUBREAPER` on the tractor root so
orphaned grandchildren reparent to the root rather
than init — then we can `waitpid` them from the
parent-side nursery teardown. More invasive.
3. **Explicit descendant enumeration at teardown.**
In `subint_forkserver_proc`'s finally block, walk
`/proc/<pid>/task/*/children` before issuing SIGKILL
to build a descendant-pid set; then kill + reap all
of them. Fragile (Linux-only, proc-fs-scan race).
Vote: **(1)** — clean, POSIX-standard, aligns with how
`subprocess.Popen` (and by extension `trio.lowlevel.
open_process`) handle tree-kill semantics on
kwargs-supplied `start_new_session=True`.
## Reproducer
```sh
# before: ensure clean env
ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
# run the leaky tests
./py314/bin/python -m pytest \
--spawn-backend=subint_forkserver \
tests/test_cancellation.py \
--timeout=30 --timeout-method=signal --tb=no -q --no-header
# observe: 5 leaked children now holding :1616
ss -tlnp 2>/dev/null | grep ':1616'
```
Expected output: `subint-forkserv` processes listed as
listeners on `:1616`. Cleanup:
```sh
pkill -9 -f \
"$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
```
## References
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
— the current teardown shim; PID-scoped, not tree-
scoped.
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
— the spawn backend whose `finally` block needs the
tree-kill fix.
- `tests/test_cancellation.py` — the surface where the
leak surfaces.
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
— sibling tracker for a different forkserver-teardown
class (orphaned child doesn't respond to SIGINT); may
share root cause with this one once the fix lands.
- tractor issue #379 — subint backend tracking.

View File

@ -275,7 +275,8 @@ async def stream_forever():
timeout=6,
)
async def test_cancel_infinite_streamer(
start_method: str
reg_addr: tuple,
start_method: str,
):
# stream for at most 1 seconds
with (
@ -341,6 +342,7 @@ async def test_cancel_infinite_streamer(
)
async def test_some_cancels_all(
num_actors_and_errs: tuple,
reg_addr: tuple,
start_method: str,
loglevel: str,
):
@ -450,8 +452,13 @@ async def spawn_and_error(
await nursery.run_in_actor(*args, **kwargs)
@pytest.mark.timeout(
10,
method='thread',
)
@tractor_test
async def test_nested_multierrors(
reg_addr: tuple,
loglevel: str,
start_method: str,
):
@ -541,6 +548,7 @@ async def test_nested_multierrors(
@no_windows
def test_cancel_via_SIGINT(
reg_addr: tuple,
loglevel: str,
start_method: str,
):
@ -553,7 +561,9 @@ def test_cancel_via_SIGINT(
async def main():
with trio.fail_after(2):
async with tractor.open_nursery() as tn:
async with tractor.open_nursery(
registry_addrs=[reg_addr],
) as tn:
await tn.start_actor('sucka')
if 'mp' in start_method:
time.sleep(0.1)
@ -566,6 +576,7 @@ def test_cancel_via_SIGINT(
@no_windows
def test_cancel_via_SIGINT_other_task(
reg_addr: tuple,
loglevel: str,
start_method: str,
spawn_backend: str,
@ -594,7 +605,9 @@ def test_cancel_via_SIGINT_other_task(
async def spawn_and_sleep_forever(
task_status=trio.TASK_STATUS_IGNORED
):
async with tractor.open_nursery() as tn:
async with tractor.open_nursery(
registry_addrs=[reg_addr],
) as tn:
for i in range(3):
await tn.run_in_actor(
sleep_forever,