Compare commits
No commits in common. "0cd0b633f1793b17f94ed0ba07c6540e44c60606" and "e5e2afb5f4890cb0c8f7f3bb38ad12429f4e0b59" have entirely different histories.
0cd0b633f1
...
e5e2afb5f4
|
|
@ -1,333 +1,165 @@
|
||||||
# `subint_forkserver` backend: `test_cancellation.py` multi-level cancel cascade hang
|
# `subint_forkserver` backend leaks subactor descendants in `test_cancellation.py`
|
||||||
|
|
||||||
Follow-up tracker: surfaced while wiring the new
|
Follow-up tracker: surfaced while wiring the new
|
||||||
`subint_forkserver` spawn backend into the full tractor
|
`subint_forkserver` spawn backend into the full tractor
|
||||||
test matrix (step 2 of the post-backend-lands plan).
|
test matrix (step 2 of the post-backend-lands plan;
|
||||||
See also
|
see also
|
||||||
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`).
|
||||||
— sibling tracker for a different forkserver-teardown
|
|
||||||
class which probably shares the same fundamental root
|
|
||||||
cause (fork-FD-inheritance across nested spawns).
|
|
||||||
|
|
||||||
## TL;DR
|
## TL;DR
|
||||||
|
|
||||||
`tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]`
|
Running `tests/test_cancellation.py` under
|
||||||
hangs indefinitely under our new backend. The hang is
|
`--spawn-backend=subint_forkserver` reproducibly leaks
|
||||||
**inside the graceful IPC cancel cascade** — every actor
|
**exactly 5 `subint-forkserv` comm-named child processes**
|
||||||
in the multi-level tree parks in `epoll_wait` waiting
|
after the pytest session exits. Both previously-run
|
||||||
for IPC messages that never arrive. Not a hard-kill /
|
sessions produced the same 5-process signature — not a
|
||||||
tree-reap issue (we don't reach the hard-kill fallback
|
flake. Each leaked process holds a `LISTEN` on the
|
||||||
path at all).
|
default registry TCP addr (`127.0.0.1:1616`), which
|
||||||
|
poisons any subsequent tractor test session that
|
||||||
|
defaults to that addr.
|
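A preflight for the poisoned-port condition described above can also be done from Python instead of `ss`. This is only a sketch: `port_in_use` is a hypothetical helper, not part of tractor or its test suite, and it only detects that *something* accepts connections on the address, not which process holds it.

```python
import socket


def port_in_use(
    host: str = "127.0.0.1",
    port: int = 1616,
    timeout: float = 0.5,
) -> bool:
    """Return True if something accepts TCP connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # `connect_ex()` returns 0 on success and an errno otherwise,
        # so a refused connection does not raise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("FOUL, cleanup first!" if port_in_use() else "clean")
```

A `True` result before a test session starts suggests a leaked listener from an earlier run is still alive.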
||||||
|
|
||||||
Working hypothesis (unverified): **`os.fork()` from a
|
## Stopgap (not the real fix)
|
||||||
subactor inherits the root parent's IPC listener socket
|
|
||||||
FDs**. When a first-level subactor forkserver-spawns a
|
|
||||||
grandchild, that grandchild inherits both its direct
|
|
||||||
spawner's FDs AND the root's FDs — IPC message routing
|
|
||||||
becomes ambiguous (or silently sends to the wrong
|
|
||||||
channel), so the cancel cascade can't reach its target.
|
|
||||||
|
|
||||||
## Corrected diagnosis vs. earlier draft
|
Multiple tests in `test_cancellation.py` were calling
|
||||||
|
`tractor.open_nursery()` **without** passing
|
||||||
|
`registry_addrs=[reg_addr]`, i.e. falling back on the
|
||||||
|
default `:1616`. The commit accompanying this doc wires
|
||||||
|
the `reg_addr` fixture through those tests so each run
|
||||||
|
gets a session-unique port — leaked zombies can no
|
||||||
|
longer poison **other** tests (they hold their own
|
||||||
|
unique port instead).
|
||||||
|
|
||||||
An earlier version of this doc claimed the root cause
|
Tests touched (in `tests/test_cancellation.py`):
|
||||||
was **"forkserver teardown doesn't tree-kill
|
|
||||||
descendants"** (SIGKILL only reaches the direct child,
|
|
||||||
grandchildren survive and hold TCP `:1616`). That
|
|
||||||
diagnosis was **wrong**, caused by conflating two
|
|
||||||
observations:
|
|
||||||
|
|
||||||
1. *5-zombie leak holding :1616* — happened in my own
|
- `test_cancel_infinite_streamer`
|
||||||
workflow when I aborted a bg pytest task with
|
- `test_some_cancels_all`
|
||||||
`pkill` (SIGTERM/SIGKILL, not SIGINT). The abrupt
|
- `test_nested_multierrors`
|
||||||
kill skipped the graceful `ActorNursery.__aexit__`
|
- `test_cancel_via_SIGINT`
|
||||||
cancel cascade entirely, orphaning descendants to
|
- `test_cancel_via_SIGINT_other_task`
|
||||||
init. **This was my cleanup bug, not a forkserver
|
|
||||||
teardown bug.** Codified the fix (SIGINT-first +
|
|
||||||
bounded wait before SIGKILL) in
|
|
||||||
`feedback_sc_graceful_cancel_first.md` +
|
|
||||||
`.claude/skills/run-tests/SKILL.md`.
|
|
||||||
2. *`test_nested_multierrors` hangs indefinitely* —
|
|
||||||
the real, separate, forkserver-specific bug
|
|
||||||
captured by this doc.
|
|
||||||
|
|
||||||
The two symptoms are unrelated. The tree-kill / setpgrp
|
This is a **suite-hygiene fix** — it doesn't close the
|
||||||
fix direction proposed earlier would not help (1) (SC-
|
actual leak; it just stops the leak from blast-radiusing.
|
||||||
graceful-cleanup is the right answer there) and would
|
Zombie descendants still accumulate per run.
|
||||||
not help (2) (the hang is in the cancel cascade, not
|
|
||||||
in the hard-kill fallback).
|
|
||||||
|
|
||||||
## Symptom
|
## The real bug (unfixed)
|
||||||
|
|
||||||
Reproducer (py3.14, clean env):
|
`subint_forkserver_proc`'s teardown — `_ForkedProc.kill()`
|
||||||
|
(plain `os.kill(SIGKILL)` to the direct child pid) +
|
||||||
|
`proc.wait()` — does **not** reap grandchildren or
|
||||||
|
deeper descendants. When a cancellation test causes a
|
||||||
|
multi-level actor tree to tear down, the direct child
|
||||||
|
dies but its own children survive and get reparented to
|
||||||
|
init (PID 1), where they stay running with their
|
||||||
|
inherited FDs (including the registry listen socket).
|
||||||
|
|
||||||
|
**Symptom on repro:**
|
||||||
|
|
||||||
|
```
|
||||||
|
$ ss -tlnp 2>/dev/null | grep ':1616'
|
||||||
|
LISTEN 0 4096 127.0.0.1:1616 0.0.0.0:* \
|
||||||
|
users:(("subint-forkserv",pid=211595,fd=17),
|
||||||
|
("subint-forkserv",pid=211585,fd=17),
|
||||||
|
("subint-forkserv",pid=211583,fd=17),
|
||||||
|
("subint-forkserv",pid=211576,fd=17),
|
||||||
|
("subint-forkserv",pid=211572,fd=17))
|
||||||
|
|
||||||
|
$ for p in 211572 211576 211583 211585 211595; do
|
||||||
|
cat /proc/$p/cmdline | tr '\0' ' '; echo; done
|
||||||
|
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
|
||||||
|
tests/test_cancellation.py --timeout=30 --timeout-method=signal \
|
||||||
|
--tb=no -q --no-header
|
||||||
|
... (x5, all same cmdline — inherited from fork)
|
||||||
|
```
|
||||||
|
|
||||||
|
All 5 share the pytest cmdline because `os.fork()`
|
||||||
|
without `exec()` preserves the parent's argv. Their
|
||||||
|
comm-name (`subint-forkserv`) is the `thread_name` we
|
||||||
|
pass to the fork-worker thread in
|
||||||
|
`tractor.spawn._subint_forkserver.fork_from_worker_thread`.
|
||||||
|
|
||||||
|
## Why 5?
|
||||||
|
|
||||||
|
Not confirmed; guess is 5 = the parametrize cardinality
|
||||||
|
of one of the leaky tests (e.g. `test_some_cancels_all`
|
||||||
|
has 5 parametrize cases). Each param-case spawns a
|
||||||
|
nested tree; each leaks exactly one descendant. Worth
|
||||||
|
verifying by running each parametrize-case individually
|
||||||
|
and counting leaked procs per case.
|
||||||
|
|
||||||
|
## Ruled out
|
||||||
|
|
||||||
|
- **`:1616` collision from a different repo** (e.g.
|
||||||
|
piker): `/proc/$pid/cmdline` + `cwd` both resolve to
|
||||||
|
the tractor repo's `py314/` venv for all 5. These are
|
||||||
|
definitively spawned by our test run.
|
||||||
|
- **Parent-side `_ForkedProc.wait()` regressed**: the
|
||||||
|
direct child's teardown completes cleanly (exit-code
|
||||||
|
captured, `waitpid` returns); the 5 survivors are
|
||||||
|
deeper-descendants whose parent-side shim has no
|
||||||
|
handle on them. So the bug isn't in
|
||||||
|
`_ForkedProc.wait()` — it's in the lack of tree-
|
||||||
|
level descendant enumeration + reaping during nursery
|
||||||
|
teardown.
|
||||||
|
|
||||||
|
## Likely fix directions
|
||||||
|
|
||||||
|
1. **Process-group-scoped spawn + tree kill.** Put each
|
||||||
|
forkserver-spawned subactor into its own process
|
||||||
|
group (`os.setpgrp()` in the fork child), then on
|
||||||
|
teardown `os.killpg(pgid, SIGKILL)` to reap the
|
||||||
|
whole tree atomically. Simplest, most surgical.
|
||||||
|
2. **Subreaper registration.** Use
|
||||||
|
`PR_SET_CHILD_SUBREAPER` on the tractor root so
|
||||||
|
orphaned grandchildren reparent to the root rather
|
||||||
|
than init — then we can `waitpid` them from the
|
||||||
|
parent-side nursery teardown. More invasive.
|
||||||
|
3. **Explicit descendant enumeration at teardown.**
|
||||||
|
In `subint_forkserver_proc`'s finally block, walk
|
||||||
|
`/proc/<pid>/task/*/children` before issuing SIGKILL
|
||||||
|
to build a descendant-pid set; then kill + reap all
|
||||||
|
of them. Fragile (Linux-only, proc-fs-scan race).
|
||||||
|
|
||||||
|
Vote: **(1)**. It is clean, POSIX-standard, and matches how
|
||||||
|
`subprocess.Popen` (and by extension
|
||||||
|
`trio.lowlevel.open_process`) isolate a child tree when
|
||||||
|
passed `start_new_session=True`.
|
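Fix direction (1) can be sketched with plain `os` calls. This is a minimal illustration under stated assumptions, not tractor code: `spawn_in_own_group`, `kill_tree`, `_child_with_grandchild` and `demo` are hypothetical names, and the pipe is only there so the demo can observe that every descendant is gone.

```python
import os
import select
import signal
import time


def spawn_in_own_group(child_target) -> int:
    """Fork a child that leads a fresh process group (pgid == its pid)."""
    pid = os.fork()
    if pid == 0:
        os.setpgrp()  # descendants forked after this inherit the group
        try:
            child_target()
        finally:
            os._exit(0)
    try:
        # Also set it from the parent so we never signal before the
        # child has run `setpgrp()` itself.
        os.setpgid(pid, pid)
    except OSError:
        pass  # child already did it, or already exited
    return pid


def kill_tree(pid: int) -> None:
    """SIGKILL the whole group led by `pid`, then reap the direct child."""
    try:
        os.killpg(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    os.waitpid(pid, 0)  # deeper descendants are reaped by init


def _child_with_grandchild(wfd: int) -> None:
    if os.fork() == 0:
        os.write(wfd, b"ready")  # grandchild announces itself
        time.sleep(60)
        os._exit(0)
    time.sleep(60)


def demo(timeout: float = 5.0) -> bool:
    rfd, wfd = os.pipe()
    pid = spawn_in_own_group(lambda: _child_with_grandchild(wfd))
    os.close(wfd)
    assert os.read(rfd, 5) == b"ready"  # grandchild is running
    kill_tree(pid)
    # EOF means every holder of the write end (child and grandchild)
    # has died, even though we only ever knew the direct child's pid.
    readable, _, _ = select.select([rfd], [], [], timeout)
    gone = bool(readable) and os.read(rfd, 1) == b""
    os.close(rfd)
    return gone
```

The parent can only `waitpid` its direct child; the killed grandchild is reparented and reaped elsewhere, which is why the demo checks pipe EOF rather than probing the grandchild's pid.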
||||||
|
|
||||||
|
## Reproducer
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
# preflight: ensure clean env
|
# before: ensure clean env
|
||||||
ss -tlnp 2>/dev/null | grep ':1616' && echo 'FOUL — cleanup first!' || echo 'clean'
|
ss -tlnp 2>/dev/null | grep ':1616' || echo 'clean'
|
||||||
|
|
||||||
./py314/bin/python -m pytest --spawn-backend=subint_forkserver \
|
# run the leaky tests
|
||||||
'tests/test_cancellation.py::test_nested_multierrors[subint_forkserver]' \
|
./py314/bin/python -m pytest \
|
||||||
--timeout=30 --timeout-method=thread --tb=short -v
|
--spawn-backend=subint_forkserver \
|
||||||
|
tests/test_cancellation.py \
|
||||||
|
--timeout=30 --timeout-method=signal --tb=no -q --no-header
|
||||||
|
|
||||||
|
# observe: 5 leaked children now holding :1616
|
||||||
|
ss -tlnp 2>/dev/null | grep ':1616'
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected: `pytest-timeout` fires at 30s with a thread-
|
Expected output: `subint-forkserv` processes listed as
|
||||||
dump banner, but the process itself **remains alive
|
listeners on `:1616`. Cleanup:
|
||||||
after timeout** and doesn't unwedge on subsequent
|
|
||||||
SIGINT. Requires SIGKILL to reap.
|
|
||||||
|
|
||||||
## Evidence (tree structure at hang point)
|
|
||||||
|
|
||||||
All 5 processes are kernel-level `S` (sleeping) in
|
|
||||||
`do_epoll_wait` (trio's event loop waiting on I/O):
|
|
||||||
|
|
||||||
|
```sh
|
||||||
|
pkill -9 -f \
|
||||||
|
"$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver"
|
||||||
```
|
```
|
||||||
PID PPID THREADS NAME ROLE
|
|
||||||
333986 1 2 subint-forkserv pytest main (the test body)
|
|
||||||
333993 333986 3 subint-forkserv "child 1" spawner subactor
|
|
||||||
334003 333993 1 subint-forkserv grandchild errorer under child-1
|
|
||||||
334014 333993 1 subint-forkserv grandchild errorer under child-1
|
|
||||||
333999 333986 1 subint-forkserv "child 2" spawner subactor (NO grandchildren!)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Asymmetric tree depth
|
|
||||||
|
|
||||||
The test's `spawn_and_error(breadth=2, depth=3)` should
|
|
||||||
have BOTH direct children spawning 2 grandchildren
|
|
||||||
each, going 3 levels deep. Reality:
|
|
||||||
|
|
||||||
- Child 1 (333993, 3 threads) DID spawn its two
|
|
||||||
grandchildren as expected — fully booted trio
|
|
||||||
runtime.
|
|
||||||
- Child 2 (333999, 1 thread) did NOT spawn any
|
|
||||||
grandchildren — clearly never completed its
|
|
||||||
nursery's first `run_in_actor`. Its 1-thread state
|
|
||||||
suggests the runtime never fully booted (no trio
|
|
||||||
worker threads for `waitpid`/IPC).
|
|
||||||
|
|
||||||
This asymmetry is the key clue: the two direct
|
|
||||||
children started identically but diverged. Probably a
|
|
||||||
race around fork-inherited state (listener FDs,
|
|
||||||
subactor-nursery channel state) that happens to land
|
|
||||||
differently depending on spawn ordering.
|
|
||||||
|
|
||||||
### Parent-side state
|
|
||||||
|
|
||||||
Thread-dump of pytest main (333986) at the hang:
|
|
||||||
|
|
||||||
- Main trio thread — parked in
|
|
||||||
`trio._core._io_epoll.get_events` (epoll_wait on
|
|
||||||
its event loop). Waiting for IPC from children.
|
|
||||||
- Two trio-cache worker threads — each parked in
|
|
||||||
`outcome.capture(sync_fn)` calling
|
|
||||||
`os.waitpid(child_pid, 0)`. These are our
|
|
||||||
`_ForkedProc.wait()` off-loads. They're waiting for
|
|
||||||
the direct children to exit — but children are
|
|
||||||
stuck in their own epoll_wait waiting for IPC from
|
|
||||||
the parent.
|
|
||||||
|
|
||||||
**It's a deadlock, not a leak:** the parent is
|
|
||||||
correctly running `soft_kill(proc, _ForkedProc.wait,
|
|
||||||
portal)` (graceful IPC cancel via
|
|
||||||
`Portal.cancel_actor()`), but the children never
|
|
||||||
acknowledge the cancel message (or the message never
|
|
||||||
reaches them through the tangled post-fork IPC).
|
|
||||||
|
|
||||||
## What's NOT the cause (ruled out)
|
|
||||||
|
|
||||||
- **`_ForkedProc.kill()` only SIGKILLs direct pid /
|
|
||||||
missing tree-kill**: doesn't apply — we never reach
|
|
||||||
the hard-kill path. The deadlock is in the graceful
|
|
||||||
cancel cascade.
|
|
||||||
- **Port `:1616` contention**: ruled out after the
|
|
||||||
`reg_addr` fixture-wiring fix; each test session
|
|
||||||
gets a unique port now.
|
|
||||||
- **GIL starvation / SIGINT pipe filling** (class-A,
|
|
||||||
`subint_sigint_starvation_issue.md`): doesn't apply
|
|
||||||
— each subactor is its own OS process with its own
|
|
||||||
GIL (not legacy-config subint).
|
|
||||||
- **Child-side `_trio_main` absorbing KBI**: grep
|
|
||||||
confirmed; `_trio_main` only catches KBI at the
|
|
||||||
`trio.run()` callsite, which is reached only if the
|
|
||||||
trio loop exits normally. The children here never
|
|
||||||
exit trio.run() — they're wedged inside.
|
|
||||||
|
|
||||||
## Hypothesis: FD inheritance across nested forks
|
|
||||||
|
|
||||||
`subint_forkserver_proc` calls
|
|
||||||
`fork_from_worker_thread()` which ultimately does
|
|
||||||
`os.fork()` from a dedicated worker thread. Standard
|
|
||||||
Linux/POSIX fork semantics: **the child inherits ALL
|
|
||||||
open FDs from the parent**, including listener
|
|
||||||
sockets, epoll fds, trio wakeup pipes, and the
|
|
||||||
parent's IPC channel sockets.
|
|
||||||
|
|
||||||
At root-actor fork-spawn time, the root's IPC server
|
|
||||||
listener FDs are open in the parent. Those get
|
|
||||||
inherited by child 1. Child 1 then forkserver-spawns
|
|
||||||
its OWN subactor (grandchild). The grandchild
|
|
||||||
inherits FDs from child 1 — but child 1's address
|
|
||||||
space still contains **the root's IPC listener FDs
|
|
||||||
too** (inherited at first fork). So the grandchild
|
|
||||||
has THREE sets of FDs:
|
|
||||||
|
|
||||||
1. Its own (created after becoming a subactor).
|
|
||||||
2. Its direct parent child-1's.
|
|
||||||
3. The ROOT's (grandparent's) — inherited transitively.
|
|
||||||
|
|
||||||
IPC message routing may be ambiguous in this tangled
|
|
||||||
state. Or a listener socket that the root thinks it
|
|
||||||
owns is actually open in multiple processes, and
|
|
||||||
messages sent to it go to an arbitrary one. That
|
|
||||||
would exactly match the observed "graceful cancel
|
|
||||||
never propagates".
|
|
||||||
|
|
||||||
This hypothesis predicts the bug **scales with fork
|
|
||||||
depth**: single-level forkserver spawn
|
|
||||||
(`test_subint_forkserver_spawn_basic`) works
|
|
||||||
perfectly, but any test that spawns a second level
|
|
||||||
deadlocks. Matches observations so far.
|
|
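The inheritance half of this hypothesis is easy to observe in isolation. The sketch below uses only the stdlib; `fork_inherits_listener` is a hypothetical name, and it demonstrates just that a fork-without-exec child still holds the parent's listener FD. It does not show that this causes the cancel-cascade hang.

```python
import os
import socket


def fork_inherits_listener() -> tuple[bool, bool]:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick
    listener.listen()
    fd = listener.fileno()
    rfd, wfd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # CHILD: no exec happened, so `fd` still refers to the listener
        # (close-on-exec flags only take effect at exec time).
        try:
            os.fstat(fd)
            inherited = b"1"
        except OSError:
            inherited = b"0"
        os.close(fd)  # drops only this process's copy
        os.write(wfd, inherited)
        os._exit(0)

    os.close(wfd)
    child_had_fd = os.read(rfd, 1) == b"1"
    os.close(rfd)
    os.waitpid(pid, 0)

    # The parent's copy is unaffected by the child's close().
    with socket.create_connection(listener.getsockname(), timeout=2):
        parent_still_listening = True
    listener.close()
    return child_had_fd, parent_still_listening
```

Because each process holds its own reference to the socket, a child can close its inherited copy without disturbing the parent's listener.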
||||||
|
|
||||||
## Fix directions (to validate)
|
|
||||||
|
|
||||||
### 1. `close_fds=True` equivalent in `fork_from_worker_thread()`
|
|
||||||
|
|
||||||
`subprocess.Popen` / `trio.lowlevel.open_process` have
|
|
||||||
`close_fds=True` by default on POSIX — they
|
|
||||||
enumerate open FDs in the child post-fork and close
|
|
||||||
everything except stdio + any explicitly-passed FDs.
|
|
||||||
Our raw `os.fork()` doesn't. Adding the equivalent to
|
|
||||||
our `_worker` prelude would isolate each fork
|
|
||||||
generation's FD set.
|
|
||||||
|
|
||||||
Implementation sketch in
|
|
||||||
`tractor.spawn._subint_forkserver.fork_from_worker_thread._worker`:
|
|
||||||
|
|
||||||
```python
|
|
||||||
def _worker() -> None:
|
|
||||||
    pid: int = os.fork()
|
|
||||||
    if pid == 0:
|
|
||||||
        # CHILD: close inherited FDs except stdio + the
|
|
||||||
        # pid-pipe we just opened.
|
|
||||||
        keep: set[int] = {0, 1, 2, rfd, wfd}
|
|
||||||
        import resource
|
|
||||||
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
|
|
||||||
        # blunt; or enumerate /proc/self/fd
|
|
||||||
        for fd in range(3, soft):
|
|
||||||
            if fd not in keep:
|
|
||||||
                try:
|
|
||||||
                    os.close(fd)
|
|
||||||
                except OSError:
|
|
||||||
                    pass  # fd was not open
|
|
||||||
        # ... then child_target() as before
|
|
||||||
```
|
|
||||||
|
|
||||||
Problem: overly aggressive — closes FDs the
|
|
||||||
grandchild might legitimately need (e.g. its parent's
|
|
||||||
IPC channel for the spawn-spec handshake, if we rely
|
|
||||||
on that). Needs thought about which FDs are
|
|
||||||
"inheritable and safe" vs. "inherited by accident".
|
|
||||||
|
|
||||||
### 2. Cloexec on tractor's own FDs
|
|
||||||
|
|
||||||
Set `FD_CLOEXEC` on tractor-created sockets (listener
|
|
||||||
sockets, IPC channel sockets, pipes). This flag
|
|
||||||
causes automatic close on `execve`, but since we
|
|
||||||
`fork()` without `exec()`, this alone doesn't help.
|
|
||||||
BUT — combined with a child-side explicit close-
|
|
||||||
non-cloexec loop, it gives us a way to mark "my
|
|
||||||
private FDs" vs. "safe to inherit". Most robust, but
|
|
||||||
requires tractor-wide audit.
|
|
||||||
|
|
||||||
### 3. Explicit FD cleanup in `_ForkedProc`/`_child_target`
|
|
||||||
|
|
||||||
Have `subint_forkserver_proc`'s `_child_target`
|
|
||||||
closure explicitly close the parent-side IPC listener
|
|
||||||
FDs before calling `_actor_child_main`. Requires
|
|
||||||
being able to enumerate "the parent's listener FDs
|
|
||||||
that the child shouldn't keep" — plausible via
|
|
||||||
`Actor.ipc_server`'s socket objects.
|
|
||||||
|
|
||||||
### 4. Use `os.posix_spawn` with explicit `file_actions`
|
|
||||||
|
|
||||||
Instead of raw `os.fork()`, use `os.posix_spawn()`
|
|
||||||
which supports explicit file-action specifications
|
|
||||||
(close this FD, dup2 that FD). Cleaner semantics, but
|
|
||||||
probably incompatible with our "no exec" requirement
|
|
||||||
(subint_forkserver is a fork-without-exec design).
|
|
||||||
|
|
||||||
**Likely correct answer: (3) — targeted FD cleanup
|
|
||||||
via `actor.ipc_server` handle.** (1) is too blunt,
|
|
||||||
(2) is too wide-ranging, (4) changes the spawn
|
|
||||||
mechanism.
|
|
||||||
|
|
||||||
## Reproducer (standalone, no pytest)
|
|
||||||
|
|
||||||
```python
|
|
||||||
# save as /tmp/forkserver_nested_hang_repro.py (py3.14+)
|
|
||||||
import trio, tractor
|
|
||||||
|
|
||||||
async def assert_err():
|
|
||||||
    assert 0
|
|
||||||
|
|
||||||
async def spawn_and_error(breadth: int = 2, depth: int = 1):
|
|
||||||
    async with tractor.open_nursery() as n:
|
|
||||||
        for i in range(breadth):
|
|
||||||
            if depth > 0:
|
|
||||||
                await n.run_in_actor(
|
|
||||||
                    spawn_and_error,
|
|
||||||
                    breadth=breadth,
|
|
||||||
                    depth=depth - 1,
|
|
||||||
                    name=f'spawner_{i}_{depth}',
|
|
||||||
                )
|
|
||||||
            else:
|
|
||||||
                await n.run_in_actor(
|
|
||||||
                    assert_err,
|
|
||||||
                    name=f'errorer_{i}',
|
|
||||||
                )
|
|
||||||
|
|
||||||
async def _main():
|
|
||||||
    # `trio.fail_after()` must be entered inside the trio loop
|
|
||||||
    with trio.fail_after(20):
|
|
||||||
        async with tractor.open_nursery() as n:
|
|
||||||
            for i in range(2):
|
|
||||||
                await n.run_in_actor(
|
|
||||||
                    spawn_and_error,
|
|
||||||
                    name=f'top_{i}',
|
|
||||||
                    breadth=2,
|
|
||||||
                    depth=1,
|
|
||||||
                )
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
    from tractor.spawn._spawn import try_set_start_method
|
|
||||||
    try_set_start_method('subint_forkserver')
|
|
||||||
    trio.run(_main)
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected (current): hangs on `trio.fail_after(20)`
|
|
||||||
— children never ack the error-propagation cancel
|
|
||||||
cascade. Pattern: top 2 direct children, 4
|
|
||||||
grandchildren, 1 errorer deadlocks while trying to
|
|
||||||
unwind through its parent chain.
|
|
||||||
|
|
||||||
After fix: `trio.TooSlowError`-free completion; the
|
|
||||||
root's `open_nursery` receives the
|
|
||||||
`BaseExceptionGroup` containing the `AssertionError`
|
|
||||||
from the errorer and unwinds cleanly.
|
|
||||||
|
|
||||||
## Stopgap (landed)
|
|
||||||
|
|
||||||
Until the fix lands, `test_nested_multierrors` +
|
|
||||||
related multi-level-spawn tests can be skip-marked
|
|
||||||
under `subint_forkserver` via
|
|
||||||
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
|
|
||||||
reason='...')`. Cross-ref this doc.
|
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- `tractor/spawn/_subint_forkserver.py::fork_from_worker_thread`
|
|
||||||
— the primitive whose post-fork FD hygiene is
|
|
||||||
probably the culprit.
|
|
||||||
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
|
|
||||||
— the backend function that orchestrates the
|
|
||||||
graceful cancel path hitting this bug.
|
|
||||||
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
|
- `tractor/spawn/_subint_forkserver.py::_ForkedProc`
|
||||||
— the `trio.Process`-compatible shim; NOT the
|
— the current teardown shim; PID-scoped, not tree-
|
||||||
failing component (confirmed via thread-dump).
|
scoped.
|
||||||
- `tests/test_cancellation.py::test_nested_multierrors`
|
- `tractor/spawn/_subint_forkserver.py::subint_forkserver_proc`
|
||||||
— the test that surfaced the hang.
|
— the spawn backend whose `finally` block needs the
|
||||||
|
tree-kill fix.
|
||||||
|
- `tests/test_cancellation.py` — the test module where the
|
||||||
|
leak shows up.
|
||||||
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
- `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
|
||||||
— sibling hang class; probably same underlying
|
— sibling tracker for a different forkserver-teardown
|
||||||
fork-FD-inheritance root cause.
|
class (orphaned child doesn't respond to SIGINT); may
|
||||||
|
turn out to share a root cause with this one.
|
||||||
- tractor issue #379 — subint backend tracking.
|
- tractor issue #379 — subint backend tracking.
|
||||||
|
|
|
||||||
|
|
@ -195,69 +195,6 @@ except ImportError:
|
||||||
_has_subints: bool = False
|
_has_subints: bool = False
|
||||||
|
|
||||||
|
|
||||||
def _close_inherited_fds(
|
|
||||||
    keep: frozenset[int] = frozenset({0, 1, 2}),
|
|
||||||
) -> int:
|
|
||||||
    '''
|
|
||||||
    Close every open file descriptor in the current process
|
|
||||||
    EXCEPT those in `keep` (default: stdio only).
|
|
||||||
|
|
||||||
    Intended as the first thing a post-`os.fork()` child runs
|
|
||||||
    after closing any communication pipes it knows about. This
|
|
||||||
    is the fork-child FD hygiene discipline that
|
|
||||||
    `subprocess.Popen(close_fds=True)` applies by default for
|
|
||||||
    its exec-based children, but which we have to implement
|
|
||||||
    ourselves because our `fork_from_worker_thread()` primitive
|
|
||||||
    deliberately does NOT exec.
|
|
||||||
|
|
||||||
    Why it matters
|
|
||||||
    --------------
|
|
||||||
    Without this, a forkserver-spawned subactor inherits the
|
|
||||||
    parent actor's IPC listener sockets, trio-epoll fd, trio
|
|
||||||
    wakeup-pipe, peer-channel sockets, etc. If that subactor
|
|
||||||
    then itself forkserver-spawns a grandchild, the grandchild
|
|
||||||
    inherits the FDs transitively from *both* its direct
|
|
||||||
    parent AND the root actor — IPC message routing becomes
|
|
||||||
    ambiguous and the cancel cascade deadlocks. See
|
|
||||||
    `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
|
|
||||||
    for the full diagnosis + the empirical repro.
|
|
||||||
|
|
||||||
    Fresh children will open their own IPC sockets via
|
|
||||||
    `_actor_child_main()`, so they don't need any of the
|
|
||||||
    parent's FDs.
|
|
||||||
|
|
||||||
    Returns the count of fds that were successfully closed —
|
|
||||||
    useful for sanity-check logging at callsites.
|
|
||||||
|
|
||||||
    '''
|
|
||||||
    # Enumerate open fds via `/proc/self/fd` on Linux (the fast +
|
|
||||||
    # precise path); fall back to `RLIMIT_NOFILE` range close on
|
|
||||||
    # other platforms. Matches stdlib
|
|
||||||
    # `subprocess._posixsubprocess.close_fds` strategy.
|
|
||||||
    try:
|
|
||||||
        fd_names: list[str] = os.listdir('/proc/self/fd')
|
|
||||||
        candidates: list[int] = [
|
|
||||||
            int(n) for n in fd_names if n.isdigit()
|
|
||||||
        ]
|
|
||||||
    except (FileNotFoundError, PermissionError):
|
|
||||||
        import resource
|
|
||||||
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
|
|
||||||
        candidates = list(range(3, soft))
|
|
||||||
|
|
||||||
    closed: int = 0
|
|
||||||
    for fd in candidates:
|
|
||||||
        if fd in keep:
|
|
||||||
            continue
|
|
||||||
        try:
|
|
||||||
            os.close(fd)
|
|
||||||
            closed += 1
|
|
||||||
        except OSError:
|
|
||||||
            # fd was already closed (race with listdir) or
|
|
||||||
            # otherwise unclosable — either is fine.
|
|
||||||
            pass
|
|
||||||
    return closed
|
|
||||||
|
|
||||||
|
|
||||||
def _format_child_exit(
|
def _format_child_exit(
|
||||||
status: int,
|
status: int,
|
||||||
) -> str:
|
) -> str:
|
||||||
|
|
@ -365,13 +302,9 @@ def fork_from_worker_thread(
|
||||||
pid: int = os.fork()
|
pid: int = os.fork()
|
||||||
if pid == 0:
|
if pid == 0:
|
||||||
# CHILD: close the pid-pipe ends (we don't use
|
# CHILD: close the pid-pipe ends (we don't use
|
||||||
# them here), then scrub ALL other inherited FDs
|
# them here), run the user callable if any, exit.
|
||||||
# so the child starts with a clean slate
|
|
||||||
# (stdio-only). Critical for multi-level spawn
|
|
||||||
# trees — see `_close_inherited_fds()` docstring.
|
|
||||||
os.close(rfd)
|
os.close(rfd)
|
||||||
os.close(wfd)
|
os.close(wfd)
|
||||||
_close_inherited_fds()
|
|
||||||
rc: int = 0
|
rc: int = 0
|
||||||
if child_target is not None:
|
if child_target is not None:
|
||||||
try:
|
try:
|
||||||
|
|
|
||||||