tractor

Commit Graph

Author	SHA1	Message	Date
Gud Boi	4c133ab541	Default `pytest` to use `--capture=sys` Lands the capture-pipe workaround from the prior cluster of diagnosis commits: switch pytest's `--capture` mode from the default `fd` (redirects fd 1,2 to temp files, which fork children inherit and can deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd 1,2 left alone). Trade-off documented inline in `pyproject.toml`: - LOST: per-test attribution of raw-fd output (C-ext writes, `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI capture, just not per-test-scoped in the failure report. - KEPT: `print()` + `logging` capture per-test (tractor's logger uses `sys.stderr`). - KEPT: `pytest -s` debugging behavior. This allows us to re-enable `test_nested_multierrors` without skip-marking + clears the class of pytest-capture-induced hangs for any future fork-based backend tests. Deats, - `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of rationale comment cross-ref'ing the post-mortem doc - `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')` from `test_nested_multierrors` — no longer needed. * file-level `pytestmark` covers any residual. - `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail mark loosened from `strict=True` to `strict=False` + reason rewritten. * it passes in isolation but is session-env-pollution sensitive (leftover subactor PIDs competing for ports / inheriting harness FDs). * tolerate both outcomes until suite isolation improves. - `test_shm`: extend the existing `skipon_spawn_backend('subint', ...)` to also skip `'subint_forkserver'`. * Different root cause from the cancel-cascade class: `multiprocessing.shared_memory.SharedMemory`'s `resource_tracker` + internals assume fresh-process state, don't survive fork-without-exec cleanly - `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test (unrelated to forkserver; just a flaky-under-load bump). 
- `tractor.spawn._subint_forkserver`: inline comment-only future-work marker right before `_actor_child_main()` describing the planned conditional stdout/stderr-to-`/dev/null` redirect for cases where `--capture=sys` isn't enough (no code change — the redirect logic itself is deferred). EXTRA NOTEs ----------- The `--capture=sys` approach is the minimum-invasive fix: just a pytest ini change, no runtime code change, works for all fork-based backends, trade-offs well-understood (terminal-level capture still happens, just not pytest's per-test attribution of raw-fd output). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-24 14:17:23 -04:00
Gud Boi	eceed29d4a	Pin forkserver hang to pytest `--capture=fd` Sixth and final diagnostic pass — after all 4 cascade fixes landed (FD hygiene, pidfd wait, `_parent_chan_cs` wiring, bounded peer-clear), the actual last gate on `test_nested_multierrors[subint_forkserver]` turned out to be pytest's default `--capture=fd` stdout/stderr capture, not anything in the runtime cascade. Empirical result: `pytest -s` → test PASSES in 6.20s. Default `--capture=fd` → hangs forever. Mechanism: pytest replaces the parent's fds 1,2 with pipe write-ends it reads from. Fork children inherit those pipes (since `_close_inherited_fds` correctly preserves stdio). The error-propagation cascade in a multi-level cancel test generates 7+ actors each logging multiple `RemoteActorError` / `ExceptionGroup` tracebacks — enough output to fill Linux's 64KB pipe buffer. Writes block, subactors can't progress, processes don't exit, `_ForkedProc.wait` hangs. Self-critical aside: I earlier tested w/ and w/o `-s` and both hung, concluding "capture-pipe ruled out". That was wrong — at that time fixes 1-4 weren't all in place, so the test was failing at deeper levels long before reaching the "produce lots of output" phase. Once the cascade could actually tear down cleanly, enough output flowed to hit the pipe limit. Order-of-operations mistake: ruling something out based on a test that was failing for a different reason. 
Deats, - `subint_forkserver_test_cancellation_leak_issue.md`: new section "Update — VERY late: pytest capture pipe IS the final gate" w/ DIAG timeline showing `trio.run` fully returns, diagnosis of pipe-fill mechanism, retrospective on the earlier wrong ruling-out, and fix direction (redirect subactor stdout/stderr to `/dev/null` in fork-child prelude, conditional on pytest-detection or opt-in flag) - `tests/test_cancellation.py`: skip-mark reason rewritten to describe the capture-pipe gate specifically; cross-refs the new doc section - `tests/spawn/test_subint_forkserver.py`: the orphan-SIGINT test regresses back to xfail. Previously passed after the FD-hygiene fix, but the new `wait_for_no_more_peers(move_on_after=3.0)` bound in `async_main`'s teardown added up to 3s latency, pushing orphan-subactor exit past the test's 10s poll window. Real fix: faster orphan-side teardown OR extend poll window to 15s. No runtime code changes in this commit — just test-mark adjustments + doc wrap-up. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 23:18:14 -04:00
Gud Boi	c20b05e181	Use `pidfd` for cancellable `_ForkedProc.wait` Two coordinated improvements to the `subint_forkserver` backend: 1. Replace `trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)` in `_ForkedProc.wait()` with `trio.lowlevel.wait_readable(pidfd)`. The prior version blocked a trio cache thread on a sync syscall — outer cancel scopes couldn't unwedge it when something downstream got stuck. Same pattern `trio.Process.wait()` and `proc_waiter` (the mp backend) already use. 2. Drop the `@pytest.mark.xfail(strict=True)` from `test_orphaned_subactor_sigint_cleanup_DRAFT` — the test now PASSES after `0cd0b633` (fork-child FD scrub). Same root cause as the nested-cancel hang: inherited IPC/trio FDs were poisoning the child's event loop. Closing them lets SIGINT propagation work as designed. Deats, - `_ForkedProc.__init__` opens a pidfd via `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+) - `wait()` parks on `trio.lowlevel.wait_readable()`, then non-blocking `waitpid(WNOHANG)` to collect the exit status (correct since the pidfd signal IS the child-exit notification) - `ChildProcessError` swallow handles the rare race where someone else reaps first - pidfd closed after `wait()` completes (one-shot semantics) + `__del__` belt-and-braces for unexpected-teardown paths - test docstring's `@xfail` block replaced with a `# NOTE` comment explaining the historical context + cross-ref to the conc-anal doc; test remains in place as a regression guard The two changes are interdependent — the cancellable `wait()` matters for the same nested-cancel scenarios the FD scrub fixes, since the original deadlock had trio cache workers wedged in `os.waitpid` swallowing the outer cancel. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	1e357dcf08	Mv `test_subint_cancellation.py` to `tests/spawn/` subpkg Also, some slight touchups in `.spawn._subint`.	2026-04-23 18:48:34 -04:00
Gud Boi	f5f37b69e6	Shorten some timeouts in `subint_forkserver` suites	2026-04-23 18:48:34 -04:00
Gud Boi	a72deef709	Refine `subint_forkserver` orphan-SIGINT diagnosis Empirical follow-up to the xfail'd orphan-SIGINT test: the hang is not "trio can't install a handler on a non-main thread" (the original hypothesis from the `child_sigint` scaffold commit). On py3.14: - `threading.current_thread() is threading.main_thread()` IS True post-fork — CPython re-designates the fork-inheriting thread as "main" correctly - trio's `KIManager` SIGINT handler IS installed in the subactor (`signal.getsignal(SIGINT)` confirms) - the kernel DOES deliver SIGINT to the thread But `faulthandler` dumps show the subactor wedged in `trio/_core/_io_epoll.py::get_events` — trio's wakeup-fd mechanism (which turns SIGINT into an epoll-wake) isn't firing. So the `except KeyboardInterrupt` at `tractor/spawn/_entry.py::_trio_main:164` — the runtime's intentional "KBI-as-OS-cancel" path — never fires. Deats, - new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md` (+385 LOC): full writeup — TL;DR, symptom reproducer, the "intentional cancel path" the bug defeats, diagnostic evidence (`faulthandler` output + `getsignal` probe), ruled-out hypotheses (non-main-thread issue, wakeup-fd inheritance, KBI-as-trio-check-exception), and fix directions - `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason` + test docstring rewritten to match the refined understanding — old wording blamed the non-main-thread path, new wording points at the `epoll_wait` wedge + cross-refs the new conc-anal doc - `_subint_forkserver` module docstring's `child_sigint='trio'` bullet updated: now notes trio's handler is already correctly installed, so the flag may end up a no-op / doc-only mode once the real root cause is fixed Closing the gap aligns with existing design intent (make the already-designed "KBI-as-OS-cancel" behavior actually fire), not a new feature. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	76605d5609	Add DRAFT `subint_forkserver` orphan-SIGINT test Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT` documents an empirical SIGINT-delivery gap in the `subint_forkserver` backend: when the parent dies via `SIGKILL` (no IPC `Portal.cancel_actor()` possible) and `SIGINT` is sent to the orphan child, the child DOES NOT unwind — CPython's default `KeyboardInterrupt` is delivered to `threading.main_thread()`, whose tstate is dead in the post-fork child bc fork inherited the worker thread, not main. Trio running on the fork-inherited worker thread therefore never observes the signal. Marked `xfail(strict=True)` so the mark flips to XPASS→fail once the backend grows explicit SIGINT plumbing. Deats, - harness runs the failure-mode sequence out-of-process: 1. harness subprocess runs a fresh Python script that calls `try_set_start_method('subint_forkserver')` then opens a root actor + one `sleep_forever` subactor 2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers off harness `stdout` to confirm IPC handshake completed 3. `SIGKILL` the parent, `proc.wait()` to reap the zombie (otherwise `os.kill(pid, 0)` keeps reporting it alive) 4. assert the child survived the parent-reap (i.e. was actually orphaned, not reaped too) before moving on 5. 
`SIGINT` the orphan child, poll `os.kill(child_pid, 0)` every 100ms for up to 10s - supporting helpers: `_read_marker()` with per-proc bytes-buffer to carry partial lines across calls, `_process_alive()` liveness probe via `kill(pid, 0)` - Linux-only via `platform.system() != 'Linux'` skip — orphan-reparenting semantics don't generalize to other platforms - port offset (`reg_addr[1] + 17`) so the harness listener doesn't race concurrently-running backend tests - best-effort `finally:` cleanup: `SIGKILL` any still-alive pids + `proc.kill()` + bounded `proc.wait()` to avoid leaking orphans across the session Also, tier-4 header comment documents the cross-backend generalization path: applicable to any multi-process backend (`trio`, `mp_spawn`, `mp_forkserver`, `subint_forkserver`), NOT to plain `subint` (in-process subints have no orphan OS-child). Move path: lift harness into `tests/_orphan_harness.py`, parametrize on session `_spawn_method`, add `skipif _spawn_method == 'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	26914fde75	Wire `subint_forkserver` as first-class backend Promote `_subint_forkserver` from primitives-only into a registered spawn backend: `'subint_forkserver'` is now a `SpawnMethodKey` literal, dispatched via `_methods` to the new `subint_forkserver_proc()` target, feature-gated under the existing `subint`-family py3.14+ case, and selectable via `--spawn-backend=subint_forkserver`. Deats, - new `subint_forkserver_proc()` spawn target in `_subint_forkserver`: - mirrors `trio_proc()`'s supervision model — real OS subprocess so `Portal.cancel_actor()` + `soft_kill()` on graceful teardown, `os.kill(SIGKILL)` on hard-reap (no `_interpreters.destroy()` race to fuss over bc the child lives in its own process) - only real diff from `trio_proc` is the spawn mechanism: fork from a main-interp worker thread via `fork_from_worker_thread()` (off-loaded to trio's thread pool) instead of `trio.lowlevel.open_process()` - child-side `_child_target` closure runs `tractor._child._actor_child_main()` with `spawn_method='trio'` — the child is a regular trio actor, "subint_forkserver" names how the parent spawned, not what the child runs - new `_ForkedProc` class — thin `trio.Process`-compatible shim around a raw OS pid: `.poll()` via `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio cache thread, `.kill()` via `SIGKILL`, `.returncode` cached for repeat calls. `.stdin`/`.stdout`/`.stderr` are `None` (fork-w/o-exec inherits parent FDs; we don't marshal them) which matches `soft_kill()`'s `is not None` guards Also, new backend-tier test `test_subint_forkserver_spawn_basic` drives the registered backend end-to-end via `open_root_actor` + `open_nursery` + `run_in_actor` w/ a trivial portal-RPC round-trip. Uses a `forkserver_spawn_method` fixture to flip `_spawn_method`/`_ctx` for the test's duration + restore on teardown (so other session-level tests don't observe the global flip). 
Test module docstring reworked to describe the three tiers now covered: (1) primitive-level, (2) parent-trio-driven primitives, (3) full registered backend. Status: still-open work (tracked on `tractor#379`) doc'd inline in the module docstring — no cancel/hard-kill stress coverage yet, child-side subint-hosted root runtime still future (gated on `msgspec#563`), thread-hygiene audit pending the same unblock. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00
Gud Boi	25e400d526	Add trio-parent tests for `_subint_forkserver` New pytest module `tests/spawn/test_subint_forkserver.py` drives the forkserver primitives from inside a real `trio.run()` in the parent — the runtime shape tractor will actually use when we wire up a `subint_forkserver` spawn backend proper. Complements the standalone no-trio-in-parent `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`. Deats, - new test pkg `tests/spawn/` (+ empty `__init__.py`) - two tests, both `@pytest.mark.timeout(30, method='thread')` for the GIL-hostage safety reason doc'd in `ai/conc-anal/subint_sigint_starvation_issue.md`: - `test_fork_from_worker_thread_via_trio` — parent-side plumbing baseline. `trio.run()` off-loads forkserver prims via `trio.to_thread.run_sync()` + asserts the child reaps cleanly - `test_fork_and_run_trio_in_child` — end-to-end: forked child calls `run_subint_in_worker_thread()` with a bootstrap str that does `trio.run()` in a fresh subint - both tests wrap the inner `trio.run()` in a `dump_on_hang()` for post-mortem if the outer `pytest-timeout` fires - intentionally NOT using `--spawn-backend` — the tests drive the primitives directly rather than going through tractor's spawn-method registry (which the forkserver isn't plugged into yet) Also, rename `run_trio_in_subint()` → `run_subint_in_worker_thread()` for naming consistency with the sibling `fork_from_worker_thread()`. The action is really "host a subint on a worker thread", not specifically "run trio" — trio just happens to be the typical payload. Propagate the rename to the smoketest. Further, add a "TODO — cleanup gated on msgspec PEP 684 support" section to the `_subint_forkserver` module docstring: flags the dedicated-`threading.Thread` design as potentially-revisable once isolated-mode subints are viable in tractor. Cross-refs `msgspec#563` + `tractor#379` and points at an audit-plan conc-anal doc we'll add next. 
(this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code	2026-04-23 18:48:34 -04:00

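Illustrative note on 76605d5609: the harness description says the killed parent must be reaped because `os.kill(pid, 0)` otherwise keeps reporting it alive. A minimal sketch of that behavior follows, assuming Linux. `process_alive` and `zombie_then_reaped` are hypothetical names, not the test module's `_process_alive()` helper.

```python
import os
from typing import Tuple


def process_alive(pid: int) -> bool:
    """Liveness probe: signal 0 performs error checking only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The pid exists but belongs to another user.
        return True
    return True


def zombie_then_reaped() -> Tuple[bool, bool]:
    """Return the probe result before and after reaping an exited child."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    alive_as_zombie = process_alive(pid)
    # Reap the zombie; only now does the pid disappear.
    os.waitpid(pid, 0)
    alive_after_reap = process_alive(pid)
    return alive_as_zombie, alive_after_reap
```

An unreaped zombie still occupies its pid, so the probe succeeds until some waiter collects the exit status.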
9 Commits (2ca0f41e61cbb980b5a0d7863a4b0f801552e95f)
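
Illustrative note on eceed29d4a: that commit attributes the hang to subactor output filling an inherited pipe that is not drained fast enough. The sketch below does not reproduce pytest's capture machinery. It only measures how many bytes an unread pipe accepts before a writer would block, and `pipe_capacity` is a hypothetical helper name.

```python
import os


def pipe_capacity(chunk_size: int = 4096) -> int:
    """Count the bytes an unread pipe accepts before writes would block."""
    read_end, write_end = os.pipe()
    # Non-blocking mode turns "would block forever" into BlockingIOError.
    os.set_blocking(write_end, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += os.write(write_end, chunk)
            except BlockingIOError:
                # Pipe is full; a blocking writer would stall right here.
                return total
    finally:
        os.close(read_end)
        os.close(write_end)
```

On Linux the default capacity is typically 64KB, matching the figure quoted in the commit message. A blocking writer that exceeds it stalls until a reader drains the pipe.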