Commit Graph

67 Commits (eceed29d4a6b04edb5b04c555666944aac3b79d9)

Author SHA1 Message Date
Gud Boi eceed29d4a Pin forkserver hang to pytest `--capture=fd`
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.

Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.

Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.

Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.

Deats,
- `subint_forkserver_test_cancellation_leak_issue
  .md`: new section "Update — VERY late: pytest
  capture pipe IS the final gate" w/ DIAG timeline
  showing `trio.run` fully returns, diagnosis of
  pipe-fill mechanism, retrospective on the
  earlier wrong ruling-out, and fix direction
  (redirect subactor stdout/stderr to `/dev/null`
  in fork-child prelude, conditional on
  pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
  rewritten to describe the capture-pipe gate
  specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
  orphan-SIGINT test regresses back to xfail.
  Previously passed after the FD-hygiene fix,
  but the new `wait_for_no_more_peers(
  move_on_after=3.0)` bound in `async_main`'s
  teardown added up to 3s latency, pushing
  orphan-subactor exit past the test's 10s poll
  window. Real fix: faster orphan-side teardown
  OR extend poll window to 15s

No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 23:18:14 -04:00
Gud Boi 506617c695 Skip-mark + narrow `subint_forkserver` cancel hang
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test
   matrix while the remaining bug is being chased.
   The reason string cross-refs the conc-anal doc
   for full context.

2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`) with the
   empirical state after the three nested- cancel fix commits
   (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
   shield break) landed, narrowing the remaining hang from "everything
   broken" to "peer-channel loops don't exit on `service_tn` cancel".

Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804`
  works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
  channel handlers in `handle_stream_from_peer`
  (inbound connections handled by
  `stream_handler_tn`, which IS `service_tn` in the
  current config).
- after `_parent_chan_cs.cancel()` fires, NEW
  shielded loops appear on the session reg_addr
  port — probably discovery-layer reconnection;
  doesn't block teardown but indicates the cascade
  has more moving parts than expected.

The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 1af2121057 Wire `reg_addr` through leaky cancel tests
Stopgap companion to d0121960 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:48:34 -04:00
Gud Boi 4b2a0886c3 Mark `subint`-hanging tests with `skipon_spawn_backend`
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b521) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:47:49 -04:00
Gud Boi 93d99ed2eb Move `get_cpu_state()` to `conftest` as shared latency headroom
Factor the CPU-freq-scaling helper out of
`test_legacy_one_way_streaming` into `conftest.py`
alongside a new `cpu_scaling_factor()` convenience fn
that returns a latency-headroom multiplier (>= 1.0).

Apply it to the two other flaky-timeout tests,
- `test_cancel_via_SIGINT_other_task`: 2s -> scaled
- `test_example[we_are_processes.py]`: 16s -> scaled

Deats,
- add `get_cpu_state()` + `cpu_scaling_factor()` to
  `conftest.py` so all test mods can share the logic.
- catch `IndexError` (empty glob) in addition to
  `FileNotFoundError`.
- rename `factor` var -> `headroom` at call sites for
  clarity on intent.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-02 17:59:13 -04:00
Gud Boi 066011b83d Bump `fail_after` delay on non-linux for sync-sleep test
Use 6s timeout on non-linux (vs 4s) in
`test_cancel_while_childs_child_in_sync_sleep()` to avoid
flaky `TooSlowError` on slower CI runners.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-03-13 21:10:52 -04:00
Gud Boi 6ee0149e8d Another cancellation test timeout bump for non-linux 2026-03-09 19:46:42 -04:00
Tyler Goodlet 88c1c083bd Add timeout to inf-streamer test 2025-08-18 13:31:15 -04:00
Tyler Goodlet b096867d40 Remove lingering seg=False-flags from tests 2025-08-18 12:03:32 -04:00
Tyler Goodlet d2ac9ecf95 Resolve `test_cancel_while_childs_child_in_sync_sleep`
Was failing due to the `.fail_after()` timeout being *too short* and
somehow the new interplay of that with strict-exception groups resulting
in the `TooSlowError` never raising but instead an eg with the embedded
`AssertionError`?? I still don't really get it honestly..

I've written up lengthy notes around the different `delay` settings that
can be used to see the diff outcomes, the failing case being the one
i still don't really grok and think is justification for `trio` to
bubble inner `Cancelled`s differently possibly?

For now i've included the original failing case as an `xfail`
parametrization for now which will hopefully drive a follow lowlevel
`trio` test in `test_trioisms`!
2025-08-18 10:46:37 -04:00
Tyler Goodlet 8218f0f51f Bit of multi-line styling / name tweaks in cancellation suites 2025-08-18 10:46:37 -04:00
Tyler Goodlet 8f7c022afe Various test tweaks related to 3.13 egs
Including changes like,
- loose eg flagging in various test emedded `trio.open_nursery()`s.
- changes to eg handling (like using `except*`).
- added `debug_mode` integration to tests that needed some REPLin
  in order to figure out appropriate updates.
2025-03-27 13:38:47 -04:00
Tyler Goodlet 41a3297b9f Tweak some test asserts to better `is` style 2025-03-27 13:24:25 -04:00
Tyler Goodlet 683288c8db Update tests for `PldRx` and `Context` changes
Mostly adjustments for the new pld-receiver semantics/shim-layer which
results more often in the direct delivery of `RemoteActorError`s from
IPC API primitives (like `Portal.result()`) instead of being embedded in
an `ExceptionGroup` bundled from an embedded nursery.

Tossed usage of the `debug_mode: bool` fixture to a couple problematic
tests while i was working on them.

Also includes detailed assertion updates to the inter-peer cancellation
suite in terms of,
- `Context.canceller` state correctly matching the true src actor when
  expecting a ctxc.
- any rxed `ContextCancelled` should instance match the `Context._local/remote_error`
  as should the `.msgdata` and `._ipc_msg`.
2025-03-24 14:04:51 -04:00
Tyler Goodlet dd9fe0b043 Add `tests/__init__.py` for `.conftest` imports
I must have had a local touched file but never committed or something?
Seems that new `pytest` requires a top level `tests` pkg in order for
relative `.conftest` imports to work.
2025-03-20 20:53:54 -04:00
Tyler Goodlet 5bf550b64a Adjust all `RemoteActorError.type` using tests
To instead use the new `.boxed_type` B)
2025-03-20 20:35:02 -04:00
Tyler Goodlet dec2b1f0f5 Reapply "Port all tests to new `reg_addr` fixture name"
This reverts-the-revert of commit
bc13599e1f which was needed to land pre
`multihomed` feat branch history.
2025-03-20 19:50:31 -04:00
Tyler Goodlet bc13599e1f Revert "Port all tests to new `reg_addr` fixture name"
This reverts commit 715348c5c2.
2025-03-19 15:34:30 -04:00
Tyler Goodlet a87df3009f Drop now-deprecated deps on modern `trio`/Python
- `trio_typing` is nearly obsolete since `trio >= 0.23`
- `exceptiongroup` is built-in to python 3.11
- `async_generator` primitives have lived in `contextlib` for quite
  a while!
2025-03-16 16:06:24 -04:00
Tyler Goodlet 389b305d3b Add (back) a `tractor._testing` sub-pkg
Since importing from our top level `conftest.py` is not scaleable
or as "future forward thinking" in terms of:
- LoC-wise (it's only one file),
- prevents "external" (aka non-test) example scripts from importing
  content easily,
- seemingly(?) can't be used via abs-import if using
  a `[tool.pytest.ini_options]` in a `pyproject.toml` vs.
  a `pytest.ini`, see:
  https://docs.pytest.org/en/8.0.x/reference/customize.html#pyproject-toml)

=> Go back to having an internal "testing" pkg like `trio` (kinda) does.

Deats:
- move generic top level helpers into pkg-mod including the new
  `expect_ctxc()` (which i needed in the advanced faults testing script.
- move `@tractor_test` into `._testing.pytest` sub-mod.
- adjust all the helper imports to be a `from tractor._testing import <..>`

Rework `test_ipc_channel_break_during_stream()` and backing script:
- make test(s) pull `debug_mode` from new fixture (which is now
  controlled manually from `--tpdb` flag) and drop the previous
  parametrized input.
- update logic in ^ test for "which-side-fails" cases to better match
  recently updated/stricter cancel/failure semantics in terms of
  `ClosedResouruceError` vs. `EndOfChannel` expectations.
- handle `ExceptionGroup`s with expected embedded errors in test.
- better pendantics around whether to expect a user simulated KBI.
- for `examples/advanced_faults/ipc_failure_during_stream.py` script:
  - generalize ipc breakage in new `break_ipc()` with support for diff
    internal `trio` methods and a #TODO for future disti frameworks
  - only make one sub-actor task break and the other just stream.
  - use new `._testing.expect_ctxc()` around ctx block.
  - add a bit of exception handling with `print()`s around ctxc (unused
    except if 'msg' break method is set) and eoc cases.
  - don't break parent side ipc in loop any more then once
    after first break, checked via flag var.
  - add a `pre_close: bool` flag to control whether
    `MsgStreama.aclose()` is called *before* any ipc breakage method.

Still TODO:
- drop `pytest.ini` and add the alt section to `pyproject.py`.
 -> currently can't get `--rootdir=` opt to work.. not showing in
   console header.
 -> ^ also breaks on 'tests' `enable_modules` imports in subactors
   during discovery tests?
2025-03-16 15:28:28 -04:00
Tyler Goodlet 664ae87588 Make `@context`-cancelled tests more pedantic
In order to match a very significant and coming-soon patch set to the
IPC `Context` and `Channel` cancellation semantics with significant but
subtle changes to the primitives and runtime logic:

- a new set of `Context` state pub meth APIs for checking exact
  inter-actor-linked-task outcomes such as `.outcome`, `.maybe_error`,
  and `.cancel_acked`.

- trying to move away from `Context.cancelled_caught` usage since the
  semantics from `trio` don't really map well (in terms of cancel
  requests and how they result in cancel-scope graceful closure) and
  `.cancel_acked: bool` is a better approach for IPC req-resp msging.
  - change test usage to access `._scope.cancelled_caught` directly.

- more pedantic ctxc-raising expects around the "type of self
  cancellation" and final outcome in ctxc cases:
  - `ContextCancelled` is raised by ctx (`Context.result()`) consumer
    methods when `Portal.cancel_actor()` is called (since it's an
    out-of-band request) despite `Channel._cancel_called` being set.
  - also raised by `.open_context().__aexit__()` on close.
  - `.outcome` is always `.maybe_error` is always one of
    `._local/remote_error`.
2025-03-14 22:18:31 -04:00
Tyler Goodlet 715348c5c2 Port all tests to new `reg_addr` fixture name 2025-03-14 13:42:15 -04:00
Tyler Goodlet 347591c348 Expect egs in tests which retreive portal results 2022-10-14 19:42:23 -04:00
Tyler Goodlet 0f523b65fb Change cancel test over the exception group 2022-10-14 18:16:51 -04:00
Tyler Goodlet d24fae8381 'Rename mp spawn methods to have a `'mp_'` prefix' 2022-10-09 17:54:55 -04:00
Tyler Goodlet b3ff4b7804 Increase some timeouts for windows 2022-01-21 12:20:06 -05:00
Tyler Goodlet d65912e1ae Increase kbi delay in remote cancel test 2021-12-17 09:38:04 -05:00
Tyler Goodlet a29924f330 Don't assume exception order from nursery 2021-12-02 08:45:58 -05:00
Tyler Goodlet 16a3321a38 Increase timeout for windows.. 2021-11-29 21:52:30 -05:00
Tyler Goodlet 121f7fd844 Draft test that shows a slow daemon cancellation
Currently if the spawn task is waiting on a daemon actor it is likely in
`await proc.wait()`, however, if the actor nursery is subsequently
cancelled this checkpoint will be abandoned and the hard proc reaping
sequence will execute which results in a up to 3 second wait before
a "hard" system signal is sent to the child.  Ideally such
a cancelled-during-daemon-actor-wait condition is instead handled by
first trying to cancel the remote actor using `Portal.cancel_actor()` (a
"graceful" remote cancel request) which should (presuming normal runtime
operation) result in an immediate collection of the process after normal
actor (remotely triggered) runtime cancellation.
2021-11-29 16:03:14 -05:00
Tyler Goodlet 4f222a5f9c Use type match of expected error 2021-10-15 10:25:50 -04:00
Tyler Goodlet 533457c64d Handle nested multierror case on windows 2021-10-15 09:16:51 -04:00
Tyler Goodlet 7ee121aeaf Try to handle variable windows errors 2021-10-14 13:39:46 -04:00
Tyler Goodlet b372f4c92b Handle top level multierror that presents now? 2021-07-02 11:55:16 -04:00
Tyler Goodlet 2efd8ed167 Drop run and rpc_module_paths from cancel tests 2021-05-07 11:21:40 -04:00
Tyler Goodlet 2498a4963b Update all tests to new streaming API 2021-04-28 12:23:14 -04:00
Tyler Goodlet 1f1619c730 Convert all test suite sync funcs 2021-04-27 12:08:30 -04:00
Tyler Goodlet 0eba5f4708 Port remaining tests to pass func refs 2020-12-22 10:39:47 -05:00
Tyler Goodlet a668f714d5 Allow passing function refs to `Portal.run()`
This resolves and completes #69 allowing all RPC invocation APIs to pass
function references directly instead of explicit `str` names for the
target namespace and function (this is still done implicitly
underneath).  This brings us closer to `trio`'s task running API as well
as acknowledges that any inter-host RPC system (and API) will likely
need to be implemented on top of local RPC primitives anyway. Even if
this ends up **not** being true we can always go to "function stubs" as
part of our IAC protocol or, add a new method to do explicit namespace
calls: `.run_from_module()` or whatever everyone votes on.

Resolves #69

Further, this commit drops `Actor.statespace` from the entire system
since a user can easily get this same functionality using module
level variables. Fix docs to match all these changes (luckily mostly
already done due to example scripts referencing).
2020-12-21 09:09:55 -05:00
Tyler Goodlet 1b6ee2ecf6 Skip sync sleep test on windows 2020-10-13 15:26:46 -04:00
Tyler Goodlet 666966097a Revert "Change to relative conftest.py imports"
This reverts commit 2b53c74b1c.
2020-10-13 14:42:02 -04:00
Tyler Goodlet 24ef919334 Skip sync sleep test on mp backend 2020-10-13 14:16:20 -04:00
Tyler Goodlet 0e344eead8 Add a "cancel arrives during a sync sleep in child" test
This appears to demonstrate the same bug found in #156. It looks like
cancelling a subactor with a child, while that child is running sync code,
can result in the child never getting cancelled due to some strange
condition where the internal nurseries aren't being torn down as
expected when a `trio.Cancelled` is raised.
2020-10-12 23:25:22 -04:00
Tyler Goodlet 2b53c74b1c Change to relative conftest.py imports 2020-10-05 11:58:58 -04:00
Tyler Goodlet da56d0f043 Add slight delays to SIGINT tests on mp 2020-07-29 13:27:15 -04:00
Tyler Goodlet e8a38e4d15 Fix cancelled type handling 2020-07-27 11:15:05 -04:00
Tyler Goodlet 3c7ec72f8e Fix SIGINT test names 2020-07-26 23:37:44 -04:00
Tyler Goodlet 5a27065a10 Finally tame the super flaky tests
- ease up on first stream test run deadline
- skip streaming tests in CI for mp backend, period
- give up on > 1 depth nested spawning with mp
- completely give up on slow spawning on windows
2020-07-26 22:53:40 -04:00
Tyler Goodlet 891edbab5f Run the trio spawner in nested tests 2020-07-25 18:19:17 -04:00
Tyler Goodlet dddbeb0e71 Run Windows on trio and mp backends
The new pure trio spawning backend uses `subprocess` internally which is
also supported on windows so let's run it in CI.
2020-07-25 13:41:48 -04:00