Compare commits

..

No commits in common. "f42dfe50a9bb679b56f4f11fc5fe43cf182c4801" and "3a45dbd5039aa4559ea38293952415332b795ac9" have entirely different histories.

7 changed files with 33 additions and 195 deletions

View File

@ -1,102 +0,0 @@
# `trio` 0.29 -> 0.33 slows the depth=3 cancel-cascade
## Symptom
After locking to `trio==0.33.0` (commit `c7741bba`, was
`0.29.0`), this test reliably trips its `fail_after`
deadline on the **`trio`** backend:
```
FAILED tests/test_cancellation.py::test_nested_multierrors[start_method=trio-depth=3]
- AssertionError: assert False
where False = isinstance(
Cancelled(source='deadline', source_task=None, reason=None),
tractor.RemoteActorError,
)
```
A `fail_after_w_trace` hang-snapshot is captured for the
test each run (deadline-injected `Cancelled` wrapped into
the actor-nursery `BaseExceptionGroup`).
## Root cause (immediate)
The test budgets `fail_after(6)` for the `trio` backend.
That 6s was chosen (commit `32955db0`, while `trio==0.29`)
with the assertion that trio finishes "well under" 6s.
The `trio` 0.29 -> 0.33 bump slowed the depth=3 cascade
past that budget, so the 6s deadline now fires mid-cascade.
trio 0.33 added **cancel-reason tracking** — every
`Cancelled` now carries `(source=, reason=, source_task=)`.
The injected exc is `Cancelled(source='deadline')`, i.e.
trio itself naming our `fail_after(6)` scope as the cancel
origin. When that `Cancelled` collapses one branch of the
nursery BEG, the test's `isinstance(subexc,
RemoteActorError)` assertion fails. The healthy outcome is
`BEG = [RemoteActorError, RemoteActorError]`; the
`Cancelled` is purely an artifact of the deadline cutting
the cascade short.
## Measurements (standalone, this machine)
```
depth=1 trio ~3.15s PASS (keeps 6s budget)
depth=3 trio ~6.8-8.2s FAIL @ 6s (now bumped to 12s)
```
depth=1 still fits comfortably; only depth=3 (deeper
recursive spawn-and-error tree => more actors to reap)
exceeds the old budget. The ~2s/depth-level cost looks
like serialized per-actor reap / `terminate_after` waits.
## Mitigation applied
`test_nested_multierrors` now splits the `trio` budget:
```python
case ('trio', 1):
timeout = 6
case ('trio', 3):
timeout = 12 # was 6; see this doc
```
This stops the deadline from firing so the cascade
completes naturally to `[RAE, RAE]`.
## Also affected — same root cause, different test
`test_echoserver_detailed_mechanics[trio-raise_error=KeyboardInterrupt]`
(`tests/test_infected_asyncio.py`) tripped the *same*
slowdown via its much tighter `trio` budget of `1s`. The
single-aio-subactor teardown now takes ~1s, so the `1s`
`fail_after` raced the deadline (PASS at 0.99s / FAIL at
1.03s across back-to-back standalone runs). On a deadline-
fire the injected `Cancelled(source='deadline')` wraps the
mid-stream `KeyboardInterrupt` into a `BaseExceptionGroup`,
which is NOT a `KeyboardInterrupt` so the bare
`pytest.raises(KeyboardInterrupt)` fails. (The sibling
`raise_error=Exception` variant only "passes" by accident:
an `ExceptionGroup` *is-a* `Exception`, so its
`pytest.raises(Exception)` still matches even when wrapped.)
Mitigation: bump that `trio` budget `1 -> 4s` (matching the
forking-spawner case). Without a deadline-fire the KBI
propagates bare and the assertion passes.
## Open follow-up (the actual regression)
The budget bump is a band-aid — the underlying question is
**why** the depth=3 `trio` cancel-cascade went from <6s to
~7-8s across `trio` 0.29 -> 0.33. Candidate avenues:
- which scope owns the per-actor `terminate_after` wait,
and are the tree's reaps concurrent or serialized?
- did trio 0.33's abort/reschedule or cancel-reason
bookkeeping change checkpoint timing on the cancel path?
If/when the cascade speeds back up under-budget, depth=3
will start completing well under 12s — at which point the
budget can be tightened back toward 6s as a regression
tripwire. Related (different backend, same cascade class):
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.

View File

@ -5,7 +5,6 @@
from __future__ import annotations
import platform
import os
import re
import signal
import time
from typing import (
@ -295,26 +294,6 @@ def expect(
PROMPT = r"\(Pdb\+\)"
# Strip terminal color / ANSI-VT100 escape sequences so
# substring matching against REPL + traceback output stays
# robust to color leakage — Python 3.13's colored tracebacks,
# `pdbp`'s pygments highlighting, etc. — even when
# `PYTHON_COLORS=0` (set in the `spawn` fixture) isn't honored
# by every renderer in the spawned subproc.
# Regex per https://stackoverflow.com/a/14693789
_ansi_re: re.Pattern = re.compile(
r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])'
)
def ansi_strip(text: str) -> str:
'''
Remove ANSI/VT100 escape sequences from `text`.
'''
return _ansi_re.sub('', text)
def in_prompt_msg(
child: SpawnBase,
parts: list[str],
@ -334,7 +313,7 @@ def in_prompt_msg(
'''
__tracebackhide__: bool = False
before: str = ansi_strip(str(child.before.decode()))
before: str = str(child.before.decode())
for part in parts:
if part not in before:
if pause_on_false:
@ -354,9 +333,9 @@ def in_prompt_msg(
return True
# NB: color-char stripping (so we can match against call-stack
# frame output from the `ll` command and the like) is handled by
# `ansi_strip()` applied inside `in_prompt_msg()` + below.
# TODO: todo support terminal color-chars stripping so we can match
# against call stack frame output from the the 'll' command the like!
# -[ ] SO answer for stipping ANSI codes: https://stackoverflow.com/a/14693789
def assert_before(
child: SpawnBase,
patts: list[str],
@ -377,7 +356,7 @@ def assert_before(
err_on_false=True,
**kwargs
)
before: str = ansi_strip(str(child.before.decode()))
before: str = str(child.before.decode())
return before

View File

@ -1186,12 +1186,7 @@ def test_shield_pause(
"('cancelled_before_pause'", # actor name
_repl_fail_msg,
"trio.Cancelled",
# trio >=0.30 raises via a multi-line
# `raise Cancelled._create(source=.., reason=..,
# source_task=..)` (cancel-reason metadata), so
# match the open-paren form only, NOT the legacy
# bare `()`.
"raise Cancelled._create(",
"raise Cancelled._create()",
# we should be handling a taskc inside
# the first `.port_mortem()` sin-shield!
@ -1209,12 +1204,7 @@ def test_shield_pause(
"('root'", # actor name
_repl_fail_msg,
"trio.Cancelled",
# trio >=0.30 raises via a multi-line
# `raise Cancelled._create(source=.., reason=..,
# source_task=..)` (cancel-reason metadata), so
# match the open-paren form only, NOT the legacy
# bare `()`.
"raise Cancelled._create(",
"raise Cancelled._create()",
# handling a taskc inside the first unshielded
# `.port_mortem()`.

View File

@ -8,9 +8,9 @@ after `Actor` construction, so any spawned sub-actor process
should:
- have `argv[0]` (== `/proc/<pid>/cmdline`) start with
`<_def_prefix>[<aid.reprol()>]` (currently `_subactor[]`)
- have `/proc/<pid>/comm` start with `<_def_prefix>[`
(kernel truncates to ~15 bytes)
`tractor[<aid.reprol()>]`
- have `/proc/<pid>/comm` start with `tractor[` (kernel
truncates to ~15 bytes)
- be detected as a tractor sub-actor by
`_is_tractor_subactor(pid)` via the cmdline marker.
@ -27,10 +27,7 @@ import trio
import tractor
from tractor.runtime._runtime import Actor
from tractor.devx._proctitle import (
set_actor_proctitle,
_def_prefix,
)
from tractor.devx._proctitle import set_actor_proctitle
from tractor._testing._reap import (
_is_tractor_subactor,
_read_cmdline,
@ -44,9 +41,8 @@ _non_linux: bool = platform.system() != 'Linux'
def test_set_actor_proctitle_format():
'''
`set_actor_proctitle()` returns the canonical
`<_def_prefix>[<aid.reprol()>]` form (currently
`_subactor[]`) and actually mutates the running
proc's title.
`tractor[<aid.reprol()>]` form and actually mutates
the running proc's title.
'''
pytest.importorskip(
@ -64,14 +60,12 @@ def test_set_actor_proctitle_format():
)
title: str = set_actor_proctitle(actor)
# canonical wrapping: `<_def_prefix>[<aid.reprol()>]`.
# We source BOTH the prefix (`_def_prefix`) and the
# runtime-computed `reprol()` rather than hard-coding,
# so the test stays decoupled from the prefix shape
# (flipped to `_subactor` in `3a45dbd5`) AND from
# `Aid.reprol()`'s internal format (currently
# `<name>@<pid>`, but could evolve).
expected: str = f'{_def_prefix}[{actor.aid.reprol()}]'
# canonical wrapping: `tractor[<aid.reprol()>]`. We
# compare against the runtime-computed `reprol()`
# rather than a hard-coded value so the test stays
# decoupled from `Aid.reprol()`'s internal format
# (currently `<name>@<pid>`, but could evolve).
expected: str = f'tractor[{actor.aid.reprol()}]'
assert title == expected
# sanity: the actor's name must be in the title
# somewhere (so a future `reprol()` change that
@ -146,17 +140,15 @@ def test_subactor_proctitle_visible_via_proc():
)
pid, info = matched[0]
# canonical proctitle prefix in cmdline (full form);
# prefix sourced from `_def_prefix` so it tracks the
# `3a45dbd5` flip (`tractor[` -> `_subactor[`).
assert info['cmdline'].startswith(f'{_def_prefix}[proctitle_boi@'), (
f'cmdline missing `{_def_prefix}[proctitle_boi@…]` prefix: '
# canonical proctitle prefix in cmdline (full form)
assert info['cmdline'].startswith('tractor[proctitle_boi@'), (
f'cmdline missing `tractor[proctitle_boi@…]` prefix: '
f'{info["cmdline"]!r}'
)
# comm is kernel-truncated to ~15 bytes — just check the
# `<_def_prefix>[` prefix made it.
assert info['comm'].startswith(f'{_def_prefix}['), (
f'comm missing `{_def_prefix}[` prefix: {info["comm"]!r}'
# `tractor[` prefix made it.
assert info['comm'].startswith('tractor['), (
f'comm missing `tractor[` prefix: {info["comm"]!r}'
)
# intrinsic-signal detector should match.
assert info['is_tractor'] is True

View File

@ -605,28 +605,17 @@ async def test_nested_multierrors(
)
)
# Per-backend/-depth budgets: in the non-hang case the
# whole spawn + cancel-cascade should complete in well
# under these. On the borderline hang case the
# `fail_after_w_trace` fires `TooSlowError` AND captures a
# ptree/wchan/py-spy snapshot to
# 6s budget: in the non-hang case (and on the trio
# backend) the whole spawn + cancel-cascade should
# complete in well under that. On the borderline hang
# case the `fail_after_w_trace` fires `TooSlowError`
# AND captures a ptree/wchan/py-spy snapshot to
# `$XDG_CACHE_HOME/tractor/hung-dumps/` for offline
# inspection. See
# `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
#
# NOTE: the `trio` depth=3 budget was bumped 6 -> 12s after
# the `trio` 0.29 -> 0.33 lock bump (commit c7741bba) slowed
# the depth-3 cancel-cascade from <6s to ~7-8s; the 6s
# deadline was firing and its `Cancelled(source='deadline')`
# (trio 0.33 cancel-reason metadata) collapsed a BEG branch,
# breaking the `RemoteActorError` assertion below. depth=1
# still finishes in ~3s so keeps the 6s budget. See
# `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
match (start_method, depth):
case ('trio', 1):
case ('trio', _):
timeout = 6
case ('trio', 3):
timeout = 12
case ('main_thread_forkserver', 1):
timeout = 16
case ('main_thread_forkserver', 3):

View File

@ -860,14 +860,7 @@ def test_echoserver_detailed_mechanics(
timeout: float = (
999 if tractor.debug_mode()
else 4 if is_forking_spawner
# was 1; the `trio` 0.29 -> 0.33 bump slowed the
# cancel-cascade so a 1s budget raced the ~1s teardown
# deadline. On a deadline-fire the injected
# `Cancelled(source='deadline')` wraps the mid-stream
# KBI in a `BaseExceptionGroup`, breaking the bare
# `pytest.raises(KeyboardInterrupt)` below. See
# `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
else 4
else 1
)
# body factored out so the `fail_after_w_trace`-wrapping

View File

@ -24,10 +24,7 @@ which" at a glance without needing to read full
`/proc/<pid>/cmdline`.
Format:
``<_def_prefix>[<aid.reprol()>]`` e.g. ``_subactor[doggy@1027301b]``
(prefix from the `_def_prefix` const, flipped `tractor` ->
`_subactor` so sub-actor procs are visually distinct from the
root in `ps`/`htop` and the reap-recognition markers.)
``tractor[<aid.reprol()>]`` e.g. ``tractor[doggy@1027301b]``
Uses the canonical `Aid.reprol()` form
(``<name>@<uuid_short>``) so the proc-title matches the