Compare commits
4 Commits
3a45dbd503
...
f42dfe50a9
| Author | SHA1 | Date |
|---|---|---|
|
|
f42dfe50a9 | |
|
|
d7da502d93 | |
|
|
ea67f1b67b | |
|
|
3854cf5ecb |
|
|
@ -0,0 +1,102 @@
|
|||
# `trio` 0.29 -> 0.33 slows the depth=3 cancel-cascade
|
||||
|
||||
## Symptom
|
||||
|
||||
After locking to `trio==0.33.0` (commit `c7741bba`, was
|
||||
`0.29.0`), this test reliably trips its `fail_after`
|
||||
deadline on the **`trio`** backend:
|
||||
|
||||
```
|
||||
FAILED tests/test_cancellation.py::test_nested_multierrors[start_method=trio-depth=3]
|
||||
- AssertionError: assert False
|
||||
where False = isinstance(
|
||||
Cancelled(source='deadline', source_task=None, reason=None),
|
||||
tractor.RemoteActorError,
|
||||
)
|
||||
```
|
||||
|
||||
A `fail_after_w_trace` hang-snapshot is captured for the
|
||||
test each run (deadline-injected `Cancelled` wrapped into
|
||||
the actor-nursery `BaseExceptionGroup`).
|
||||
|
||||
## Root cause (immediate)
|
||||
|
||||
The test budgets `fail_after(6)` for the `trio` backend.
|
||||
That 6s was chosen (commit `32955db0`, while `trio==0.29`)
|
||||
with the assertion that trio finishes "well under" 6s.
|
||||
The `trio` 0.29 -> 0.33 bump slowed the depth=3 cascade
|
||||
past that budget, so the 6s deadline now fires mid-cascade.
|
||||
|
||||
trio 0.33 added **cancel-reason tracking** — every
|
||||
`Cancelled` now carries `(source=, reason=, source_task=)`.
|
||||
The injected exc is `Cancelled(source='deadline')`, i.e.
|
||||
trio itself naming our `fail_after(6)` scope as the cancel
|
||||
origin. When that `Cancelled` collapses one branch of the
|
||||
nursery BEG, the test's `isinstance(subexc,
|
||||
RemoteActorError)` assertion fails. The healthy outcome is
|
||||
`BEG = [RemoteActorError, RemoteActorError]`; the
|
||||
`Cancelled` is purely an artifact of the deadline cutting
|
||||
the cascade short.
|
||||
|
||||
## Measurements (standalone, this machine)
|
||||
|
||||
```
|
||||
depth=1 trio ~3.15s PASS (keeps 6s budget)
|
||||
depth=3 trio ~6.8-8.2s FAIL @ 6s (now bumped to 12s)
|
||||
```
|
||||
|
||||
depth=1 still fits comfortably; only depth=3 (deeper
|
||||
recursive spawn-and-error tree => more actors to reap)
|
||||
exceeds the old budget. The ~2s/depth-level cost looks
|
||||
like serialized per-actor reap / `terminate_after` waits.
|
||||
|
||||
## Mitigation applied
|
||||
|
||||
`test_nested_multierrors` now splits the `trio` budget:
|
||||
|
||||
```python
|
||||
case ('trio', 1):
|
||||
timeout = 6
|
||||
case ('trio', 3):
|
||||
timeout = 12 # was 6; see this doc
|
||||
```
|
||||
|
||||
This stops the deadline from firing so the cascade
|
||||
completes naturally to `[RAE, RAE]`.
|
||||
|
||||
## Also affected — same root cause, different test
|
||||
|
||||
`test_echoserver_detailed_mechanics[trio-raise_error=KeyboardInterrupt]`
|
||||
(`tests/test_infected_asyncio.py`) tripped the *same*
|
||||
slowdown via its much tighter `trio` budget of `1s`. The
|
||||
single-aio-subactor teardown now takes ~1s, so the `1s`
|
||||
`fail_after` raced the deadline (PASS at 0.99s / FAIL at
|
||||
1.03s across back-to-back standalone runs). On a deadline-
|
||||
fire the injected `Cancelled(source='deadline')` wraps the
|
||||
mid-stream `KeyboardInterrupt` into a `BaseExceptionGroup`,
|
||||
which is NOT a `KeyboardInterrupt` so the bare
|
||||
`pytest.raises(KeyboardInterrupt)` fails. (The sibling
|
||||
`raise_error=Exception` variant only "passes" by accident:
|
||||
an `ExceptionGroup` *is-a* `Exception`, so its
|
||||
`pytest.raises(Exception)` still matches even when wrapped.)
|
||||
|
||||
Mitigation: bump that `trio` budget `1 -> 4s` (matching the
|
||||
forking-spawner case). Without a deadline-fire the KBI
|
||||
propagates bare and the assertion passes.
|
||||
|
||||
## Open follow-up (the actual regression)
|
||||
|
||||
The budget bump is a band-aid — the underlying question is
|
||||
**why** the depth=3 `trio` cancel-cascade went from <6s to
|
||||
~7-8s across `trio` 0.29 -> 0.33. Candidate avenues:
|
||||
|
||||
- which scope owns the per-actor `terminate_after` wait,
|
||||
and are the tree's reaps concurrent or serialized?
|
||||
- did trio 0.33's abort/reschedule or cancel-reason
|
||||
bookkeeping change checkpoint timing on the cancel path?
|
||||
|
||||
If/when the cascade speeds back up under-budget, depth=3
|
||||
will start completing well under 12s — at which point the
|
||||
budget can be tightened back toward 6s as a regression
|
||||
tripwire. Related (different backend, same cascade class):
|
||||
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
|
||||
|
|
@ -5,6 +5,7 @@
|
|||
from __future__ import annotations
|
||||
import platform
|
||||
import os
|
||||
import re
|
||||
import signal
|
||||
import time
|
||||
from typing import (
|
||||
|
|
@ -294,6 +295,26 @@ def expect(
|
|||
PROMPT = r"\(Pdb\+\)"
|
||||
|
||||
|
||||
# Strip terminal color / ANSI-VT100 escape sequences so
|
||||
# substring matching against REPL + traceback output stays
|
||||
# robust to color leakage — Python 3.13's colored tracebacks,
|
||||
# `pdbp`'s pygments highlighting, etc. — even when
|
||||
# `PYTHON_COLORS=0` (set in the `spawn` fixture) isn't honored
|
||||
# by every renderer in the spawned subproc.
|
||||
# Regex per https://stackoverflow.com/a/14693789
|
||||
_ansi_re: re.Pattern = re.compile(
|
||||
r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])'
|
||||
)
|
||||
|
||||
|
||||
def ansi_strip(text: str) -> str:
|
||||
'''
|
||||
Remove ANSI/VT100 escape sequences from `text`.
|
||||
|
||||
'''
|
||||
return _ansi_re.sub('', text)
|
||||
|
||||
|
||||
def in_prompt_msg(
|
||||
child: SpawnBase,
|
||||
parts: list[str],
|
||||
|
|
@ -313,7 +334,7 @@ def in_prompt_msg(
|
|||
'''
|
||||
__tracebackhide__: bool = False
|
||||
|
||||
before: str = str(child.before.decode())
|
||||
before: str = ansi_strip(str(child.before.decode()))
|
||||
for part in parts:
|
||||
if part not in before:
|
||||
if pause_on_false:
|
||||
|
|
@ -333,9 +354,9 @@ def in_prompt_msg(
|
|||
return True
|
||||
|
||||
|
||||
# TODO: todo support terminal color-chars stripping so we can match
|
||||
# against call stack frame output from the the 'll' command the like!
|
||||
# -[ ] SO answer for stipping ANSI codes: https://stackoverflow.com/a/14693789
|
||||
# NB: color-char stripping (so we can match against call-stack
|
||||
# frame output from the `ll` command and the like) is handled by
|
||||
# `ansi_strip()` applied inside `in_prompt_msg()` + below.
|
||||
def assert_before(
|
||||
child: SpawnBase,
|
||||
patts: list[str],
|
||||
|
|
@ -356,7 +377,7 @@ def assert_before(
|
|||
err_on_false=True,
|
||||
**kwargs
|
||||
)
|
||||
before: str = str(child.before.decode())
|
||||
before: str = ansi_strip(str(child.before.decode()))
|
||||
return before
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -1186,7 +1186,12 @@ def test_shield_pause(
|
|||
"('cancelled_before_pause'", # actor name
|
||||
_repl_fail_msg,
|
||||
"trio.Cancelled",
|
||||
"raise Cancelled._create()",
|
||||
# trio >=0.30 raises via a multi-line
|
||||
# `raise Cancelled._create(source=.., reason=..,
|
||||
# source_task=..)` (cancel-reason metadata), so
|
||||
# match the open-paren form only, NOT the legacy
|
||||
# bare `()`.
|
||||
"raise Cancelled._create(",
|
||||
|
||||
# we should be handling a taskc inside
|
||||
# the first `.port_mortem()` sin-shield!
|
||||
|
|
@ -1204,7 +1209,12 @@ def test_shield_pause(
|
|||
"('root'", # actor name
|
||||
_repl_fail_msg,
|
||||
"trio.Cancelled",
|
||||
"raise Cancelled._create()",
|
||||
# trio >=0.30 raises via a multi-line
|
||||
# `raise Cancelled._create(source=.., reason=..,
|
||||
# source_task=..)` (cancel-reason metadata), so
|
||||
# match the open-paren form only, NOT the legacy
|
||||
# bare `()`.
|
||||
"raise Cancelled._create(",
|
||||
|
||||
# handling a taskc inside the first unshielded
|
||||
# `.port_mortem()`.
|
||||
|
|
|
|||
|
|
@ -8,9 +8,9 @@ after `Actor` construction, so any spawned sub-actor process
|
|||
should:
|
||||
|
||||
- have `argv[0]` (== `/proc/<pid>/cmdline`) start with
|
||||
`tractor[<aid.reprol()>]`
|
||||
- have `/proc/<pid>/comm` start with `tractor[` (kernel
|
||||
truncates to ~15 bytes)
|
||||
`<_def_prefix>[<aid.reprol()>]` (currently `_subactor[…]`)
|
||||
- have `/proc/<pid>/comm` start with `<_def_prefix>[`
|
||||
(kernel truncates to ~15 bytes)
|
||||
- be detected as a tractor sub-actor by
|
||||
`_is_tractor_subactor(pid)` via the cmdline marker.
|
||||
|
||||
|
|
@ -27,7 +27,10 @@ import trio
|
|||
import tractor
|
||||
|
||||
from tractor.runtime._runtime import Actor
|
||||
from tractor.devx._proctitle import set_actor_proctitle
|
||||
from tractor.devx._proctitle import (
|
||||
set_actor_proctitle,
|
||||
_def_prefix,
|
||||
)
|
||||
from tractor._testing._reap import (
|
||||
_is_tractor_subactor,
|
||||
_read_cmdline,
|
||||
|
|
@ -41,8 +44,9 @@ _non_linux: bool = platform.system() != 'Linux'
|
|||
def test_set_actor_proctitle_format():
|
||||
'''
|
||||
`set_actor_proctitle()` returns the canonical
|
||||
`tractor[<aid.reprol()>]` form and actually mutates
|
||||
the running proc's title.
|
||||
`<_def_prefix>[<aid.reprol()>]` form (currently
|
||||
`_subactor[…]`) and actually mutates the running
|
||||
proc's title.
|
||||
|
||||
'''
|
||||
pytest.importorskip(
|
||||
|
|
@ -60,12 +64,14 @@ def test_set_actor_proctitle_format():
|
|||
)
|
||||
title: str = set_actor_proctitle(actor)
|
||||
|
||||
# canonical wrapping: `tractor[<aid.reprol()>]`. We
|
||||
# compare against the runtime-computed `reprol()`
|
||||
# rather than a hard-coded value so the test stays
|
||||
# decoupled from `Aid.reprol()`'s internal format
|
||||
# (currently `<name>@<pid>`, but could evolve).
|
||||
expected: str = f'tractor[{actor.aid.reprol()}]'
|
||||
# canonical wrapping: `<_def_prefix>[<aid.reprol()>]`.
|
||||
# We source BOTH the prefix (`_def_prefix`) and the
|
||||
# runtime-computed `reprol()` rather than hard-coding,
|
||||
# so the test stays decoupled from the prefix shape
|
||||
# (flipped to `_subactor` in `3a45dbd5`) AND from
|
||||
# `Aid.reprol()`'s internal format (currently
|
||||
# `<name>@<pid>`, but could evolve).
|
||||
expected: str = f'{_def_prefix}[{actor.aid.reprol()}]'
|
||||
assert title == expected
|
||||
# sanity: the actor's name must be in the title
|
||||
# somewhere (so a future `reprol()` change that
|
||||
|
|
@ -140,15 +146,17 @@ def test_subactor_proctitle_visible_via_proc():
|
|||
)
|
||||
|
||||
pid, info = matched[0]
|
||||
# canonical proctitle prefix in cmdline (full form)
|
||||
assert info['cmdline'].startswith('tractor[proctitle_boi@'), (
|
||||
f'cmdline missing `tractor[proctitle_boi@…]` prefix: '
|
||||
# canonical proctitle prefix in cmdline (full form);
|
||||
# prefix sourced from `_def_prefix` so it tracks the
|
||||
# `3a45dbd5` flip (`tractor[` -> `_subactor[`).
|
||||
assert info['cmdline'].startswith(f'{_def_prefix}[proctitle_boi@'), (
|
||||
f'cmdline missing `{_def_prefix}[proctitle_boi@…]` prefix: '
|
||||
f'{info["cmdline"]!r}'
|
||||
)
|
||||
# comm is kernel-truncated to ~15 bytes — just check the
|
||||
# `tractor[` prefix made it.
|
||||
assert info['comm'].startswith('tractor['), (
|
||||
f'comm missing `tractor[` prefix: {info["comm"]!r}'
|
||||
# `<_def_prefix>[` prefix made it.
|
||||
assert info['comm'].startswith(f'{_def_prefix}['), (
|
||||
f'comm missing `{_def_prefix}[` prefix: {info["comm"]!r}'
|
||||
)
|
||||
# intrinsic-signal detector should match.
|
||||
assert info['is_tractor'] is True
|
||||
|
|
|
|||
|
|
@ -605,17 +605,28 @@ async def test_nested_multierrors(
|
|||
)
|
||||
)
|
||||
|
||||
# 6s budget: in the non-hang case (and on the trio
|
||||
# backend) the whole spawn + cancel-cascade should
|
||||
# complete in well under that. On the borderline hang
|
||||
# case the `fail_after_w_trace` fires `TooSlowError`
|
||||
# AND captures a ptree/wchan/py-spy snapshot to
|
||||
# Per-backend/-depth budgets: in the non-hang case the
|
||||
# whole spawn + cancel-cascade should complete in well
|
||||
# under these. On the borderline hang case the
|
||||
# `fail_after_w_trace` fires `TooSlowError` AND captures a
|
||||
# ptree/wchan/py-spy snapshot to
|
||||
# `$XDG_CACHE_HOME/tractor/hung-dumps/` for offline
|
||||
# inspection. See
|
||||
# `ai/conc-anal/cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`.
|
||||
#
|
||||
# NOTE: the `trio` depth=3 budget was bumped 6 -> 12s after
|
||||
# the `trio` 0.29 -> 0.33 lock bump (commit c7741bba) slowed
|
||||
# the depth-3 cancel-cascade from <6s to ~7-8s; the 6s
|
||||
# deadline was firing and its `Cancelled(source='deadline')`
|
||||
# (trio 0.33 cancel-reason metadata) collapsed a BEG branch,
|
||||
# breaking the `RemoteActorError` assertion below. depth=1
|
||||
# still finishes in ~3s so keeps the 6s budget. See
|
||||
# `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
|
||||
match (start_method, depth):
|
||||
case ('trio', _):
|
||||
case ('trio', 1):
|
||||
timeout = 6
|
||||
case ('trio', 3):
|
||||
timeout = 12
|
||||
case ('main_thread_forkserver', 1):
|
||||
timeout = 16
|
||||
case ('main_thread_forkserver', 3):
|
||||
|
|
|
|||
|
|
@ -860,7 +860,14 @@ def test_echoserver_detailed_mechanics(
|
|||
timeout: float = (
|
||||
999 if tractor.debug_mode()
|
||||
else 4 if is_forking_spawner
|
||||
else 1
|
||||
# was 1; the `trio` 0.29 -> 0.33 bump slowed the
|
||||
# cancel-cascade so a 1s budget raced the ~1s teardown
|
||||
# deadline. On a deadline-fire the injected
|
||||
# `Cancelled(source='deadline')` wraps the mid-stream
|
||||
# KBI in a `BaseExceptionGroup`, breaking the bare
|
||||
# `pytest.raises(KeyboardInterrupt)` below. See
|
||||
# `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
|
||||
else 4
|
||||
)
|
||||
|
||||
# body factored out so the `fail_after_w_trace`-wrapping
|
||||
|
|
|
|||
|
|
@ -24,7 +24,10 @@ which" at a glance without needing to read full
|
|||
`/proc/<pid>/cmdline`.
|
||||
|
||||
Format:
|
||||
``tractor[<aid.reprol()>]`` e.g. ``tractor[doggy@1027301b]``
|
||||
``<_def_prefix>[<aid.reprol()>]`` e.g. ``_subactor[doggy@1027301b]``
|
||||
(prefix from the `_def_prefix` const, flipped `tractor` ->
|
||||
`_subactor` so sub-actor procs are visually distinct from the
|
||||
root in `ps`/`htop` and the reap-recognition markers.)
|
||||
|
||||
Uses the canonical `Aid.reprol()` form
|
||||
(``<name>@<uuid_short>``) so the proc-title matches the
|
||||
|
|
|
|||
Loading…
Reference in New Issue