Commit Graph

2 Commits (4a3254583b1168382314fdad1b268c2498bbf729)

Author SHA1 Message Date
Gud Boi 4a3254583b Doc `subint` backend hang classes + arm `dump_on_hang`
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un- skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:47:49 -04:00
Gud Boi 2ed5e6a6e8 Add `subint` cancellation + hard-kill test audit
Lock in the escape-hatch machinery added to `tractor.spawn._subint`
during the Phase B.2/B.3 bringup (issue #379) so future stdlib
regressions or our own refactors don't silently re-introduce the
mid-suite hangs.

Deats,
- `test_subint_happy_teardown`: baseline — spawn a subactor, one portal
  RPC, clean teardown. If this breaks, something's wrong unrelated to
  the hard-kill shields.
- `test_subint_non_checkpointing_child`: cancel a subactor stuck in
  a non-checkpointing Python loop (`threading.Event.wait()` releases the
  GIL but never inserts a trio checkpoint). Validates the bounded-shield
  + daemon-driver-thread combo abandons the thread after
    `_HARD_KILL_TIMEOUT`.

Every test is wrapped in `trio.fail_after()` for a deterministic
per-test wall-clock ceiling (an unbounded audit would defeat itself) and
arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump
— pytest's stderr capture swallows `faulthandler` output by default.

Gated via `pytest.importorskip('concurrent.interpreters')` and
a module-level skip when `--spawn-backend` isn't `'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
2026-04-23 18:47:49 -04:00