diff --git a/.claude/skills/piker-conc-expert/SKILL.md b/.claude/skills/piker-conc-expert/SKILL.md new file mode 100644 index 00000000..550087d0 --- /dev/null +++ b/.claude/skills/piker-conc-expert/SKILL.md @@ -0,0 +1,150 @@ +--- +name: piker-conc-expert +description: > + Distributed-runtime and structured-concurrency + expertise for piker's `tractor` actor-tree. Apply + when working on daemon/service architecture, actor + spawning/discovery, cross-actor RPC (ctx/stream + eps), `to_asyncio` integration, cancellation + semantics, or debugging hangs/wedges/skews in the + actor system. +user-invocable: false +--- + +# Piker Concurrency & Runtime Expertise + +The distilled mental model for piker's distributed +runtime: a `trio`-structured actor tree supervised by +`tractor` (pinned to git main) where every long-lived +subsystem is a named daemon-actor talking over +ctx/stream IPC. + +## Actor tree & daemon taxonomy + +``` +pikerd root supervisor + registry +├── datad. feed bus, shm writers, tsp +│ history, symbol search +├── brokerd. live order-ctl ONLY; lazily +│ spawned by emsd, credentialed +├── emsd dark-clearing + order routing +│ └── paperboi. sim-clearing (paper mode) +└── samplerd singleton OHLC clock/increment +``` + +Key invariants: +- `datad` hosts all `piker.data.validate._eps['datad']` + eps; `brokerd` only the `['brokerd']` (order-ctl) + ones. The `_eps` table in `piker/data/validate.py` + is the authoritative contract; `get_eps(mod, kind)` + introspects a backend's support. +- `brokerd.` is booted in EXACTLY one place: + `open_brokerd_dialog()` in `piker/clearing/_ems.py` + (with a `portal:` override for the `piker ledger` + ad-hoc actor). Chart-only + paper sessions run with + ZERO brokerd procs. Never add a data-path spawn! +- backends declare per-daemon-kind submods via + `_datad_mods`/`_brokerd_mods` in their + `__init__.py` (fallback: `__enable_modules__`). + +## Daemon lifecycle conventions + +Every daemon-kind follows the same trio of fns (see +`piker/brokers/_daemon.py` + `piker/data/_daemon.py` +as the canonical pair): + +- `_setup_persistent_()`: a `@tractor.context` + "lifetime fixture" run via + `Services.start_service_task()`; does console-log + setup ONCE for the actor, allocs any actor-global + state (eg. datad's `_FeedsBus`), then + `await ctx.started()` + `trio.sleep_forever()`. +- `_init()`: builds `enable_modules` + actor + name `f'.{brokername}'` and copies backend + `_spawn_kwargs` (CRITICAL: `ib` needs + `infect_asyncio=True` in EVERY daemon-kind). +- `spawn_()` + `maybe_spawn_()`: thin + wrappers over `Services.actor_n.start_actor()` and + `piker.service.maybe_spawn_daemon()` (registry + find-or-spawn w/ per-service-name locking). + +Caps-sec model: `enable_modules` gates RPC entry ONLY +— python imports are unrestricted in-proc. Keep each +daemon's enable set minimal; the (credentialed) +`brokerd` must never RPC-enable `piker.data.*` feed +mods. + +## Actor-local state: the #1 split hazard + +Module-globals and instance caches are PER-ACTOR. +Anything that "just worked" because two subsystems +shared a process will break when they're split into +sibling actors. Canonical example: `ib`'s +`Client._contracts` was warmed by feed-side +`get_mkt_info()` in-proc; post datad/brokerd-split +the trading actor must warm it itself (eagerly at +`open_trade_dialog()` startup for open pps/orders + +lazily per order request via +`symbols.cache_contract()`). + +When moving code across actor boundaries ALWAYS audit: +- module-global registries (`feed._bus`, + `_accounts2clients`, `_client_cache`, ..) +- `@async_lifo_cache`/`maybe_open_context` caches + (NOTE: `async_lifo_cache` keys on POSITIONAL args + only; a cache-hit SKIPS the fn body and thus any + side-effect writes!) +- logging handler placement (see gotchas.md) + +## tractor primitives as used here + +- `@tractor.context` eps: `await ctx.started(val)` + unblocks the caller w/ `val`; long-lived eps then + `ctx.open_stream()` or `sleep_forever()`. +- discovery: `tractor.find_actor()` via + `piker.service.find_service()`; + `wait_for_actor(name, registry_addr=...)`; + `query_actor(name, regaddr=...)` yields + `(sockaddr, portal)`. Addrs are wrapped + `tractor.discovery._addr.Address` types — use + `wrap_address()` to normalize raw tuples and + `.unwrap()` for comparisons. +- runtime-vars: `_runtime_vars['piker_vars']` is + inherited down the spawn tree; used eg. for + `piker_test_dir` config isolation — read LAZILY at + use-time, never at import time (subactors only get + vars post runtime-boot). +- cancellation semantics (modern tractor): a + `ContextCancelled` whose `.canceller` is your own + actor is ABSORBED (clean exit, nothing raised); + single-exc groups collapse (`collapse_eg`) so eg. + a KBI propagates bare. Exc attrs: + `RemoteActorError.boxed_type` (not `.type`). + +## `to_asyncio` (infect-asyncio) integration + +For `ib` (and `deribit`) the backend client runs on +an embedded `asyncio` loop via +`tractor.to_asyncio.open_channel_from()` + +`LinkedTaskChannel`. + +Rules learned the hard way: +- a shared req/resp channel MUST correlate responses + to requests (see `MethodProxy._run_method()`'s + `mid` protocol in `piker/brokers/ib/api.py`): + caller cancellation (eg. `move_on_after` timeouts) + otherwise orphans a response and silently skews + every later result off-by-one. +- the aio-side relay must catch + ship back ALL + (non-cancel) exceptions as `{'exception': err}` + resps; an escaping error kills the relay task -> + channel -> proxy nursery -> the whole dialog, + bypassing every caller-side guard. +- `TrioTaskExited` ("child asyncio task is still + running?") on teardown is a known wart family; + prefer upstream `tractor` fixes over piker-side + bandaids. + +See [gotchas.md](gotchas.md) for the symptom->cause +registry and [debug-recipes.md](debug-recipes.md) for +forensics techniques. diff --git a/.claude/skills/piker-conc-expert/debug-recipes.md b/.claude/skills/piker-conc-expert/debug-recipes.md new file mode 100644 index 00000000..6ef61517 --- /dev/null +++ b/.claude/skills/piker-conc-expert/debug-recipes.md @@ -0,0 +1,100 @@ +# Debug recipes: actor-system forensics + +Field-tested techniques for diagnosing hangs, wedges +and cross-actor state bugs WITHOUT a debugger attached +(or when `py-spy` ain't installed). + +## Wedged actor triage (no REPL) + +1. find the tree: + `ps -eo pid,etime,args | grep -E 'pytest|tractor._child'` + — long-`etime` `tractor._child` procs w/ a stuck + parent = wedge. +2. kernel state: + `cat /proc//wchan` + `status | grep -E + 'State|Threads'` — `do_epoll_wait` + sleeping = + idle event loop, NOT cpu-spin. +3. **the money read** — socket queues: + `ss -tnp | grep ` + - `Recv-Q > 0` on the parent-IPC conn = the actor + STOPPED CONSUMING its msg loop (runtime bug), + parent is waiting on it. + - zero external (api/ws) conns = wedged before/ + without provider IO; don't blame the network. + - `CLOSE-WAIT` lingerers = unclean peer teardown. +4. cleanup: `pkill -f tractor._child` (NB: in + compound shell cmds `pkill`'s exit code poisons + `&&` chains — run it standalone). + +## Hang-proof test gating + +- per-suite, never combined (cross-suite session + state interacts w/ the 2nd-boot wedge): + `timeout -k 5 300 python -m pytest tests/.py -q` +- rc 124/143 = hang-kill -> retry ONCE before + investigating. +- isolate a flaky test w/ a 3x loop; ~50% hit-rate + signatures match the known 2nd-boot wedge (see + gotchas.md). + +## Regression vs pre-existing attribution + +When a failure appears mid-refactor: +1. `git stash -u` (or checkout the file subset) and + re-run the EXACT failing case at baseline. +2. if baseline can't even run, selectively revert + ONLY the suspect layer: + `git diff > /tmp/x.patch; + git checkout ` -> test -> + `git apply /tmp/x.patch`. +3. flake-rate compare (3x runs) beats single-shot + conclusions. + +## Off-by-one / stale IPC resp detection + +Mismatched query->result content in logs (resp +payload obviously for a prior request) = shared +req/resp channel w/o correlation + a cancelled +caller. Grep the ep for `move_on_after`/`fail_after` +around proxied calls. Fix = req-id (`mid`) tagging, +never "just a lock" (cancellation still orphans). + +## Logging-chain audits + +When records double-print or go bare (see gotchas.md): + +```python +import logging +l = logging.getLogger('piker.brokers.ib.broker') +while l: + print(l.name, l.level, l.handlers, l.propagate) + l = l.parent +``` + +Exactly ONE stderr handler should exist in the chain, +attached by the actor's daemon fixture. + +## Live actor-tree smoke (headless) + +Boot against an ALT registry port so a user's running +stack is untouched; script in a REAL file (tractor +children re-exec `__main__` from path — stdin scripts +crash w/ `FileNotFoundError: .../`): + +```python +async with maybe_open_pikerd( + registry_addrs=[('127.0.0.1', 6979)], +): + async with open_feed(['xbtusdt.kraken']) as feed: + assert await check_for_service('datad.kraken') + assert not await check_for_service( + 'brokerd.kraken' + ) +``` + +## In-proc fail-fast unit checks + +Spawn-path guards that raise BEFORE touching the +runtime can be tested w/ a bare `trio.run()` (eg +`spawn_brokerd('kucoin')` raising the datad-only +error) — no pikerd needed. diff --git a/.claude/skills/piker-conc-expert/gotchas.md b/.claude/skills/piker-conc-expert/gotchas.md new file mode 100644 index 00000000..d1519c97 --- /dev/null +++ b/.claude/skills/piker-conc-expert/gotchas.md @@ -0,0 +1,116 @@ +# Known gotchas: symptom -> cause -> fix + +A registry of distributed-runtime failure modes hit +(and diagnosed) in the field; check here FIRST when a +log/traceback matches. + +## "Can not order ..., no qualified contract cached" + +- **Symptom**: `RuntimeError` from + `ib.api.Client.submit_limit()` w/ empty + `Client._contracts` in `brokerd.ib`. +- **Cause**: per-actor cache never warmed; feed-side + qualification now lives in `datad.ib`. +- **Fix(ed)**: eager warmup at `open_trade_dialog()` + start + lazy per-order `get_mkt_info()` + + `cache_contract()` (writes BOTH `mkt.bs_fqme` and + `mkt.fqme` keys; different consumers read each!). + +## Search returns results for the WRONG pattern + +- **Symptom**: fqme search for 'gld' returns nvda + results; next query returns the prior query's set. +- **Cause**: `MethodProxy` channel off-by-one — a + caller cancelled (search `move_on_after` timeout) + after sending its request orphans the response; + every later caller consumes the previous resp. +- **Fix(ed)**: `mid` req-id correlation in + `_run_method()` + relay; stale resps are dropped w/ + a "Dropping stale method-resp" warning. If that + warning spams, some caller is being cancelled + mid-call habitually — find + fix its timeout. + +## One bad request crashes a whole dialog/actor + +- **Symptom**: `TrioTaskExited` storm + nursery + teardown after a single method error (eg ambiguous + contract `AttributeError`). +- **Cause**: exception escaped the aio-side relay + loop (`open_aio_client_method_relay()`) killing + channel + proxy nursery; caller-side `try/except` + CANNOT catch it. +- **Fix(ed)**: relay catches `Exception` -> ships + `{'exception': err, 'mid': ...}` resp; order + handler converts to EMS `BrokerdError` msgs. + +## Ambiguous ib contracts -> `NoneType` attr errors + +- **Symptom**: `'NoneType' object has no attribute + 'primaryExchange'` in `find_contracts()`. +- **Cause**: `qualifyContractsAsync()` returns `None` + entries for ambiguous (eg venue-less stonk fqme + matching multiple listings: 'gld' -> ARCA/USD + + VENTURE/CAD). +- **Fix(ed)**: filter `None`s + raise descriptive + `ValueError` ("use 'gld.arca.ib'"). + +## Double-printed log records (same task id, 2x) + +- **Symptom**: every record from some subsys printed + twice w/ identical task ids. +- **Cause**: stderr handlers attached at TWO levels + of one logger-propagation chain (eg daemon fixture + on `piker.brokers.ib` + an ep calling + `get_console_log(name=__name__)` on the child). + tractor's handler-dedup only checks the SAME + logger, not ancestors. +- **Rule**: console handlers are attached ONCE per + actor in the `_setup_persistent_*()` fixture; eps + needing a different level use `log.setLevel()` + ONLY, never `get_console_log()`. + +## Bare/non-colorized log lines + +- **Symptom**: records w/ no timestamp/actor prefix. +- **Cause**: NO handler anywhere in the emitting + logger's chain -> stdlib `logging.lastResort`. Post + actor-splits, a daemon fixture may only cover its + own subsys subtree (eg datad's `piker.data.*` but + not the backend's `piker.brokers..*`). +- **Fix(ed)**: `_setup_persistent_datad()` enables + BOTH `piker.data.` and + `piker.brokers.` subtrees. + +## 2nd in-proc runtime boot wedges (~50%) + +- **Symptom**: test hangs when one test proc boots a + 2nd `pikerd` (eg `test_multi_fill_positions`'s + persistence re-check); a zombie `*.{broker}` child + lingers w/ unread bytes in its parent-IPC Recv-Q. +- **Cause**: pre-existing `tractor`-main runtime + teardown bug (confirmed independent of piker-layer + changes via revert-testing 2026-06). +- **Mitigation**: run suites per-file wrapped in + `timeout -k 5 300 ...`; retry once on rc 124/143. + Do NOT chase as a regression of unrelated changes. + +## ib client-id collisions post-split + +- **Symptom**: 2nd ib daemon burns the full + conn-timeout retry cycle connecting to gw/tws. +- **Cause**: `datad.ib` + `brokerd.ib` both default + `client_id=6116` w/ linear `+i` retries. +- **Fix(ed)**: role-based offsets in + `load_aio_clients()`: datad +16, ad-hoc (test/cli) + actors +32. + +## `async_lifo_cache` skipped side-effects + +- **Symptom**: a fn's cache-write side effect + (eg `get_mkt_info()` -> `_contracts`) missing for + a 2nd client/proxy. +- **Cause**: cache keys on POSITIONAL args only; a + hit skips the body entirely. +- **Rule**: never rely on cached-fn side effects; + perform required writes explicitly at the call + site (eg `cache_contract()` after `get_mkt_info`).