piker/.claude/skills/piker-conc-expert/SKILL.md

151 lines
5.6 KiB
Markdown
Raw Normal View History

---
name: piker-conc-expert
description: >
Distributed-runtime and structured-concurrency
expertise for piker's `tractor` actor-tree. Apply
when working on daemon/service architecture, actor
spawning/discovery, cross-actor RPC (ctx/stream
eps), `to_asyncio` integration, cancellation
semantics, or debugging hangs/wedges/skews in the
actor system.
user-invocable: false
---
# Piker Concurrency & Runtime Expertise
The distilled mental model for piker's distributed
runtime: a `trio`-structured actor tree supervised by
`tractor` (pinned to git main) where every long-lived
subsystem is a named daemon-actor talking over
ctx/stream IPC.
## Actor tree & daemon taxonomy
```
pikerd root supervisor + registry
├── datad.<broker> feed bus, shm writers, tsp
│ history, symbol search
├── brokerd.<broker> live order-ctl ONLY; lazily
│ spawned by emsd, credentialed
├── emsd dark-clearing + order routing
│ └── paperboi.<broker> sim-clearing (paper mode)
└── samplerd singleton OHLC clock/increment
```
Key invariants:
- `datad` hosts all `piker.data.validate._eps['datad']`
eps; `brokerd` only the `['brokerd']` (order-ctl)
ones. The `_eps` table in `piker/data/validate.py`
is the authoritative contract; `get_eps(mod, kind)`
introspects a backend's support.
- `brokerd.<broker>` is booted in EXACTLY one place:
`open_brokerd_dialog()` in `piker/clearing/_ems.py`
(with a `portal:` override for the `piker ledger`
ad-hoc actor). Chart-only + paper sessions run with
ZERO brokerd procs. Never add a data-path spawn!
- backends declare per-daemon-kind submods via
`_datad_mods`/`_brokerd_mods` in their
`__init__.py` (fallback: `__enable_modules__`).
## Daemon lifecycle conventions
Every daemon-kind follows the same trio of fns (see
`piker/brokers/_daemon.py` + `piker/data/_daemon.py`
as the canonical pair):
- `_setup_persistent_<kind>()`: a `@tractor.context`
"lifetime fixture" run via
`Services.start_service_task()`; does console-log
setup ONCE for the actor, allocs any actor-global
state (eg. datad's `_FeedsBus`), then
`await ctx.started()` + `trio.sleep_forever()`.
- `<kind>_init()`: builds `enable_modules` + actor
name `f'<kind>.{brokername}'` and copies backend
`_spawn_kwargs` (CRITICAL: `ib` needs
`infect_asyncio=True` in EVERY daemon-kind).
- `spawn_<kind>()` + `maybe_spawn_<kind>()`: thin
wrappers over `Services.actor_n.start_actor()` and
`piker.service.maybe_spawn_daemon()` (registry
find-or-spawn w/ per-service-name locking).
Caps-sec model: `enable_modules` gates RPC entry ONLY
— python imports are unrestricted in-proc. Keep each
daemon's enable set minimal; the (credentialed)
`brokerd` must never RPC-enable `piker.data.*` feed
mods.
## Actor-local state: the #1 split hazard
Module-globals and instance caches are PER-ACTOR.
Anything that "just worked" because two subsystems
shared a process will break when they're split into
sibling actors. Canonical example: `ib`'s
`Client._contracts` was warmed by feed-side
`get_mkt_info()` in-proc; post datad/brokerd-split
the trading actor must warm it itself (eagerly at
`open_trade_dialog()` startup for open pps/orders +
lazily per order request via
`symbols.cache_contract()`).
When moving code across actor boundaries ALWAYS audit:
- module-global registries (`feed._bus`,
`_accounts2clients`, `_client_cache`, ..)
- `@async_lifo_cache`/`maybe_open_context` caches
(NOTE: `async_lifo_cache` keys on POSITIONAL args
only; a cache-hit SKIPS the fn body and thus any
side-effect writes!)
- logging handler placement (see gotchas.md)
## tractor primitives as used here
- `@tractor.context` eps: `await ctx.started(val)`
unblocks the caller w/ `val`; long-lived eps then
`ctx.open_stream()` or `sleep_forever()`.
- discovery: `tractor.find_actor()` via
`piker.service.find_service()`;
`wait_for_actor(name, registry_addr=...)`;
`query_actor(name, regaddr=...)` yields
`(sockaddr, portal)`. Addrs are wrapped
`tractor.discovery._addr.Address` types — use
`wrap_address()` to normalize raw tuples and
`.unwrap()` for comparisons.
- runtime-vars: `_runtime_vars['piker_vars']` is
inherited down the spawn tree; used eg. for
`piker_test_dir` config isolation — read LAZILY at
use-time, never at import time (subactors only get
vars post runtime-boot).
- cancellation semantics (modern tractor): a
`ContextCancelled` whose `.canceller` is your own
actor is ABSORBED (clean exit, nothing raised);
single-exc groups collapse (`collapse_eg`) so eg.
a KBI propagates bare. Exc attrs:
`RemoteActorError.boxed_type` (not `.type`).
## `to_asyncio` (infect-asyncio) integration
For `ib` (and `deribit`) the backend client runs on
an embedded `asyncio` loop via
`tractor.to_asyncio.open_channel_from()` +
`LinkedTaskChannel`.
Rules learned the hard way:
- a shared req/resp channel MUST correlate responses
to requests (see `MethodProxy._run_method()`'s
`mid` protocol in `piker/brokers/ib/api.py`):
caller cancellation (eg. `move_on_after` timeouts)
otherwise orphans a response and silently skews
every later result off-by-one.
- the aio-side relay must catch + ship back ALL
(non-cancel) exceptions as `{'exception': err}`
resps; an escaping error kills the relay task ->
channel -> proxy nursery -> the whole dialog,
bypassing every caller-side guard.
- `TrioTaskExited` ("child asyncio task is still
running?") on teardown is a known wart family;
prefer upstream `tractor` fixes over piker-side
bandaids.
See [gotchas.md](gotchas.md) for the symptom->cause
registry and [debug-recipes.md](debug-recipes.md) for
forensics techniques.