Add `piker-conc-expert` claude-code skill

Distilled distributed-runtime + structured-concurrency expertise for
the `tractor` actor-tree, auto-applied (not user-invocable) when
working on daemon/service arch, RPC eps, `to_asyncio` integration,
cancellation semantics or hang/wedge/skew debugging.

Deats,
- `SKILL.md`: the core mental model incl. the post-split actor-tree
  taxonomy (`datad`/`brokerd`/`emsd`/etc.), daemon lifecycle
  conventions, actor-local-state hazards and `tractor` primitive
  usage as deployed here.
- `gotchas.md`: symptom -> cause -> fix entries distilled from this
  branch's (datad|brokerd)-split debugging (eg. un-warmed contract
  caches, stale IPC resps, double/bare log records, ib client-id
  collisions).
- `debug-recipes.md`: actor-system forensics incl. wedged actor
  triage, hang-proof test gating and regression vs pre-existing
  attribution.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

datad_service
parent 0404e4230e
commit b73300c820

@ -0,0 +1,150 @@
---
name: piker-conc-expert
description: >
  Distributed-runtime and structured-concurrency
  expertise for piker's `tractor` actor-tree. Apply
  when working on daemon/service architecture, actor
  spawning/discovery, cross-actor RPC (ctx/stream
  eps), `to_asyncio` integration, cancellation
  semantics, or debugging hangs/wedges/skews in the
  actor system.
user-invocable: false
---

# Piker Concurrency & Runtime Expertise

The distilled mental model for piker's distributed
runtime: a `trio`-structured actor tree supervised by
`tractor` (pinned to git main) where every long-lived
subsystem is a named daemon-actor talking over
ctx/stream IPC.

## Actor tree & daemon taxonomy

```
pikerd                     root supervisor + registry
├── datad.<broker>     feed bus, shm writers, tsp
│                      history, symbol search
├── brokerd.<broker>   live order-ctl ONLY; lazily
│                      spawned by emsd, credentialed
├── emsd               dark-clearing + order routing
│   └── paperboi.<broker>  sim-clearing (paper mode)
└── samplerd           singleton OHLC clock/increment
```

Key invariants:
- `datad` hosts all `piker.data.validate._eps['datad']`
  eps; `brokerd` only the `['brokerd']` (order-ctl)
  ones. The `_eps` table in `piker/data/validate.py`
  is the authoritative contract; `get_eps(mod, kind)`
  introspects a backend's support.
- `brokerd.<broker>` is booted in EXACTLY one place:
  `open_brokerd_dialog()` in `piker/clearing/_ems.py`
  (with a `portal:` override for the `piker ledger`
  ad-hoc actor). Chart-only + paper sessions run with
  ZERO brokerd procs. Never add a data-path spawn!
- backends declare per-daemon-kind submods via
  `_datad_mods`/`_brokerd_mods` in their
  `__init__.py` (fallback: `__enable_modules__`).

## Daemon lifecycle conventions

Every daemon-kind follows the same trio of fns (see
`piker/brokers/_daemon.py` + `piker/data/_daemon.py`
as the canonical pair):

- `_setup_persistent_<kind>()`: a `@tractor.context`
  "lifetime fixture" run via
  `Services.start_service_task()`; does console-log
  setup ONCE for the actor, allocs any actor-global
  state (eg. datad's `_FeedsBus`), then
  `await ctx.started()` + `trio.sleep_forever()`.
- `<kind>_init()`: builds `enable_modules` + actor
  name `f'<kind>.{brokername}'` and copies backend
  `_spawn_kwargs` (CRITICAL: `ib` needs
  `infect_asyncio=True` in EVERY daemon-kind).
- `spawn_<kind>()` + `maybe_spawn_<kind>()`: thin
  wrappers over `Services.actor_n.start_actor()` and
  `piker.service.maybe_spawn_daemon()` (registry
  find-or-spawn w/ per-service-name locking).

Caps-sec model: `enable_modules` gates RPC entry ONLY;
python imports are unrestricted in-proc. Keep each
daemon's enable set minimal; the (credentialed)
`brokerd` must never RPC-enable `piker.data.*` feed
mods.

## Actor-local state: the #1 split hazard

Module-globals and instance caches are PER-ACTOR.
Anything that "just worked" because two subsystems
shared a process will break when they're split into
sibling actors. Canonical example: `ib`'s
`Client._contracts` was warmed by feed-side
`get_mkt_info()` in-proc; post datad/brokerd-split
the trading actor must warm it itself (eagerly at
`open_trade_dialog()` startup for open pps/orders +
lazily per order request via
`symbols.cache_contract()`).

When moving code across actor boundaries ALWAYS audit:
- module-global registries (`feed._bus`,
  `_accounts2clients`, `_client_cache`, ..)
- `@async_lifo_cache`/`maybe_open_context` caches
  (NOTE: `async_lifo_cache` keys on POSITIONAL args
  only; a cache-hit SKIPS the fn body and thus any
  side-effect writes!)
- logging handler placement (see gotchas.md)

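A minimal stdlib sketch of the cache hazard in the audit list above.
`positional_only_cache` is an illustrative stand-in written for this
example, NOT piker's actual `async_lifo_cache`, and `get_info()` plus
the two `contracts` dicts are hypothetical; only the keying behaviour
(positional args only, body skipped on a hit) comes from the note
above.

```python
import asyncio
from functools import wraps


def positional_only_cache(fn):
    """Memoize an async fn on its POSITIONAL args only (illustrative)."""
    cache = {}

    @wraps(fn)
    async def wrapper(*args, **kwargs):
        if args in cache:
            # Cache hit: the fn body, and any side effect in it, is skipped.
            return cache[args]
        result = cache[args] = await fn(*args, **kwargs)
        return result

    return wrapper


@positional_only_cache
async def get_info(fqme, *, contracts):
    # Side-effect write into a per-client table (hypothetical).
    contracts[fqme] = f'contract:{fqme}'
    return {'fqme': fqme}


async def main():
    client_a, client_b = {}, {}
    await get_info('gld.arca.ib', contracts=client_a)
    # Same positional key, different keyword arg: still a cache hit.
    await get_info('gld.arca.ib', contracts=client_b)
    return client_a, client_b


client_a, client_b = asyncio.run(main())
print(client_a, client_b)
```

The second client's table stays empty because the call never reaches
the body, which is why required writes belong at the call site.
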
## tractor primitives as used here

- `@tractor.context` eps: `await ctx.started(val)`
  unblocks the caller w/ `val`; long-lived eps then
  `ctx.open_stream()` or `sleep_forever()`.
- discovery: `tractor.find_actor()` via
  `piker.service.find_service()`;
  `wait_for_actor(name, registry_addr=...)`;
  `query_actor(name, regaddr=...)` yields
  `(sockaddr, portal)`. Addrs are wrapped
  `tractor.discovery._addr.Address` types; use
  `wrap_address()` to normalize raw tuples and
  `.unwrap()` for comparisons.
- runtime-vars: `_runtime_vars['piker_vars']` is
  inherited down the spawn tree; used eg. for
  `piker_test_dir` config isolation; read LAZILY at
  use-time, never at import time (subactors only get
  vars post runtime-boot).
- cancellation semantics (modern tractor): a
  `ContextCancelled` whose `.canceller` is your own
  actor is ABSORBED (clean exit, nothing raised);
  single-exc groups collapse (`collapse_eg`) so eg.
  a KBI propagates bare. Exc attrs:
  `RemoteActorError.boxed_type` (not `.type`).

## `to_asyncio` (infect-asyncio) integration

For `ib` (and `deribit`) the backend client runs on
an embedded `asyncio` loop via
`tractor.to_asyncio.open_channel_from()` +
`LinkedTaskChannel`.

Rules learned the hard way:
- a shared req/resp channel MUST correlate responses
  to requests (see `MethodProxy._run_method()`'s
  `mid` protocol in `piker/brokers/ib/api.py`):
  caller cancellation (eg. `move_on_after` timeouts)
  otherwise orphans a response and silently skews
  every later result off-by-one.
- the aio-side relay must catch + ship back ALL
  (non-cancel) exceptions as `{'exception': err}`
  resps; an escaping error kills the relay task ->
  channel -> proxy nursery -> the whole dialog,
  bypassing every caller-side guard.
- `TrioTaskExited` ("child asyncio task is still
  running?") on teardown is a known wart family;
  prefer upstream `tractor` fixes over piker-side
  bandaids.

See [gotchas.md](gotchas.md) for the symptom->cause
registry and [debug-recipes.md](debug-recipes.md) for
forensics techniques.

@ -0,0 +1,100 @@
# Debug recipes: actor-system forensics

Field-tested techniques for diagnosing hangs, wedges
and cross-actor state bugs WITHOUT a debugger attached
(or when `py-spy` ain't installed).

## Wedged actor triage (no REPL)

1. find the tree:
   `ps -eo pid,etime,args | grep -E 'pytest|tractor._child'`;
   long-`etime` `tractor._child` procs w/ a stuck
   parent = wedge.
2. kernel state:
   `cat /proc/<pid>/wchan` + `status | grep -E
   'State|Threads'`; `do_epoll_wait` + sleeping =
   idle event loop, NOT cpu-spin.
3. **the money read**, socket queues:
   `ss -tnp | grep <pid>`
   - `Recv-Q > 0` on the parent-IPC conn = the actor
     STOPPED CONSUMING its msg loop (runtime bug),
     parent is waiting on it.
   - zero external (api/ws) conns = wedged before/
     without provider IO; don't blame the network.
   - `CLOSE-WAIT` lingerers = unclean peer teardown.
4. cleanup: `pkill -f tractor._child` (NB: in
   compound shell cmds `pkill`'s exit code poisons
   `&&` chains; run it standalone).

## Hang-proof test gating

- per-suite, never combined (cross-suite session
  state interacts w/ the 2nd-boot wedge):
  `timeout -k 5 300 python -m pytest tests/<one>.py -q`
- rc 124/143 = hang-kill -> retry ONCE before
  investigating.
- isolate a flaky test w/ a 3x loop; ~50% hit-rate
  signatures match the known 2nd-boot wedge (see
  gotchas.md).

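The gating rules above could be wrapped in a small driver, sketched
here under assumptions: `run_gated()` and its parameters are
hypothetical names, the test path is a placeholder, and the `runner`
hook exists only so the retry logic can be exercised without actually
launching `pytest`.

```python
import subprocess
import sys
from types import SimpleNamespace

HANG_RCS = {124, 143}  # exit codes treated as a hang-kill, per the list above


def run_gated(test_file, runner=subprocess.run, limit_s=300, kill_after_s=5):
    """Run ONE suite under `timeout`, retrying once on a hang-kill rc."""
    cmd = [
        'timeout', '-k', str(kill_after_s), str(limit_s),
        sys.executable, '-m', 'pytest', test_file, '-q',
    ]
    rcs = []
    for _ in range(2):  # the first run plus at most one retry
        rc = runner(cmd).returncode
        rcs.append(rc)
        if rc not in HANG_RCS:
            break  # pass or real failure: stop and investigate normally
    return rcs


# Dry run with a fake runner: a hang-kill first, then a clean pass.
fake_rcs = iter([124, 0])
rcs = run_gated(
    'tests/test_example.py',  # hypothetical path
    runner=lambda cmd: SimpleNamespace(returncode=next(fake_rcs)),
)
print(rcs)
```

Returning the full list of exit codes keeps the "hung once, then
passed" signature visible instead of collapsing it into a single
pass/fail.
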
## Regression vs pre-existing attribution

When a failure appears mid-refactor:
1. `git stash -u` (or checkout the file subset) and
   re-run the EXACT failing case at baseline.
2. if baseline can't even run, selectively revert
   ONLY the suspect layer:
   `git diff <files> > /tmp/x.patch;
   git checkout <files>` -> test ->
   `git apply /tmp/x.patch`.
3. flake-rate compare (3x runs) beats single-shot
   conclusions.

## Off-by-one / stale IPC resp detection

Mismatched query->result content in logs (resp
payload obviously for a prior request) = shared
req/resp channel w/o correlation + a cancelled
caller. Grep the ep for `move_on_after`/`fail_after`
around proxied calls. Fix = req-id (`mid`) tagging,
never "just a lock" (cancellation still orphans).

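The skew described above can be reproduced with plain `asyncio`
queues. This is a sketch under assumptions, not piker's `MethodProxy`:
`responder()`, `call()` and `demo()` are hypothetical names, and the
queues stand in for the shared req/resp channel. It shows a timed-out
caller orphaning its response, and `mid` tagging letting the next
caller drop it.

```python
import asyncio
import itertools

_mids = itertools.count()  # monotonically increasing request ids


async def responder(req_q, resp_q):
    # Stand-in for the far side: answers every request after a delay.
    while True:
        req = await req_q.get()
        await asyncio.sleep(0.05)
        await resp_q.put({'mid': req['mid'], 'result': req['query'].upper()})


async def call(req_q, resp_q, query, correlate):
    mid = next(_mids)
    await req_q.put({'mid': mid, 'query': query})
    while True:
        resp = await resp_q.get()
        if correlate and resp['mid'] != mid:
            continue  # stale resp left behind by a cancelled caller: drop it
        return resp['result']


async def demo(correlate):
    req_q, resp_q = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(responder(req_q, resp_q))
    try:
        # 1st caller times out AFTER sending, so its response is orphaned.
        try:
            await asyncio.wait_for(
                call(req_q, resp_q, 'nvda', correlate), timeout=0.01,
            )
        except asyncio.TimeoutError:
            pass
        # 2nd caller: does it receive its own answer?
        return await call(req_q, resp_q, 'gld', correlate)
    finally:
        task.cancel()


skewed = asyncio.run(demo(correlate=False))
fixed = asyncio.run(demo(correlate=True))
print(skewed, fixed)
```

A lock around `call()` would not help here: the timed-out caller
releases the lock on cancellation while its response is still in
flight.
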
## Logging-chain audits

When records double-print or go bare (see gotchas.md):

```python
import logging
l = logging.getLogger('piker.brokers.ib.broker')
while l:
    print(l.name, l.level, l.handlers, l.propagate)
    l = l.parent
```

Exactly ONE stderr handler should exist in the chain,
attached by the actor's daemon fixture.

## Live actor-tree smoke (headless)

Boot against an ALT registry port so a user's running
stack is untouched; script in a REAL file (tractor
children re-exec `__main__` from path; stdin scripts
crash w/ `FileNotFoundError: .../<stdin>`):

```python
async with maybe_open_pikerd(
    registry_addrs=[('127.0.0.1', 6979)],
):
    async with open_feed(['xbtusdt.kraken']) as feed:
        assert await check_for_service('datad.kraken')
        assert not await check_for_service(
            'brokerd.kraken'
        )
```

## In-proc fail-fast unit checks

Spawn-path guards that raise BEFORE touching the
runtime can be tested w/ a bare `trio.run()` (eg
`spawn_brokerd('kucoin')` raising the datad-only
error); no pikerd needed.

@ -0,0 +1,116 @@
# Known gotchas: symptom -> cause -> fix

A registry of distributed-runtime failure modes hit
(and diagnosed) in the field; check here FIRST when a
log/traceback matches.

## "Can not order ..., no qualified contract cached"

- **Symptom**: `RuntimeError` from
  `ib.api.Client.submit_limit()` w/ empty
  `Client._contracts` in `brokerd.ib`.
- **Cause**: per-actor cache never warmed; feed-side
  qualification now lives in `datad.ib`.
- **Fix(ed)**: eager warmup at `open_trade_dialog()`
  start + lazy per-order `get_mkt_info()` +
  `cache_contract()` (writes BOTH `mkt.bs_fqme` and
  `mkt.fqme` keys; different consumers read each!).

## Search returns results for the WRONG pattern

- **Symptom**: fqme search for 'gld' returns nvda
  results; next query returns the prior query's set.
- **Cause**: `MethodProxy` channel off-by-one: a
  caller cancelled (search `move_on_after` timeout)
  after sending its request orphans the response;
  every later caller consumes the previous resp.
- **Fix(ed)**: `mid` req-id correlation in
  `_run_method()` + relay; stale resps are dropped w/
  a "Dropping stale method-resp" warning. If that
  warning spams, some caller is being cancelled
  mid-call habitually; find + fix its timeout.

## One bad request crashes a whole dialog/actor

- **Symptom**: `TrioTaskExited` storm + nursery
  teardown after a single method error (eg ambiguous
  contract `AttributeError`).
- **Cause**: exception escaped the aio-side relay
  loop (`open_aio_client_method_relay()`) killing
  channel + proxy nursery; caller-side `try/except`
  CANNOT catch it.
- **Fix(ed)**: relay catches `Exception` -> ships
  `{'exception': err, 'mid': ...}` resp; order
  handler converts to EMS `BrokerdError` msgs.

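A plain-`asyncio` sketch of the "catch and ship back" shape described
above. It is not the real `open_aio_client_method_relay()`: `relay()`,
`call()`, `find_contract()` and the listings table are hypothetical,
and only the `{'exception': err, 'mid': ...}` response shape is taken
from the entry above.

```python
import asyncio


async def relay(req_q, resp_q, methods):
    # Serve requests forever; a method error must never escape this loop.
    while True:
        req = await req_q.get()
        try:
            result = methods[req['meth']](*req['args'])
            resp = {'result': result, 'mid': req['mid']}
        except Exception as err:
            # Cancellation is a BaseException, so it still propagates.
            resp = {'exception': err, 'mid': req['mid']}
        await resp_q.put(resp)


async def call(req_q, resp_q, mid, meth, *args):
    await req_q.put({'meth': meth, 'args': args, 'mid': mid})
    resp = await resp_q.get()
    if 'exception' in resp:
        raise resp['exception']  # re-raised where the caller CAN catch it
    return resp['result']


def find_contract(symbol):
    listings = {'gld.arca': 'GLD@ARCA'}  # hypothetical lookup table
    return listings[symbol]  # KeyError for anything unknown


async def main():
    req_q, resp_q = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(
        relay(req_q, resp_q, {'find_contract': find_contract})
    )
    outcomes = []
    try:
        for mid, symbol in enumerate(['nope', 'gld.arca']):
            try:
                outcomes.append(
                    await call(req_q, resp_q, mid, 'find_contract', symbol)
                )
            except KeyError as err:
                outcomes.append(f'error: {err!r}')
    finally:
        task.cancel()
    return outcomes


outcomes = asyncio.run(main())
print(outcomes)
```

The second request succeeding is the point: the relay task survives
the first failure, so one bad request no longer tears down the dialog.
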
## Ambiguous ib contracts -> `NoneType` attr errors

- **Symptom**: `'NoneType' object has no attribute
  'primaryExchange'` in `find_contracts()`.
- **Cause**: `qualifyContractsAsync()` returns `None`
  entries for ambiguous (eg venue-less stonk fqme
  matching multiple listings: 'gld' -> ARCA/USD +
  VENTURE/CAD).
- **Fix(ed)**: filter `None`s + raise descriptive
  `ValueError` ("use 'gld.arca.ib'").

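The filter-then-raise fix might look roughly like the sketch below.
`drop_unqualified()` is a hypothetical helper and the strings stand in
for qualified contract objects; the actual change inside
`find_contracts()` may differ.

```python
def drop_unqualified(fqme, qualified):
    """Filter `None` entries; fail loudly if nothing usable remains."""
    contracts = [c for c in qualified if c is not None]
    if not contracts:
        raise ValueError(
            f'No unambiguous contract for {fqme!r}; '
            "add a venue, eg. use 'gld.arca.ib'"
        )
    return contracts


ok = drop_unqualified('gld.arca.ib', ['GLD@ARCA'])

try:
    drop_unqualified('gld.ib', [None])  # ambiguous: only a `None` entry
except ValueError as err:
    msg = str(err)

print(ok, msg)
```

Raising at the filter site turns a downstream `NoneType` attribute
error into a message that tells the user what to type instead.
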
## Double-printed log records (same task id, 2x)

- **Symptom**: every record from some subsys printed
  twice w/ identical task ids.
- **Cause**: stderr handlers attached at TWO levels
  of one logger-propagation chain (eg daemon fixture
  on `piker.brokers.ib` + an ep calling
  `get_console_log(name=__name__)` on the child).
  tractor's handler-dedup only checks the SAME
  logger, not ancestors.
- **Rule**: console handlers are attached ONCE per
  actor in the `_setup_persistent_*()` fixture; eps
  needing a different level use `log.setLevel()`
  ONLY, never `get_console_log()`.

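The double-emit mechanism is plain stdlib `logging` propagation and
can be shown without piker or tractor. In this sketch the `demo.*`
logger names and the `stream_handlers_in_chain()` helper are made up
for illustration, and a `StringIO` buffer stands in for stderr.

```python
import io
import logging


def stream_handlers_in_chain(logger):
    """Collect every StreamHandler a record from `logger` would reach."""
    found = []
    while logger:
        found += [
            h for h in logger.handlers
            if isinstance(h, logging.StreamHandler)
        ]
        if not logger.propagate:
            break  # records stop travelling upward here
        logger = logger.parent
    return found


buf = io.StringIO()  # stands in for stderr
logging.getLogger('demo').propagate = False  # isolate from the root logger
parent = logging.getLogger('demo.brokers.ib')
child = logging.getLogger('demo.brokers.ib.broker')

parent.addHandler(logging.StreamHandler(buf))  # "daemon fixture" level
child.addHandler(logging.StreamHandler(buf))   # stray per-ep handler
child.setLevel(logging.INFO)

child.info('order submitted')
doubled = buf.getvalue().count('order submitted')
n_handlers = len(stream_handlers_in_chain(child))
print(doubled, n_handlers)
```

Any count above one from the helper predicts duplicated output for
that logger; removing the child-level handler brings both numbers back
to one.
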
## Bare/non-colorized log lines

- **Symptom**: records w/ no timestamp/actor prefix.
- **Cause**: NO handler anywhere in the emitting
  logger's chain -> stdlib `logging.lastResort`. Post
  actor-splits, a daemon fixture may only cover its
  own subsys subtree (eg datad's `piker.data.*` but
  not the backend's `piker.brokers.<broker>.*`).
- **Fix(ed)**: `_setup_persistent_datad()` enables
  BOTH `piker.data.<broker>` and
  `piker.brokers.<broker>` subtrees.

## 2nd in-proc runtime boot wedges (~50%)

- **Symptom**: test hangs when one test proc boots a
  2nd `pikerd` (eg `test_multi_fill_positions`'s
  persistence re-check); a zombie `*.{broker}` child
  lingers w/ unread bytes in its parent-IPC Recv-Q.
- **Cause**: pre-existing `tractor`-main runtime
  teardown bug (confirmed independent of piker-layer
  changes via revert-testing 2026-06).
- **Mitigation**: run suites per-file wrapped in
  `timeout -k 5 300 ...`; retry once on rc 124/143.
  Do NOT chase as a regression of unrelated changes.

## ib client-id collisions post-split

- **Symptom**: 2nd ib daemon burns the full
  conn-timeout retry cycle connecting to gw/tws.
- **Cause**: `datad.ib` + `brokerd.ib` both default
  `client_id=6116` w/ linear `+i` retries.
- **Fix(ed)**: role-based offsets in
  `load_aio_clients()`: datad +16, ad-hoc (test/cli)
  actors +32.

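The arithmetic behind the collision and its fix, as a small sketch.
`candidate_ids()` and the `ROLE_OFFSETS` table are hypothetical names,
the retry count of 3 is arbitrary, and `brokerd` keeping offset 0 is
an assumption; only the base id and the +16/+32 offsets come from the
entry above.

```python
BASE_CLIENT_ID = 6116

# datad/ad-hoc offsets per the fix above; brokerd at +0 is an assumption.
ROLE_OFFSETS = {'brokerd': 0, 'datad': 16, 'adhoc': 32}
NO_OFFSETS = {'brokerd': 0, 'datad': 0, 'adhoc': 0}


def candidate_ids(role, retries=3, offsets=ROLE_OFFSETS):
    """Client ids a daemon of `role` tries, using linear `+i` retries."""
    start = BASE_CLIENT_ID + offsets[role]
    return [start + i for i in range(retries)]


# Pre-fix: both daemons walk the very same id sequence.
before = (
    set(candidate_ids('datad', offsets=NO_OFFSETS))
    & set(candidate_ids('brokerd', offsets=NO_OFFSETS))
)

# Post-fix: the sequences are disjoint while retries stay under 16.
after = set(candidate_ids('datad')) & set(candidate_ids('brokerd'))

print(sorted(before), sorted(after))
```

With a fixed stride of 16 between roles, the ranges only start to
overlap again if a daemon retries 16 or more times.
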
## `async_lifo_cache` skipped side-effects

- **Symptom**: a fn's cache-write side effect
  (eg `get_mkt_info()` -> `_contracts`) missing for
  a 2nd client/proxy.
- **Cause**: cache keys on POSITIONAL args only; a
  hit skips the body entirely.
- **Rule**: never rely on cached-fn side effects;
  perform required writes explicitly at the call
  site (eg `cache_contract()` after `get_mkt_info`).