piker/.claude/skills/piker-conc-expert/gotchas.md

Known gotchas: symptom -> cause -> fix

A registry of distributed-runtime failure modes hit (and diagnosed) in the field; check here FIRST when a log/traceback matches.

“Can not order …, no qualified contract cached”

  • Symptom: RuntimeError from ib.api.Client.submit_limit() w/ empty Client._contracts in brokerd.ib.
  • Cause: per-actor cache never warmed; feed-side qualification now lives in datad.ib.
  • Fix(ed): eager warmup at open_trade_dialog() start + lazy per-order get_mkt_info() + cache_contract() (writes BOTH mkt.bs_fqme and mkt.fqme keys; different consumers read each!).
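
The dual-key write can be sketched as below. This is a minimal illustration, not the actual piker code: `MktPair` is a hypothetical stand-in carrying only the two key attrs, the key strings are invented, and the cache is a plain dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MktPair:
    # Hypothetical stand-in: only the two attrs used as cache keys.
    bs_fqme: str
    fqme: str


def cache_contract(
    contracts: dict[str, object],
    mkt: MktPair,
    contract: object,
) -> None:
    # Write under BOTH keys since different consumers read each.
    contracts[mkt.bs_fqme] = contract
    contracts[mkt.fqme] = contract


contracts: dict[str, object] = {}
mkt = MktPair(bs_fqme='gld.arca', fqme='gld.arca.ib')  # illustrative values
cache_contract(contracts, mkt, contract='<qualified contract>')
```

Writing only one of the two keys would reproduce the empty-cache symptom for whichever consumer reads the other.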

Search returns results for the WRONG pattern

  • Symptom: fqme search for gld returns nvda results; next query returns the prior query's set.
  • Cause: MethodProxy channel off-by-one: a caller cancelled (search move_on_after timeout) after sending its request orphans the response, so every later caller consumes the previous resp.
  • Fix(ed): mid req-id correlation in _run_method() + relay; stale resps are dropped w/ a “Dropping stale method-resp” warning. If that warning spams, some caller is being cancelled mid-call habitually — find + fix its timeout.
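
The correlation idea, reduced to a synchronous toy: every request gets a fresh `mid`, and the reader discards any response whose `mid` is not its own. All names here (`send_request`, `recv_response`, the deque "channel") are hypothetical; the real fix lives in `_run_method()` and the relay.

```python
from collections import deque
from itertools import count

_mids = count(1)
to_caller: deque[dict] = deque()  # stand-in for the response channel


def send_request(meth: str) -> int:
    # Tag every request with a fresh message id ("mid").
    mid = next(_mids)
    # Simulated far end: answer immediately, echoing the same mid.
    to_caller.append({'mid': mid, 'result': f'{meth}-result'})
    return mid


def recv_response(expect_mid: int, dropped: list[int]) -> str:
    # Pop responses until the one matching OUR request shows up;
    # anything else is an orphan left behind by a cancelled caller.
    while to_caller:
        resp = to_caller.popleft()
        if resp['mid'] != expect_mid:
            dropped.append(resp['mid'])  # real code would log a warning
            continue
        return resp['result']
    raise LookupError(f'no response for mid={expect_mid}')


dropped: list[int] = []
orphan_mid = send_request('search:nvda')  # caller cancelled before reading
mid = send_request('search:gld')
result = recv_response(mid, dropped)
```

Without the `mid` check the second caller would receive the nvda response, which is exactly the off-by-one described above.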

One bad request crashes a whole dialog/actor

  • Symptom: TrioTaskExited storm + nursery teardown after a single method error (eg ambiguous contract AttributeError).
  • Cause: exception escaped the aio-side relay loop (open_aio_client_method_relay()) killing channel + proxy nursery; caller-side try/except CANNOT catch it.
  • Fix(ed): relay catches Exception -> ships {'exception': err, 'mid': ...} resp; order handler converts to EMS BrokerdError msgs.
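
A minimal sketch of the catch-and-ship pattern, using plain asyncio. `relay_one()`, `bad_method()` and `good_method()` are hypothetical; only the `{'exception': err, 'mid': ...}` resp shape comes from the entry above.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def relay_one(
    meth: Callable[..., Awaitable[Any]],
    mid: int,
    **kwargs: Any,
) -> dict[str, Any]:
    # Run one method call; never let its error escape the relay loop.
    try:
        return {'result': await meth(**kwargs), 'mid': mid}
    except Exception as err:
        # Ship the error back as data, correlated by mid.
        return {'exception': err, 'mid': mid}


async def bad_method(**kwargs: Any) -> None:
    # Hypothetical failing method, mimics the ambiguous-contract case.
    raise AttributeError(
        "'NoneType' object has no attribute 'primaryExchange'"
    )


async def good_method(x: int) -> int:
    return x * 2


async def main() -> list[dict[str, Any]]:
    # The relay survives the bad request and serves the next one.
    return [
        await relay_one(bad_method, mid=1),
        await relay_one(good_method, mid=2, x=21),
    ]


resps = asyncio.run(main())
```

Catching `Exception` rather than `BaseException` leaves cancellation free to propagate, so teardown still works as usual.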

Ambiguous ib contracts -> NoneType attr errors

  • Symptom: 'NoneType' object has no attribute 'primaryExchange' in find_contracts().
  • Cause: qualifyContractsAsync() returns None entries for ambiguous contracts (eg a venue-less stonk fqme matching multiple listings: gld -> ARCA/USD + VENTURE/CAD).
  • Fix(ed): filter Nones + raise descriptive ValueError (“use gld.arca.ib”).
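
Roughly, the guard looks like the sketch below. `check_qualified()` is a hypothetical helper, and raising only when nothing qualifies is an assumption; the actual condition in `find_contracts()` may differ.

```python
def check_qualified(fqme: str, contracts: list) -> list:
    # Drop the None entries returned for ambiguous lookups.
    qualified = [con for con in contracts if con is not None]
    if not qualified:
        # Descriptive error instead of a later NoneType attr error.
        raise ValueError(
            f'No unambiguous contract for {fqme!r}; '
            f'specify a venue (eg gld.arca.ib)'
        )
    return qualified
```

Usage: `check_qualified('gld', [None])` raises the `ValueError`, while `check_qualified('gld.arca.ib', [con])` returns `[con]`.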

Double-printed log records (same task id, 2x)

  • Symptom: every record from some subsys printed twice w/ identical task ids.
  • Cause: stderr handlers attached at TWO levels of one logger-propagation chain (eg daemon fixture on piker.brokers.ib + an ep calling get_console_log(name=__name__) on the child). tractor's handler-dedup only checks the SAME logger, not ancestors.
  • Rule: console handlers are attached ONCE per actor in the _setup_persistent_*() fixture; eps needing a different level use log.setLevel() ONLY, never get_console_log().
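
The propagation behaviour is reproducible w/ stdlib `logging` alone. The sketch below uses a hypothetical `demo_dup.*` logger tree and a list-collecting handler instead of stderr; it is not tractor's or piker's logging setup.

```python
import logging


class ListHandler(logging.Handler):
    # Collects messages into a list instead of writing to stderr.
    def __init__(self, sink: list[str]) -> None:
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(record.getMessage())


sink: list[str] = []
parent = logging.getLogger('demo_dup.brokers.ib')
child = logging.getLogger('demo_dup.brokers.ib.api')
parent.setLevel(logging.INFO)

# Anti-pattern: a handler on the parent AND on the child.
parent_handler = ListHandler(sink)
child_handler = ListHandler(sink)
parent.addHandler(parent_handler)
child.addHandler(child_handler)
child.info('order submitted')
doubled = list(sink)  # same record, emitted twice

# Rule: keep ONE handler up the chain; tune the child w/ setLevel() only.
child.removeHandler(child_handler)
child.setLevel(logging.DEBUG)
sink.clear()
child.debug('order submitted')
single = list(sink)
```

Note that the child's `DEBUG` record still reaches the parent's handler: propagated records are filtered by handler levels, not by the ancestor logger's level.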

Bare/non-colorized log lines

  • Symptom: records w/ no timestamp/actor prefix.
  • Cause: NO handler anywhere in the emitting logger's chain -> stdlib logging.lastResort. Post actor-splits, a daemon fixture may only cover its own subsys subtree (eg datad's piker.data.* but not the backend's piker.brokers.<broker>.*).
  • Fix(ed): _setup_persistent_datad() enables BOTH piker.data.<broker> and piker.brokers.<broker> subtrees.
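
A quick way to check coverage is stdlib `Logger.hasHandlers()`, which walks the logger and its ancestors. The sketch below uses a hypothetical, isolated `demo_bare.*` tree and `NullHandler` as a stand-in for the console handler.

```python
import logging

# Isolated, hypothetical hierarchy so real loggers stay untouched.
top = logging.getLogger('demo_bare')
top.propagate = False  # stop the handler search at this logger

# Daemon fixture covers only the data subtree...
logging.getLogger('demo_bare.data').addHandler(logging.NullHandler())

data_log = logging.getLogger('demo_bare.data.ib')
backend_log = logging.getLogger('demo_bare.brokers.ib')

# hasHandlers() checks this logger and every ancestor it propagates to.
data_covered = data_log.hasHandlers()
backend_covered = backend_log.hasHandlers()  # no handler -> lastResort

# Fix: enable the backend subtree as well.
logging.getLogger('demo_bare.brokers').addHandler(logging.NullHandler())
backend_covered_after = backend_log.hasHandlers()
```

A logger for which `hasHandlers()` is false is the one whose records come out bare via `logging.lastResort`.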

2nd in-proc runtime boot wedges (~50%)

  • Symptom: test hangs when one test proc boots a 2nd pikerd (eg test_multi_fill_positions's persistence re-check); a zombie *.{broker} child lingers w/ unread bytes in its parent-IPC Recv-Q.
  • Cause: pre-existing tractor-main runtime teardown bug (confirmed independent of piker-layer changes via revert-testing 2026-06).
  • Mitigation: run suites per-file wrapped in timeout -k 5 300 ...; retry once on rc 124/143. Do NOT chase as a regression of unrelated changes.
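
If the per-file runs are driven from a script, the wrap-and-retry-once policy might look like this sketch. `wrap_cmd()` and `run_with_one_retry()` are hypothetical helpers; the test command itself is left to the caller.

```python
from typing import Callable, Sequence

# rc values the mitigation retries on: 124 is coreutils timeout's
# timed-out status, 143 is 128 + SIGTERM as reported by a shell.
TIMEOUT_RCS = (124, 143)


def wrap_cmd(cmd: Sequence[str]) -> list[str]:
    # Prefix a per-file test command w/ the timeout wrapper.
    return ['timeout', '-k', '5', '300', *cmd]


def run_with_one_retry(run: Callable[[], int]) -> int:
    # Re-run exactly once when the first attempt ends w/ a timeout rc.
    rc = run()
    if rc in TIMEOUT_RCS:
        rc = run()
    return rc
```

Here `run` would typically be something like `lambda: subprocess.run(wrap_cmd(cmd)).returncode`; a second timeout rc is returned as-is rather than retried again.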

ib client-id collisions post-split

  • Symptom: 2nd ib daemon burns the full conn-timeout retry cycle connecting to gw/tws.
  • Cause: datad.ib + brokerd.ib both default client_id=6116 w/ linear +i retries.
  • Fix(ed): role-based offsets in load_aio_clients(): datad +16, ad-hoc (test/cli) actors +32.
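
The arithmetic, as a sketch: the base id and the +16/+32 offsets are from the entry above, while the role names, the `client_ids()` helper and brokerd keeping offset 0 are assumptions for illustration.

```python
BASE_CLIENT_ID = 6116

# Offsets per role; names are illustrative.
ROLE_OFFSETS = {
    'brokerd': 0,   # assumed: keeps the default id
    'datad': 16,
    'adhoc': 32,    # test/cli actors
}


def client_ids(role: str, tries: int = 3) -> list[int]:
    # Ids attempted in order, w/ linear +i retries.
    start = BASE_CLIENT_ID + ROLE_OFFSETS[role]
    return [start + i for i in range(tries)]
```

With these offsets `client_ids('brokerd')` starts at 6116 and `client_ids('datad')` at 6132, so the retry ranges stay disjoint as long as a role needs fewer than 16 attempts.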

async_lifo_cache skipped side-effects

  • Symptom: a fn's cache-write side effect (eg get_mkt_info() -> _contracts) missing for a 2nd client/proxy.
  • Cause: cache keys on POSITIONAL args only; a hit skips the body entirely.
  • Rule: never rely on cached-fn side effects; perform required writes explicitly at the call site (eg cache_contract() after get_mkt_info).
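
The skipped side effect is easy to reproduce w/ a toy decorator. `positional_cache` below is a hypothetical stand-in that only mimics the "key on positional args, skip the body on a hit" behaviour; it is not the real `async_lifo_cache`, and this `get_mkt_info()` is a dummy.

```python
import asyncio
from functools import wraps


def positional_cache(fn):
    # Hypothetical stand-in: caches on POSITIONAL args only.
    cache: dict[tuple, object] = {}

    @wraps(fn)
    async def wrapper(*args, **kwargs):
        if args in cache:
            return cache[args]  # hit: fn body never runs
        cache[args] = result = await fn(*args, **kwargs)
        return result

    return wrapper


@positional_cache
async def get_mkt_info(fqme: str, *, contracts: dict) -> str:
    mkt = f'mkt:{fqme}'
    contracts[fqme] = mkt  # side effect, only on a cache miss
    return mkt


async def main() -> tuple[dict, dict, dict]:
    first: dict = {}
    second: dict = {}
    fixed: dict = {}
    await get_mkt_info('gld.arca.ib', contracts=first)
    await get_mkt_info('gld.arca.ib', contracts=second)  # hit, no write
    # Explicit write at the call site instead of relying on the body.
    mkt = await get_mkt_info('gld.arca.ib', contracts={})
    fixed['gld.arca.ib'] = mkt
    return first, second, fixed


first, second, fixed = asyncio.run(main())
```

The second client's dict stays empty because the keyword-only `contracts` arg never enters the cache key; the explicit write is what makes the third case correct.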