Known gotchas: symptom -> cause -> fix
A registry of distributed-runtime failure modes hit (and diagnosed) in the field; check here FIRST when a log/traceback matches.
“Can not order …, no qualified contract cached”
- Symptom: `RuntimeError` from `ib.api.Client.submit_limit()` w/ empty `Client._contracts` in `brokerd.ib`.
- Cause: per-actor cache never warmed; feed-side qualification now lives in `datad.ib`.
- Fix(ed): eager warmup at `open_trade_dialog()` start + lazy per-order `get_mkt_info()` + `cache_contract()` (writes BOTH `mkt.bs_fqme` and `mkt.fqme` keys; different consumers read each!).
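A minimal sketch of the dual-key write, assuming a plain dict stands in for `Client._contracts`. `MktInfo` is a hypothetical stand-in, the key values are made up for illustration, and the real `cache_contract()` signature may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MktInfo:
    # hypothetical stand-in for the real market-info type
    fqme: str
    bs_fqme: str


def cache_contract(cache: dict, mkt: MktInfo, contract: object) -> None:
    # write under BOTH keys: some consumers look up by `bs_fqme`,
    # others by `fqme`, so a single-key write leaves one set missing.
    cache[mkt.bs_fqme] = contract
    cache[mkt.fqme] = contract


contracts: dict = {}
mkt = MktInfo(fqme='gld.arca.ib', bs_fqme='gld.arca')  # illustrative values
cache_contract(contracts, mkt, contract='<qualified contract>')
```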
Search returns results for the WRONG pattern
- Symptom: fqme search for ‘gld’ returns nvda results; next query returns the prior query’s set.
- Cause: `MethodProxy` channel off-by-one: a caller cancelled (search `move_on_after` timeout) after sending its request orphans the response; every later caller consumes the previous resp.
- Fix(ed): `mid` req-id correlation in `_run_method()` + relay; stale resps are dropped w/ a “Dropping stale method-resp” warning. If that warning spams, some caller is being cancelled mid-call habitually; find + fix its timeout.
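The correlation idea can be sketched w/ a toy in-memory channel. Everything here (`send()`, `recv()`, the `deque` channel) is hypothetical and only illustrates the drop-stale-by-`mid` logic, not the real `_run_method()` code.

```python
import itertools
from collections import deque

mids = itertools.count()
chan: deque = deque()  # stands in for the response side of the channel
dropped: list = []     # real code would log a warning instead


def send(pattern: str) -> int:
    # fake remote: replies immediately, echoing the request's mid
    mid = next(mids)
    chan.append({'mid': mid, 'result': f'results for {pattern}'})
    return mid


def recv(mid: int) -> str:
    # consume responses until ours shows up; anything else is stale
    while chan:
        resp = chan.popleft()
        if resp['mid'] == mid:
            return resp['result']
        dropped.append(resp['mid'])
    raise LookupError(f'no response for mid={mid}')


send('nvda')        # caller is cancelled before reading: orphaned resp
mid = send('gld')
result = recv(mid)  # skips the orphaned 'nvda' resp
```

Without the `mid` check the second caller would have received the orphaned 'nvda' response, which is exactly the symptom above.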
One bad request crashes a whole dialog/actor
- Symptom: `TrioTaskExited` storm + nursery teardown after a single method error (eg ambiguous contract `AttributeError`).
- Cause: exception escaped the aio-side relay loop (`open_aio_client_method_relay()`) killing channel + proxy nursery; caller-side `try/except` CANNOT catch it.
- Fix(ed): relay catches `Exception` -> ships `{'exception': err, 'mid': ...}` resp; order handler converts to EMS `BrokerdError` msgs.
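A rough sketch of the catch-and-ship pattern using plain `asyncio` only (no trio bridging). `relay_one()`, `relay()` and the fake methods are hypothetical; only the resp shape follows the fix described above.

```python
import asyncio


async def find_contracts(pattern: str):
    contract = None  # simulate an ambiguous qualification result
    return contract.primaryExchange  # raises AttributeError


async def ping() -> str:
    return 'pong'


async def relay_one(methods: dict, req: dict) -> dict:
    # run one request; an error becomes a resp instead of escaping the loop
    try:
        result = await methods[req['method']](*req['args'])
    except Exception as err:
        return {'exception': err, 'mid': req['mid']}
    return {'result': result, 'mid': req['mid']}


async def relay(methods: dict, reqs: list) -> list:
    return [await relay_one(methods, req) for req in reqs]


methods = {'find_contracts': find_contracts, 'ping': ping}
reqs = [
    {'mid': 0, 'method': 'find_contracts', 'args': ('gld',)},
    {'mid': 1, 'method': 'ping', 'args': ()},
]
resps = asyncio.run(relay(methods, reqs))
```

The bad request yields an error resp while the next request is still served, so one failure no longer tears down the relay.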
Ambiguous ib contracts -> NoneType attr errors
- Symptom: `'NoneType' object has no attribute 'primaryExchange'` in `find_contracts()`.
- Cause: `qualifyContractsAsync()` returns `None` entries for ambiguous contracts (eg venue-less stonk fqme matching multiple listings: ‘gld’ -> ARCA/USD + VENTURE/CAD).
- Fix(ed): filter `None`s + raise descriptive `ValueError` (“use ‘gld.arca.ib’”).
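The filter-then-raise step might look roughly like this. `pick_contract()` is a hypothetical helper and the error wording is illustrative; the input list stands in for whatever `qualifyContractsAsync()` returned.

```python
def pick_contract(fqme: str, qualified: list) -> object:
    # drop the `None` entries left behind by ambiguous qualifications
    hits = [contract for contract in qualified if contract is not None]
    if not hits:
        # fail early w/ an actionable msg instead of a later NoneType error
        raise ValueError(
            f'Ambiguous or unknown contract for {fqme!r}; '
            f"use a venue-qualified fqme, eg 'gld.arca.ib'"
        )
    return hits[0]
```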
Double-printed log records (same task id, 2x)
- Symptom: every record from some subsys printed twice w/ identical task ids.
- Cause: stderr handlers attached at TWO levels of one logger-propagation chain (eg daemon fixture on `piker.brokers.ib` + an ep calling `get_console_log(name=__name__)` on the child). tractor’s handler-dedup only checks the SAME logger, not ancestors.
- Rule: console handlers are attached ONCE per actor in the `_setup_persistent_*()` fixture; eps needing a different level use `log.setLevel()` ONLY, never `get_console_log()`.
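The duplication is reproducible w/ the stdlib alone. A small sketch using made-up `demo.*` logger names and an in-memory stream in place of stderr:

```python
import io
import logging

buf = io.StringIO()
parent = logging.getLogger('demo.brokers.ib')
child = logging.getLogger('demo.brokers.ib.feed')
parent.setLevel(logging.INFO)
parent.propagate = False  # keep the demo isolated from the root logger

parent.addHandler(logging.StreamHandler(buf))  # the "daemon fixture" handler
child.addHandler(logging.StreamHandler(buf))   # 2nd handler on the child: the bug
child.info('hello')
doubled = buf.getvalue().count('hello')  # emitted by child AND parent handlers

# the rule: drop the child's handler and only adjust its level
child.handlers.clear()
child.setLevel(logging.DEBUG)
buf.seek(0)
buf.truncate()
child.debug('hello')
single = buf.getvalue().count('hello')  # only the parent handler emits
```

Note the child's `DEBUG` record still reaches the parent's handler, since propagated records are filtered by handler level, not by the ancestor logger's level.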
Bare/non-colorized log lines
- Symptom: records w/ no timestamp/actor prefix.
- Cause: NO handler anywhere in the emitting logger’s chain -> stdlib `logging.lastResort`. Post actor-splits, a daemon fixture may only cover its own subsys subtree (eg datad’s `piker.data.*` but not the backend’s `piker.brokers.<broker>.*`).
- Fix(ed): `_setup_persistent_datad()` enables BOTH `piker.data.<broker>` and `piker.brokers.<broker>` subtrees.
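A quick way to check subtree coverage is the stdlib's `Logger.hasHandlers()`, which walks the propagation chain. The sketch below uses made-up `demo_app.*` names to mimic the split; it is a diagnostic aid, not the fixture code.

```python
import io
import logging

top = logging.getLogger('demo_app')
top.propagate = False  # isolate the demo from any real root handlers

data_log = logging.getLogger('demo_app.data.ib')
brokers_log = logging.getLogger('demo_app.brokers.ib')

# a fixture that only covers the data subtree
stream = io.StringIO()
logging.getLogger('demo_app.data').addHandler(logging.StreamHandler(stream))
before = (data_log.hasHandlers(), brokers_log.hasHandlers())

# cover the backend subtree as well
logging.getLogger('demo_app.brokers').addHandler(logging.StreamHandler(stream))
after = (data_log.hasHandlers(), brokers_log.hasHandlers())
```

A logger reporting `False` here has no handler in its chain, so its records at `WARNING` and above go through `logging.lastResort` and print bare.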
2nd in-proc runtime boot wedges (~50%)
- Symptom: test hangs when one test proc boots a 2nd `pikerd` (eg `test_multi_fill_positions`’s persistence re-check); a zombie `*.{broker}` child lingers w/ unread bytes in its parent-IPC Recv-Q.
- Cause: pre-existing `tractor`-main runtime teardown bug (confirmed independent of piker-layer changes via revert-testing 2026-06).
- Mitigation: run suites per-file wrapped in `timeout -k 5 300 ...`; retry once on rc 124/143. Do NOT chase as a regression of unrelated changes.
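One possible wrapper for the mitigation, as a sketch only: `run_suite()` is a hypothetical helper, the retry set just mirrors the rc values listed above, and the `run` parameter exists so the retry logic can be exercised without spawning processes.

```python
import subprocess
from typing import Callable, Optional, Sequence

RETRY_RCS = {124, 143}  # rc values named in the mitigation above


def run_suite(
    cmd: Sequence[str],
    run: Optional[Callable[[list], int]] = None,
    retries: int = 1,
) -> int:
    if run is None:
        # default runner: spawn the wrapped command and return its rc
        def run(argv: list) -> int:
            return subprocess.run(argv).returncode

    argv = ['timeout', '-k', '5', '300', *cmd]
    rc = run(argv)
    for _ in range(retries):
        if rc not in RETRY_RCS:
            break
        rc = run(argv)  # retry only on a timeout-style rc
    return rc
```

Any other non-zero rc is returned as-is, so genuine test failures are never retried.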
ib client-id collisions post-split
- Symptom: 2nd ib daemon burns the full conn-timeout retry cycle connecting to gw/tws.
- Cause: `datad.ib` + `brokerd.ib` both default `client_id=6116` w/ linear `+i` retries.
- Fix(ed): role-based offsets in `load_aio_clients()`: datad +16, ad-hoc (test/cli) actors +32.
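The arithmetic behind the offsets, as a sketch. `client_id_for()` and the role names are hypothetical, and a `brokerd` offset of 0 is an assumption (the notes only list the datad and ad-hoc offsets).

```python
BASE_CLIENT_ID = 6116
ROLE_OFFSETS = {
    'brokerd': 0,  # assumed: keeps the original default
    'datad': 16,
    'adhoc': 32,   # test/cli actors
}


def client_id_for(role: str, retry: int = 0) -> int:
    # linear `+i` retry on top of a per-role base
    return BASE_CLIENT_ID + ROLE_OFFSETS[role] + retry


# ids each role can reach within its first 16 attempts
ranges = {
    role: {client_id_for(role, i) for i in range(16)}
    for role in ROLE_OFFSETS
}
```

With 16-wide spacing the roles only start colliding again after 16 consecutive retries.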
async_lifo_cache skipped side-effects
- Symptom: a fn’s cache-write side effect (eg `get_mkt_info()` -> `_contracts`) missing for a 2nd client/proxy.
- Cause: cache keys on POSITIONAL args only; a hit skips the body entirely.
- Rule: never rely on cached-fn side effects; perform required writes explicitly at the call site (eg `cache_contract()` after `get_mkt_info`).
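The skipped side effect can be shown w/ a toy decorator. `positional_cache` below is NOT the real `async_lifo_cache`, just a minimal stand-in that also keys on positional args only; the fake `get_mkt_info()` and its `contracts` kwarg are likewise hypothetical.

```python
import asyncio


def positional_cache(fn):
    cache: dict = {}

    async def wrapper(*args, **kwargs):
        if args in cache:
            return cache[args]  # hit: the body (and its side effects) never runs
        result = cache[args] = await fn(*args, **kwargs)
        return result

    return wrapper


@positional_cache
async def get_mkt_info(fqme: str, contracts: dict = None) -> dict:
    if contracts is not None:
        contracts[fqme] = '<contract>'  # side effect hidden inside a cached fn
    return {'fqme': fqme}


async def main() -> tuple:
    first, second = {}, {}
    await get_mkt_info('gld.arca.ib', contracts=first)
    mkt = await get_mkt_info('gld.arca.ib', contracts=second)  # cache hit
    missed = dict(second)  # snapshot: nothing was written for the 2nd client
    second[mkt['fqme']] = '<contract>'  # the rule: explicit write at the call site
    return first, missed, second


first, missed, second = asyncio.run(main())
```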