Known gotchas: symptom -> cause -> fix

A registry of distributed-runtime failure modes hit (and diagnosed) in the field; check here FIRST when a log/traceback matches.

“Can not order …, no qualified contract cached”

Symptom: RuntimeError from ib.api.Client.submit_limit() w/ empty Client._contracts in brokerd.ib.
Cause: per-actor cache never warmed; feed-side qualification now lives in datad.ib.
Fix(ed): eager warmup at open_trade_dialog() start + lazy per-order get_mkt_info() + cache_contract() (writes BOTH mkt.bs_fqme and mkt.fqme keys; different consumers read each!).
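The dual-key write can be sketched as below. This is a minimal illustration only: the MktPair stand-in, the cache_contract() signature and the example key values are assumptions, not the real piker types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MktPair:
    # hypothetical stand-in carrying only the two key attrs named above
    fqme: str
    bs_fqme: str


def cache_contract(contracts: dict, mkt: MktPair, contract: object) -> None:
    # write under BOTH keys: different consumers look up each one
    contracts[mkt.bs_fqme] = contract
    contracts[mkt.fqme] = contract


contracts: dict = {}
# key values here are illustrative only
mkt = MktPair(fqme='gld.arca.ib', bs_fqme='gld.arca')
cache_contract(contracts, mkt, contract='<qualified-contract>')
```

Writing only one of the two keys reproduces the symptom for whichever consumer reads the other.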

Symptom: fqme search for ‘gld’ returns nvda results; next query returns the prior query’s set.
Cause: MethodProxy channel off-by-one: a caller that is cancelled (eg by the search's move_on_after timeout) after sending its request never reads its response; the orphaned resp stays queued, so every later caller consumes the previous call's resp.
Fix(ed): per-request mid (req-id) correlation in _run_method() + relay; stale resps are dropped w/ a “Dropping stale method-resp” warning. If that warning spams, some caller is habitually being cancelled mid-call; find + fix its timeout.
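A synchronous sketch of the correlation idea, under stated assumptions: MethodProxySketch, its deque "channel" and the resp layout are hypothetical stand-ins for the real trio/asyncio plumbing. It only shows how tagging each request with a mid lets the receiver skip an orphaned resp.

```python
import itertools
from collections import deque


class MethodProxySketch:
    def __init__(self):
        self._mids = itertools.count()
        self.chan = deque()  # stands in for the response side of the channel
        self.dropped = []    # the real code logs a warning instead

    def send_request(self, meth: str) -> int:
        mid = next(self._mids)
        # the "far side" answers immediately, echoing the request's mid
        self.chan.append({'mid': mid, 'result': f'{meth}-result'})
        return mid

    def recv_response(self, mid: int) -> str:
        while self.chan:
            resp = self.chan.popleft()
            if resp['mid'] != mid:
                # stale resp left behind by a cancelled caller: drop it
                self.dropped.append(resp)
                continue
            return resp['result']
        raise RuntimeError(f'no response for mid={mid}')


proxy = MethodProxySketch()
proxy.send_request('search:nvda')       # caller cancelled before reading
mid = proxy.send_request('search:gld')  # next caller
result = proxy.recv_response(mid)
```

Without the mid check the second caller would pop the nvda resp, which is exactly the off-by-one described above.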

Symptom: TrioTaskExited storm + nursery teardown after a single method error (eg ambiguous contract AttributeError).
Cause: exception escaped the aio-side relay loop (open_aio_client_method_relay()) killing channel + proxy nursery; caller-side try/except CANNOT catch it.
Fix(ed): relay catches Exception -> ships {'exception': err, 'mid': ...} resp; order handler converts to EMS BrokerdError msgs.
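The catch-and-ship pattern, reduced to one relay iteration. relay_one() and ambiguous_lookup() are hypothetical names for illustration; only the {'exception': err, 'mid': ...} resp shape comes from the entry above, and the 'result' key is an assumption.

```python
def relay_one(method, mid, *args):
    # one iteration of a relay loop: errors become data, never escape
    try:
        return {'result': method(*args), 'mid': mid}
    except Exception as err:
        return {'exception': err, 'mid': mid}


def ambiguous_lookup(fqme):
    # simulates the kind of per-method failure described above
    raise AttributeError(
        "'NoneType' object has no attribute 'primaryExchange'"
    )


err_resp = relay_one(ambiguous_lookup, 7, 'gld')
ok_resp = relay_one(str.upper, 8, 'gld')
```

The caller side can then inspect the resp for an 'exception' key and re-raise or convert it, instead of losing the whole channel.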

Symptom: 'NoneType' object has no attribute 'primaryExchange' in find_contracts().
Cause: qualifyContractsAsync() returns None entries for ambiguous contracts (eg a venue-less stonk fqme matching multiple listings: ‘gld’ -> ARCA/USD + VENTURE/CAD).
Fix(ed): filter Nones + raise descriptive ValueError (“use ‘gld.arca.ib’”).
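A sketch of the filter-then-raise step. require_qualified() is a hypothetical helper, the error text is illustrative rather than the real message, and raising only when nothing survives the filter is an assumption.

```python
def require_qualified(fqme: str, qualified: list) -> list:
    # drop the None placeholders returned for ambiguous contracts
    contracts = [c for c in qualified if c is not None]
    if not contracts:
        # illustrative wording; point the user at a venue-qualified fqme
        raise ValueError(
            f'No unambiguous contract for {fqme!r}; '
            f'add a venue, eg "<symbol>.<venue>.ib"'
        )
    return contracts


kept = require_qualified('gld.arca.ib', ['<contract>', None])

try:
    require_qualified('gld', [None])
except ValueError as err:
    msg = str(err)
```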

Symptom: every record from some subsys printed twice w/ identical task ids.
Cause: stderr handlers attached at TWO levels of one logger-propagation chain (eg daemon fixture on piker.brokers.ib + an ep calling get_console_log(name=__name__) on the child). tractor’s handler-dedup only checks the SAME logger, not ancestors.
Rule: console handlers are attached ONCE per actor in the _setup_persistent_*() fixture; eps needing a different level use log.setLevel() ONLY, never get_console_log().
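The doubling is plain stdlib logging propagation and can be reproduced without tractor. The logger names and CountingHandler below are hypothetical demo fixtures; the second half applies the rule (one handler, level changes via setLevel() only).

```python
import logging


class CountingHandler(logging.Handler):
    # records every message it is asked to emit
    def __init__(self, sink: list):
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(record.getMessage())


seen: list = []
parent = logging.getLogger('gotcha_dup.brokers.ib')
child = logging.getLogger('gotcha_dup.brokers.ib.api')
child.setLevel(logging.INFO)

# handlers at TWO levels of one propagation chain -> every record twice
parent.addHandler(CountingHandler(seen))
child.addHandler(CountingHandler(seen))
child.info('order submitted')
doubled = len(seen)

# the rule: keep ONE handler on the ancestor, adjust levels only
child.handlers.clear()
seen.clear()
child.setLevel(logging.DEBUG)
child.debug('order submitted')
single = len(seen)
```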

Symptom: records w/ no timestamp/actor prefix.
Cause: NO handler anywhere in the emitting logger’s chain -> stdlib logging.lastResort. Post actor-splits, a daemon fixture may only cover its own subsys subtree (eg datad’s piker.data.* but not the backend’s piker.brokers.<broker>.*).
Fix(ed): _setup_persistent_datad() enables BOTH piker.data.<broker> and piker.brokers.<broker> subtrees.
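One way to spot an uncovered subtree is the stdlib Logger.hasHandlers(), which walks the propagation chain. The sketch below uses hypothetical demo logger names and sets propagate = False on the demo root so any handlers on the real root logger do not mask the result.

```python
import logging


def uncovered(names: list) -> list:
    # loggers with NO handler anywhere in their chain fall back to
    # logging.lastResort (bare message, no timestamp/actor prefix)
    return [n for n in names if not logging.getLogger(n).hasHandlers()]


top = logging.getLogger('gotcha_cov')
top.propagate = False  # isolate the demo from root-level handlers

# the fixture only covers the data subtree...
logging.getLogger('gotcha_cov.data').addHandler(logging.NullHandler())

missing = uncovered([
    'gotcha_cov.data.ib',
    'gotcha_cov.brokers.ib',  # ...so the backend subtree is uncovered
])
```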

Symptom: test hangs when one test proc boots a 2nd pikerd (eg test_multi_fill_positions’s persistence re-check); a zombie *.{broker} child lingers w/ unread bytes in its parent-IPC Recv-Q.
Cause: pre-existing tractor-main runtime teardown bug (confirmed independent of piker-layer changes via revert-testing 2026-06).
Mitigation: run suites per-file wrapped in timeout -k 5 300 ...; retry once on rc 124/143. Do NOT chase as a regression of unrelated changes.
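The retry-once policy can be sketched as a small wrapper. run_with_retry() and the injectable run parameter are hypothetical; the caller supplies the full command, including the timeout -k 5 300 prefix. rc 124 is what coreutils timeout returns on expiry and 143 is 128 + SIGTERM.

```python
import subprocess
from types import SimpleNamespace

RETRY_RCS = {124, 143}  # timeout expiry, killed by SIGTERM


def run_with_retry(cmd: list, run=subprocess.run) -> int:
    # at most two attempts: the original run plus one retry
    for _attempt in range(2):
        rc = run(cmd).returncode
        if rc not in RETRY_RCS:
            break
    return rc


# demo with a fake runner: first attempt hangs (124), retry passes
fake_rcs = iter([124, 0])
calls: list = []


def fake_run(cmd):
    calls.append(cmd)
    return SimpleNamespace(returncode=next(fake_rcs))


final_rc = run_with_retry(['<wrapped-test-cmd>'], run=fake_run)
```

A second consecutive 124/143 is returned as-is, so a persistent hang still fails the run.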

Symptom: 2nd ib daemon burns the full conn-timeout retry cycle connecting to gw/tws.
Cause: datad.ib + brokerd.ib both default client_id=6116 w/ linear +i retries.
Fix(ed): role-based offsets in load_aio_clients(): datad +16, ad-hoc (test/cli) actors +32.
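The offset arithmetic, as a sketch. The role names, the zero offset for brokerd and the retry count are assumptions; only the 6116 base, the +16 and +32 offsets and the linear +i retries come from the entry above.

```python
BASE_CLIENT_ID = 6116

# brokerd keeping the bare base id is an assumption
ROLE_OFFSETS = {'brokerd': 0, 'datad': 16, 'adhoc': 32}


def candidate_client_ids(role: str, retries: int = 3) -> list:
    # linear +i retries, starting from the role's own offset
    start = BASE_CLIENT_ID + ROLE_OFFSETS[role]
    return [start + i for i in range(retries)]


brokerd_ids = candidate_client_ids('brokerd')
datad_ids = candidate_client_ids('datad')
adhoc_ids = candidate_client_ids('adhoc')
```

As long as the retry count stays below 16, the per-role id ranges cannot overlap.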

Symptom: a fn’s cache-write side effect (eg get_mkt_info() -> _contracts) missing for a 2nd client/proxy.
Cause: cache keys on POSITIONAL args only; a hit skips the body entirely.
Rule: never rely on cached-fn side effects; perform required writes explicitly at the call site (eg cache_contract() after get_mkt_info).
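A minimal reproduction of the skipped side effect. positional_cache is a hypothetical decorator written for this sketch (it is not the real cache implementation), and this get_mkt_info() is a toy with an invented signature.

```python
import functools


def positional_cache(fn):
    results: dict = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # key on POSITIONAL args only; kwargs are ignored for lookup
        if args not in results:
            results[args] = fn(*args, **kwargs)
        return results[args]  # on a hit the body never runs

    return wrapper


@positional_cache
def get_mkt_info(fqme: str, *, contracts: dict) -> dict:
    contracts[fqme] = f'<contract:{fqme}>'  # side effect
    return {'fqme': fqme}


client_a: dict = {}
client_b: dict = {}
get_mkt_info('gld.arca.ib', contracts=client_a)        # miss: body runs
mkt = get_mkt_info('gld.arca.ib', contracts=client_b)  # hit: body skipped
missing_before = 'gld.arca.ib' not in client_b

# the rule: do the required write explicitly at the call site
client_b[mkt['fqme']] = f"<contract:{mkt['fqme']}>"
```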