# Known gotchas: symptom -> cause -> fix

A registry of distributed-runtime failure modes hit
(and diagnosed) in the field; check here FIRST when a
log/traceback matches.

## "Can not order ..., no qualified contract cached"

- **Symptom**: `RuntimeError` from
  `ib.api.Client.submit_limit()` w/ empty
  `Client._contracts` in `brokerd.ib`.
- **Cause**: per-actor cache never warmed; feed-side
  qualification now lives in `datad.ib`.
- **Fix(ed)**: eager warmup at `open_trade_dialog()`
  start + lazy per-order `get_mkt_info()` +
  `cache_contract()` (writes BOTH `mkt.bs_fqme` and
  `mkt.fqme` keys; different consumers read each!).

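
The dual-key write can be sketched roughly as below. This is only
an illustration: `MktStub`, the plain `dict` cache, the function
signature and the example key values are all hypothetical
stand-ins, not the real `Client` internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MktStub:
    # hypothetical stand-in for the real market-info type
    bs_fqme: str
    fqme: str


def cache_contract(
    cache: dict[str, object],
    mkt: MktStub,
    contract: object,
) -> None:
    # write under BOTH keys since different consumers read each
    cache[mkt.bs_fqme] = contract
    cache[mkt.fqme] = contract


contracts: dict[str, object] = {}
# illustrative key values only
mkt = MktStub(bs_fqme='gld.arca', fqme='gld.arca.ib')
cache_contract(contracts, mkt, contract='<qualified contract>')
```

A lookup by either key now returns the same object, so a consumer
reading only one of the two keys never sees an empty cache.
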
## Search returns results for the WRONG pattern

- **Symptom**: fqme search for 'gld' returns nvda
  results; next query returns the prior query's set.
- **Cause**: `MethodProxy` channel off-by-one: a
  caller cancelled (eg by the search `move_on_after`
  timeout) after sending its request orphans its
  response; every later caller then consumes the
  previous caller's resp.
- **Fix(ed)**: `mid` req-id correlation in
  `_run_method()` + relay; stale resps are dropped w/
  a "Dropping stale method-resp" warning. If that
  warning spams, some caller is being cancelled
  mid-call habitually; find + fix its timeout.

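
A minimal, synchronous sketch of request-id correlation, assuming
a single shared response queue. The `deque` channel and the
`request()`/`receive()` helpers are hypothetical; the real
`_run_method()` + relay are async and log a warning where this
sketch appends to `dropped`.

```python
from collections import deque
from itertools import count

next_mid = count()
# stand-in for the response side of the shared channel
resp_chan: deque[dict] = deque()
dropped: list[int] = []


def request(pattern: str) -> int:
    # "send" a request; the far end queues a tagged response
    mid = next(next_mid)
    resp_chan.append(
        {'mid': mid, 'result': f'matches for {pattern!r}'}
    )
    return mid


def receive(mid: int) -> str:
    # consume responses until the one matching OUR request id
    while resp_chan:
        resp = resp_chan.popleft()
        if resp['mid'] == mid:
            return resp['result']
        dropped.append(resp['mid'])  # stale: drop, do not return
    raise LookupError(f'no response for mid={mid}')


request('nvda')  # caller cancelled before reading: orphaned resp
mid = request('gld')
result = receive(mid)
```

Without the `mid` check `receive()` would hand the orphaned 'nvda'
response to the 'gld' caller, which is the off-by-one described
above.
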
## One bad request crashes a whole dialog/actor

- **Symptom**: `TrioTaskExited` storm + nursery
  teardown after a single method error (eg ambiguous
  contract `AttributeError`).
- **Cause**: exception escaped the aio-side relay
  loop (`open_aio_client_method_relay()`) killing
  channel + proxy nursery; caller-side `try/except`
  CANNOT catch it.
- **Fix(ed)**: relay catches `Exception` -> ships
  `{'exception': err, 'mid': ...}` resp; order
  handler converts to EMS `BrokerdError` msgs.

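
The catch-and-ship pattern looks roughly like this. It is a sketch
using plain `asyncio` queues as a stand-in for the real channel;
`StubClient`, `relay()` and the `None` shutdown sentinel are
hypothetical, only the `{'exception': ..., 'mid': ...}` resp shape
comes from the notes above.

```python
import asyncio


class StubClient:
    # hypothetical stand-in for the wrapped aio client
    async def ok(self) -> str:
        return 'fine'

    async def bad(self) -> str:
        raise AttributeError('ambiguous contract')


async def relay(client, requests, responses) -> None:
    while True:
        req = await requests.get()
        if req is None:  # sentinel: shut the relay down
            return
        try:
            result = await getattr(client, req['method'])()
            resp = {'result': result, 'mid': req['mid']}
        except Exception as err:
            # ship the error back instead of killing the loop
            resp = {'exception': err, 'mid': req['mid']}
        await responses.put(resp)


async def main() -> list[dict]:
    requests, responses = asyncio.Queue(), asyncio.Queue()
    for mid, method in enumerate(['bad', 'ok']):
        requests.put_nowait({'method': method, 'mid': mid})
    requests.put_nowait(None)
    await relay(StubClient(), requests, responses)
    return [responses.get_nowait() for _ in range(2)]


resps = asyncio.run(main())
```

The failing `bad` call yields an error resp and the following `ok`
call is still served, so one bad request no longer takes the relay
down with it.
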
## Ambiguous ib contracts -> `NoneType` attr errors

- **Symptom**: `'NoneType' object has no attribute
  'primaryExchange'` in `find_contracts()`.
- **Cause**: `qualifyContractsAsync()` returns `None`
  entries for ambiguous requests (eg a venue-less
  stonk fqme matching multiple listings: 'gld' ->
  ARCA/USD + VENTURE/CAD).
- **Fix(ed)**: filter `None`s + raise descriptive
  `ValueError` ("use 'gld.arca.ib'").

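
A sketch of the filter-then-raise guard. The helper name, its
`hint` parameter and the exact error wording are hypothetical, and
it assumes the error is raised when nothing survives filtering;
the real `find_contracts()` may differ.

```python
def drop_unqualified(
    fqme: str,
    qualified: list,
    hint: str,
) -> list:
    # drop the `None` entries returned for ambiguous requests
    contracts = [c for c in qualified if c is not None]
    if not contracts:
        # fail early w/ an actionable msg instead of a later
        # `'NoneType' object has no attribute ...` error
        raise ValueError(
            f'{fqme!r} is ambiguous or unknown; use {hint!r}'
        )
    return contracts
```

Called as `drop_unqualified('gld.ib', [None], hint='gld.arca.ib')`
this raises a `ValueError` naming the venue-qualified fqme to use.
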
## Double-printed log records (same task id, 2x)

- **Symptom**: every record from some subsys printed
  twice w/ identical task ids.
- **Cause**: stderr handlers attached at TWO levels
  of one logger-propagation chain (eg daemon fixture
  on `piker.brokers.ib` + an ep calling
  `get_console_log(name=__name__)` on the child).
  tractor's handler-dedup only checks the SAME
  logger, not ancestors.
- **Rule**: console handlers are attached ONCE per
  actor in the `_setup_persistent_*()` fixture; eps
  needing a different level use `log.setLevel()`
  ONLY, never `get_console_log()`.

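
The duplication is reproducible with the stdlib `logging` module
alone. This sketch uses made-up `demo.*` logger names and an
in-memory stream in place of stderr; it does not touch piker's or
tractor's logging helpers.

```python
import io
import logging

stream = io.StringIO()

parent = logging.getLogger('demo.brokers.ib')
child = logging.getLogger('demo.brokers.ib.feed')
parent.setLevel(logging.INFO)
parent.propagate = False  # isolate the demo from the root logger

# handlers attached at TWO levels of one propagation chain
parent.addHandler(logging.StreamHandler(stream))
child.addHandler(logging.StreamHandler(stream))

child.info('tick')
doubled = stream.getvalue().count('tick')

# the rule: one handler on the ancestor, level tweaks only below
child.handlers.clear()
child.setLevel(logging.DEBUG)
stream.seek(0)
stream.truncate()
child.debug('tock')
single = stream.getvalue().count('tock')
```

The first record is written by both handlers; after removing the
child's handler and only adjusting its level, the debug record is
written once via the ancestor's handler.
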
## Bare/non-colorized log lines

- **Symptom**: records w/ no timestamp/actor prefix.
- **Cause**: NO handler anywhere in the emitting
  logger's chain -> stdlib `logging.lastResort`. Post
  actor-splits, a daemon fixture may only cover its
  own subsys subtree (eg datad's `piker.data.*` but
  not the backend's `piker.brokers.<broker>.*`).
- **Fix(ed)**: `_setup_persistent_datad()` enables
  BOTH `piker.data.<broker>` and
  `piker.brokers.<broker>` subtrees.

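
One way to spot an uncovered subtree is the stdlib
`Logger.hasHandlers()`, which walks the propagation chain. The
`demo2.*` logger names below are made up to mirror the
data/brokers split; this is a diagnostic sketch, not part of the
fixture.

```python
import logging

top = logging.getLogger('demo2')
top.propagate = False  # isolate the demo from the root logger

data_log = logging.getLogger('demo2.data.ib')
backend_log = logging.getLogger('demo2.brokers.ib')

# a "fixture" that only covers the data subtree
logging.getLogger('demo2.data').addHandler(logging.NullHandler())

# True: a handler exists somewhere up the chain
covered = data_log.hasHandlers()

# False: no handler in the chain, so emitted records would fall
# back to `logging.lastResort` (bare message, WARNING and above)
uncovered = backend_log.hasHandlers()
```

Checking `hasHandlers()` on each backend logger right after the
fixture runs shows which subtrees would still hit `lastResort`.
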
## 2nd in-proc runtime boot wedges (~50%)

- **Symptom**: test hangs when one test proc boots a
  2nd `pikerd` (eg `test_multi_fill_positions`'s
  persistence re-check); a zombie `*.{broker}` child
  lingers w/ unread bytes in its parent-IPC Recv-Q.
- **Cause**: pre-existing `tractor`-main runtime
  teardown bug (confirmed independent of piker-layer
  changes via revert-testing 2026-06).
- **Mitigation**: run suites per-file wrapped in
  `timeout -k 5 300 ...`; retry once on rc 124/143.
  Do NOT chase as a regression of unrelated changes.

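
The wrap-and-retry mitigation could be scripted along these lines.
`run_suite()` and its `suite_cmd`/`run` parameters are
hypothetical; `suite_cmd` is whatever command runs ONE test file,
which the notes above leave unspecified.

```python
import subprocess

# 124: `timeout` expired; 143: 128 + SIGTERM(15)
RETRY_CODES = {124, 143}


def run_suite(
    suite_cmd: list[str],
    run=subprocess.run,
) -> int:
    # wrap the per-file suite command in the `timeout` guard
    cmd = ['timeout', '-k', '5', '300', *suite_cmd]
    rc = run(cmd).returncode
    if rc in RETRY_CODES:
        rc = run(cmd).returncode  # retry exactly once
    return rc
```

The `run` parameter exists only so the retry logic can be
exercised with a fake runner; any other non-zero return code is
passed through untouched as a real failure.
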
## ib client-id collisions post-split

- **Symptom**: 2nd ib daemon burns the full
  conn-timeout retry cycle connecting to gw/tws.
- **Cause**: `datad.ib` + `brokerd.ib` both default
  `client_id=6116` w/ linear `+i` retries.
- **Fix(ed)**: role-based offsets in
  `load_aio_clients()`: datad +16, ad-hoc (test/cli)
  actors +32.

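
The offset arithmetic, as a sketch: the base id and the +16/+32
offsets are from the notes above, while the role names, a +0
offset for brokerd and the retry count of 3 are assumptions made
for illustration.

```python
BASE_CLIENT_ID = 6116
# brokerd at +0 is an assumption; +16/+32 are from the notes
ROLE_OFFSETS = {'brokerd': 0, 'datad': 16, 'adhoc': 32}


def candidate_ids(role: str, retries: int = 3) -> list[int]:
    # linear `+i` retries starting from the role's own offset
    start = BASE_CLIENT_ID + ROLE_OFFSETS[role]
    return [start + i for i in range(retries)]


brokerd_ids = candidate_ids('brokerd')
datad_ids = candidate_ids('datad')
adhoc_ids = candidate_ids('adhoc')
```

With a shared base and no offsets both daemons would walk the same
id sequence; with the offsets the ranges stay disjoint as long as
the retry count is below 16.
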
## `async_lifo_cache` skipped side-effects

- **Symptom**: a fn's cache-write side effect
  (eg `get_mkt_info()` -> `_contracts`) missing for
  a 2nd client/proxy.
- **Cause**: cache keys on POSITIONAL args only; a
  hit skips the body entirely.
- **Rule**: never rely on cached-fn side effects;
  perform required writes explicitly at the call
  site (eg `cache_contract()` after `get_mkt_info`).

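
The skipped side effect can be reproduced with a toy decorator.
`positional_cache` is a hypothetical stand-in that only mimics the
positional-args keying (no eviction); `get_info()` and its
`contracts` kwarg are likewise made up for the demo.

```python
import asyncio
import functools


def positional_cache(fn):
    # keys on positional args ONLY; kwargs are ignored
    cache: dict[tuple, object] = {}

    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        if args in cache:
            return cache[args]  # hit: the body never runs
        result = cache[args] = await fn(*args, **kwargs)
        return result

    return wrapper


@positional_cache
async def get_info(fqme: str, *, contracts: dict) -> str:
    info = f'info:{fqme}'
    contracts[fqme] = info  # side effect a cache hit will skip
    return info


async def main() -> tuple[dict, dict, dict]:
    first, second = {}, {}
    await get_info('gld.arca.ib', contracts=first)
    # 2nd client: cache hit, so `second` is NOT written
    info = await get_info('gld.arca.ib', contracts=second)
    skipped = dict(second)
    # the rule: do the required write explicitly at the call site
    second['gld.arca.ib'] = info
    return first, skipped, second


first, skipped, second = asyncio.run(main())
```

`skipped` is a snapshot of the 2nd client's cache right after the
hit (still empty); only the explicit call-site write brings it in
line with the first client's.
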