# Timeseries Optimization: NumPy & Polars
Skill for high-performance timeseries processing using NumPy
and Polars, with a focus on patterns common in financial/trading
applications.
## Core Principle: Vectorization Over Iteration
**Never write Python loops over large arrays.**
Always look for vectorized alternatives.
```python
# BAD: Python loop (slow!)
results = []
for i in range(len(array)):
    if array['time'][i] == target_time:
        results.append(array[i])
# GOOD: vectorized boolean indexing (fast!)
results = array[array['time'] == target_time]
```
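To verify the gap on your own data, a quick `timeit` comparison on synthetic data (an illustrative sketch; the array size, dtype, and names here are made up):
```python
import timeit
import numpy as np

# synthetic 100k-row structured array with a sorted 'time' field
dtype = [('time', 'f8'), ('close', 'f8')]
array = np.zeros(100_000, dtype=dtype)
array['time'] = np.arange(100_000, dtype='f8')
target_time = 50_000.0

loop = lambda: [array[i] for i in range(len(array))
                if array['time'][i] == target_time]
vectorized = lambda: array[array['time'] == target_time]

print('loop:      ', timeit.timeit(loop, number=10))
print('vectorized:', timeit.timeit(vectorized, number=10))
```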
## NumPy Structured Arrays
Piker uses structured arrays for OHLCV data:
```python
import numpy as np

# typical piker array dtype
dtype = [
    ('index', 'i8'),   # absolute sequence index
    ('time', 'f8'),    # unix epoch timestamp
    ('open', 'f8'),
    ('high', 'f8'),
    ('low', 'f8'),
    ('close', 'f8'),
    ('volume', 'f8'),
]
arr = np.array(
    [(0, 1234.0, 100, 101, 99, 100.5, 1000)],
    dtype=dtype,
)
# field access
times = arr['time'] # returns view, not copy
closes = arr['close']
```
### Structured Array Performance Gotchas
**1. Field access in loops is slow**
```python
# BAD: repeated struct field access per iteration
for i, row in enumerate(arr):
    x = row['index']  # struct access per iteration!
    y = row['close']
    process(x, y)

# GOOD: extract fields once, iterate plain arrays
indices = arr['index']  # extract once
closes = arr['close']
for i in range(len(arr)):
    x = indices[i]  # plain array indexing
    y = closes[i]
    process(x, y)
```
**2. Dict comprehensions with struct arrays**
```python
# SLOW: field access per row in Python loop
time_to_row = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows  # struct field access!
}

# FAST: extract to plain arrays first
times = matched_rows['time'].astype(float)
indices = matched_rows['index'].astype(float)
closes = matched_rows['close'].astype(float)
time_to_row = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(times, indices, closes)
}
```
## Timestamp Lookup Patterns
### Linear Scan (O(n)) - Avoid!
```python
# BAD: O(n) scan through entire array
for target_ts in timestamps:  # m iterations
    matches = array[array['time'] == target_ts]  # O(n) scan
# Total: O(m * n) - catastrophic for large datasets!
```
**Performance:**
- 1000 lookups × 10k array = 10M comparisons
- Timing: ~50-100ms for 1k lookups
### Binary Search (O(log n)) - Good!
```python
# GOOD: O(m log n) using searchsorted
import numpy as np
time_arr = array['time'] # extract once
ts_array = np.array(timestamps)
# binary search for all timestamps at once
indices = np.searchsorted(time_arr, ts_array)
# bounds check and exact match verification
# (clip first so out-of-range insertion points can't index past the end)
safe = np.clip(indices, 0, len(time_arr) - 1)
valid_mask = (
    (indices < len(time_arr))
    & (time_arr[safe] == ts_array)
)
valid_indices = indices[valid_mask]
matched_rows = array[valid_indices]
```
**Requirements for `searchsorted()`:**
- Input array MUST be sorted (ascending by default)
- Works on any sortable dtype (floats, ints, etc)
- Returns insertion indices (not found = len(array); see the sketch below)
**Performance:**
- 1000 lookups × 10k array = ~10k comparisons
- Timing: <1ms for 1k lookups
- **~100-1000x faster than linear scan**
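A quick illustration of the not-found behavior noted above (a minimal sketch with made-up values):
```python
import numpy as np

time_arr = np.array([10.0, 20.0, 30.0])  # must already be sorted ascending
idx = np.searchsorted(time_arr, np.array([20.0, 25.0, 99.0]))
# idx == [1, 2, 3]: exact match at 1, insertion point at 2,
# and 3 == len(time_arr) means "past the end" (no match possible)
```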
### Hash Table (O(1)) - Best for Multiple Lookups!
If you'll do many lookups on same array, build dict once:
```python
# build lookup once (extract the time field first, per the gotcha above)
times = array['time']
time_to_idx = {
    float(t): i
    for i, t in enumerate(times)
}

# O(1) lookups
for target_ts in timestamps:
    idx = time_to_idx.get(target_ts)
    if idx is not None:
        row = array[idx]
```
**When to use:**
- Many repeated lookups on same array
- Array doesn't change between lookups
- Can afford upfront dict building cost
## Vectorized Boolean Operations
### Basic Filtering
```python
# single condition
recent = array[array['time'] > cutoff_time]
# multiple conditions with &, |
filtered = array[
    (array['time'] > start_time)
    & (array['time'] < end_time)
    & (array['volume'] > min_volume)
]
# IMPORTANT: parentheses required around each condition!
# (operator precedence: & binds tighter than >)
```
### Fancy Indexing
```python
# boolean mask
mask = array['close'] > array['open'] # up bars
up_bars = array[mask]
# integer indices
indices = np.array([0, 5, 10, 15])
selected = array[indices]
# combine boolean + fancy indexing
mask = array['volume'] > threshold
high_vol_indices = np.where(mask)[0]
subset = array[high_vol_indices[::2]] # every other
```
## Common Financial Patterns
### Gap Detection
```python
# assume sorted by time
time_diffs = np.diff(array['time'])
expected_step = 60.0 # 1-minute bars
# find gaps larger than expected
gap_mask = time_diffs > (expected_step * 1.5)
gap_indices = np.where(gap_mask)[0]
# get gap start/end times
gap_starts = array['time'][gap_indices]
gap_ends = array['time'][gap_indices + 1]
```
### Rolling Window Operations
```python
# simple moving average (close)
window = 20
sma = np.convolve(
    array['close'],
    np.ones(window) / window,
    mode='valid',
)
# alternatively, use stride tricks for efficiency
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(array['close'], window)
sma = windows.mean(axis=1)
```
### OHLC Resampling (NumPy)
```python
# resample 1m bars to 5m bars
def resample_ohlc(arr, old_step, new_step):
    n_bars = len(arr)
    factor = int(new_step / old_step)

    # truncate to multiple of factor
    n_complete = (n_bars // factor) * factor
    arr = arr[:n_complete]

    # reshape into chunks
    reshaped = arr.reshape(-1, factor)

    # aggregate OHLC
    opens = reshaped[:, 0]['open']
    highs = reshaped['high'].max(axis=1)
    lows = reshaped['low'].min(axis=1)
    closes = reshaped[:, -1]['close']
    volumes = reshaped['volume'].sum(axis=1)

    return np.rec.fromarrays(
        [opens, highs, lows, closes, volumes],
        names=['open', 'high', 'low', 'close', 'volume'],
    )
```
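Usage, assuming 1-minute input bars with step sizes given in seconds:
```python
# 60s bars -> 300s (5-minute) bars
bars_5m = resample_ohlc(arr, old_step=60.0, new_step=300.0)
```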
## Polars Integration
Piker is transitioning to Polars for some operations.
### NumPy ↔ Polars Conversion
```python
import polars as pl
# numpy to polars
df = pl.from_numpy(
    arr,
    schema=['index', 'time', 'open', 'high', 'low', 'close', 'volume'],
)
# polars to numpy (via arrow)
arr = df.to_numpy()
# piker convenience
from piker.tsp import np2pl, pl2np
df = np2pl(arr)
arr = pl2np(df)
```
### Polars Performance Patterns
**Lazy evaluation:**
```python
# build query lazily
lazy_df = (
    df.lazy()
    .filter(pl.col('volume') > 1000)
    .with_columns([
        (pl.col('close') - pl.col('open')).alias('change')
    ])
    .sort('time')
)
# execute once
result = lazy_df.collect()
```
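To inspect what the optimizer will actually run before collecting, recent Polars versions expose a plan printer (note: the method name has shifted across versions):
```python
# show the optimized query plan (recent polars; older versions used
# `describe_optimized_plan()`)
print(lazy_df.explain())
```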
**Groupby aggregations:**
```python
# resample to 5-minute bars
# NOTE: 'time' must be a Datetime (or integer) column sorted ascending;
# newer polars versions rename this method `group_by_dynamic()`
resampled = df.groupby_dynamic(
    index_column='time',
    every='5m',
).agg([
    pl.col('open').first(),
    pl.col('high').max(),
    pl.col('low').min(),
    pl.col('close').last(),
    pl.col('volume').sum(),
])
```
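If `time` is stored as epoch-second floats (as in the NumPy dtype above), it needs casting to a Datetime column and sorting before dynamic grouping; a minimal sketch using `pl.from_epoch` (assumes a recent Polars version):
```python
import polars as pl

# cast epoch-seconds to Datetime and sort so 'time' can serve as the
# dynamic-grouping index column
df = (
    df.with_columns(
        pl.from_epoch(pl.col('time').cast(pl.Int64), time_unit='s')
    )
    .sort('time')
)
```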
### When to Use Polars vs NumPy
**Use Polars when:**
- Complex queries with multiple filters/joins
- Need SQL-like operations (groupby, window functions)
- Working with heterogeneous column types
- Want lazy evaluation optimization
**Use NumPy when:**
- Simple array operations (indexing, slicing)
- Direct memory access needed (e.g., SHM arrays)
- Compatibility with Qt/pyqtgraph (expects NumPy)
- Maximum performance for numerical computation
## Memory Considerations
### Views vs Copies
```python
# VIEW: shares memory (fast, no copy)
times = array['time'] # field access
subset = array[10:20] # slicing
reshaped = array.reshape(-1, 2)
# COPY: new memory allocation
filtered = array[array['time'] > cutoff] # boolean indexing
sorted_arr = np.sort(array) # sorting
casted = array.astype(np.float32) # type conversion
# force copy when needed
explicit_copy = array.copy()
```
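When in doubt, `np.shares_memory` confirms whether an operation returned a view or a copy (a sketch reusing `array` and `cutoff` from the example above):
```python
import numpy as np

times = array['time']                      # field access: view
print(np.shares_memory(times, array))      # True

filtered = array[array['time'] > cutoff]   # boolean indexing: copy
print(np.shares_memory(filtered, array))   # False
```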
### In-Place Operations
```python
# modify in-place (no new allocation)
array['close'] *= 1.01 # scale prices
array['volume'][mask] = 0 # zero out specific rows
# careful: compound operations may create temporaries
array['close'] = array['close'] * 1.01 # creates temp!
array['close'] *= 1.01 # true in-place
```
## Performance Checklist
When optimizing timeseries operations:
- [ ] Is the array sorted? (enables binary search; see the check below)
- [ ] Are you doing repeated lookups? (build hash table)
- [ ] Are struct fields accessed in loops? (extract to plain arrays)
- [ ] Are you using boolean indexing? (vectorized vs loop)
- [ ] Can operations be batched? (minimize round-trips)
- [ ] Is memory being copied unnecessarily? (use views)
- [ ] Are you using the right tool? (NumPy vs Polars)
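The first checklist item can be verified cheaply before relying on `searchsorted` (a minimal sketch assuming the OHLCV array layout from above):
```python
import numpy as np

times = array['time']
is_sorted = bool(np.all(np.diff(times) >= 0))  # monotonically non-decreasing
```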
## Common Bottlenecks and Fixes
### Bottleneck: Timestamp Lookups
```python
# BEFORE: O(n*m) - 100ms for 1k lookups
for ts in timestamps:
    matches = array[array['time'] == ts]
# AFTER: O(m log n) - <1ms for 1k lookups
indices = np.searchsorted(array['time'], timestamps)
```
### Bottleneck: Dict Building from Struct Array
```python
# BEFORE: 100ms for 3k rows
result = {
    float(row['time']): {
        'index': float(row['index']),
        'close': float(row['close']),
    }
    for row in matched_rows
}

# AFTER: <5ms for 3k rows
times = matched_rows['time'].astype(float)
indices = matched_rows['index'].astype(float)
closes = matched_rows['close'].astype(float)
result = {
    t: {'index': idx, 'close': cls}
    for t, idx, cls in zip(times, indices, closes)
}
```
### Bottleneck: Repeated Field Access
```python
# BEFORE: 50ms for 1k iterations
for i, spec in enumerate(specs):
    start_row = array[array['time'] == spec['start_time']][0]
    end_row = array[array['time'] == spec['end_time']][0]
    process(start_row['index'], end_row['close'])

# AFTER: <5ms for 1k iterations
# 1. Build lookup once
time_to_row = {...}  # via searchsorted

# 2. Extract fields to plain arrays beforehand
indices_arr = array['index']
closes_arr = array['close']

# 3. Use lookup + plain array indexing
for spec in specs:
    start_idx = time_to_row[spec['start_time']]['array_idx']
    end_idx = time_to_row[spec['end_time']]['array_idx']
    process(indices_arr[start_idx], closes_arr[end_idx])
```
## References
- NumPy structured arrays: https://numpy.org/doc/stable/user/basics.rec.html
- `np.searchsorted`: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html
- Polars: https://pola-rs.github.io/polars/
- `piker.tsp` - timeseries processing utilities
- `piker.data._formatters` - OHLC array handling
## Skill Maintenance
Update when:
- New vectorization patterns discovered
- Performance bottlenecks identified
- Polars migration patterns emerge
- NumPy best practices evolve
---
*Last updated: 2026-01-31*
*Session: Batch gap annotation optimization*
*Key win: 100ms → 5ms dict building via field extraction*