The Question
ML Design: Real-Time Financial Signal Generation System
Design a high-scale machine learning system to predict short-term equity price movements (e.g., 1-5 minute horizon) for a universe of 5,000 stocks. The system must process real-time market data feeds, perform feature engineering with minimal latency, and provide directional signals with a P99 latency of less than 50ms. Address specific challenges such as non-stationary data, prevention of look-ahead bias in the data pipeline, backtesting strategies, and the design of a robust online/offline feature store to maintain consistency.
LightGBM
Apache Flink
Kafka
Redis
Optuna
Ray
Parquet
Questions & Insights
Clarifying Questions
Clarifying Questions & Constraints:
Business Goal: Is the goal to predict the exact price (Regression) or the direction of movement (Classification)? Assumption: We aim to predict the probability of a positive return over a specific horizon (e.g., next 5 minutes) to support automated trading.
Constraints & Scale: What is the universe of stocks? Assumption: 5,000 liquid equities (e.g., US markets) with a prediction frequency of 1 minute.
Latency Budget: What is the end-to-end latency requirement? Assumption: P99 < 50ms for feature engineering + inference.
Data Freshness: How quickly must new market data be incorporated? Assumption: Near real-time (seconds) via websocket feeds.
Assumptions:
We are designing for a mid-frequency trading signal (minutes, not microseconds).
We assume access to L1 market data (Price, Volume) and basic fundamental data.
We assume a P99 latency of 50ms is acceptable for the serving layer.
Thinking Process
Identify the Core Difficulty: Financial data is non-stationary and has a low signal-to-noise ratio. The primary challenge isn't just the model, but feature engineering and avoiding data leakage (look-ahead bias).
Retrieval vs. Ranking: Unlike RecSys, this is a pure Time-Series Forecasting / Classification problem. We need a pipeline that handles sequential dependencies.
Scaling Strategy: Horizontal scaling for inference (per stock) and a high-throughput stream processing engine (Flink/Kafka) for real-time feature computation.
Simplification (YAGNI): Start with a Gradient Boosted Decision Tree (LightGBM) using technical indicators. Deep learning (LSTMs/Transformers) adds complexity that may not beat a well-featured tree model in the MVP phase.
Elite Bonus Points
Stationarity & Fractional Differentiation: Instead of simple differencing (P_t - P_{t-1}) which loses memory, use Fractional Differentiation to make the series stationary while preserving long-term memory.
Triple Barrier Method: Move beyond fixed-horizon labeling. Use the Triple Barrier Method (Profit Take, Stop Loss, Time Out) to label data more realistically for trading environments.
Lead-Lag Relationships: Incorporate features from correlated assets (e.g., if SPY moves, Apple often follows) to capture cross-asset momentum.
Adversarial Validation: Check if the training and test distributions are significantly different to detect "regime shifts" in the market before they degrade model performance.
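To make the fractional differentiation point above concrete, here is a minimal sketch in plain Python. The function names, the fixed window length, and the example value d = 0.5 are illustrative assumptions, not part of the original design. The weights follow the binomial expansion of (1 - B)^d, where B is the lag operator.

```python
def frac_diff_weights(d, size):
    """Weights of the binomial expansion of (1 - B)^d, truncated to `size` terms."""
    weights = [1.0]
    for k in range(1, size):
        # Recurrence: w_k = -w_{k-1} * (d - k + 1) / k
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, window):
    """Fixed-window fractional difference of a price (or log-price) series.

    Output element j corresponds to input index j + window - 1, because the
    first window - 1 points lack enough history.
    """
    weights = frac_diff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        # weights[0] applies to the current value, weights[k] to the value k steps back.
        out.append(sum(w * series[t - k] for k, w in enumerate(weights)))
    return out


if __name__ == "__main__":
    prices = [100.0, 100.5, 100.2, 100.9, 101.4, 101.1]
    print(frac_diff(prices, d=0.5, window=4))
```

With d = 1 the weights collapse to [1, -1, 0, ...], which is ordinary differencing. A fractional d between 0 and 1 keeps small, slowly decaying weights on older prices, which is where the retained memory comes from.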
Design Breakdown
Requirements
Product Goal: Generate a signal (probability of price increase) to inform buy/sell decisions.
Success Metrics:
Online Metrics: Sharpe Ratio, Maximum Drawdown, Precision (Directional Accuracy).
Offline Metrics: Log-Loss, F1-Score, PR-AUC.
Guardrail Metrics: Inference Latency (P99), Training-Serving Feature Drift.
System Constraints: 5k stocks, 1-minute updates, 50ms latency, 99.9% availability.
Data Availability: Historical OHLCV (Open, High, Low, Close, Volume), order book snapshots, and macroeconomic calendars.
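The two portfolio-level online metrics named under Success Metrics can be pinned down with a short sketch. This is a plain-Python illustration under stated assumptions: the risk-free rate is taken as zero, the default of 252 periods per year fits daily returns only, and the function names are hypothetical.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)  # sample standard deviation
    return mean / stdev * math.sqrt(periods_per_year)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, expressed as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


if __name__ == "__main__":
    daily_returns = [0.01, -0.01, 0.02, 0.0]
    equity = [100.0, 120.0, 90.0, 110.0, 80.0]
    print(sharpe_ratio(daily_returns), max_drawdown(equity))
```

For signals evaluated on 1-minute bars, periods_per_year would need to reflect the number of bars per trading year rather than 252.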
ML Problem Framing
ML Task Type: Binary Classification (or Multi-class: Up, Down, Neutral).
Prediction Target: $P(\text{Return}_{t+h} > \text{threshold} \mid \text{MarketState}_t)$.
Inputs:
User Features: N/A (usually anonymous/market-wide).
Item Features (Stock): Ticker metadata, Sector, Market Cap.
Context Features: Time of day (market open/close volatility), VIX index (volatility context), interest rates.
Sequential Features: RSI (Relative Strength Index), MACD, Moving Averages, Order Book Imbalance.
Outputs: Probability score [0, 1].
ML Challenges: Data Leakage (using future info in training), Non-stationarity (regime changes), and Overfitting to noise.
Design Summary & MVP
Concise Summary: A real-time streaming pipeline that extracts technical indicators from market feeds, serves predictions via a LightGBM model, and logs results for offline backtesting.
Model Architecture & Selection:
Baseline Model: Logistic Regression on 5-minute returns.
Target Model: LightGBM (Gradient Boosted Trees).
Choice Rationale: GBDTs handle non-linear relationships and tabular features (technical indicators) well, and on noisy, low-signal financial data they are often more robust to outliers than deep learning models while being cheaper to train and serve.
ML Life Cycle Summary: Market data flows into Flink for real-time windowing -> Features stored in Redis (Online) and S3 (Offline) -> LightGBM trained on historical windows -> Served via high-speed API.
Simplicity Audit: Avoids complex LSTMs/Transformers initially; uses well-understood technical indicators and a high-performance tree model.
Architecture Decision Rationale:
Why?: Financial signals are weak; features (RSI, etc.) matter more than model depth. GBDTs provide the best ROI on complexity.
Requirement Satisfaction: Designed to meet the latency target (P99 < 50ms) and scale (5k stocks) via distributed inference.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: L1/L2 Market feeds (Polygon.io, Alpaca), fundamental data (SEC filings), and alternative data (News/Twitter).
Data Ingestion: Kafka serves as the backbone. It handles high-throughput tick data with at-least-once semantics. We partition by Ticker ID to ensure sequential ordering per stock.
Data Storage:
Data Lake (S3): Stores raw ticks in Parquet format for cost-effective historical backtesting.
Data Warehouse (Snowflake/BigQuery): For structured fundamental data and metadata.
Data Processing: Apache Flink performs stateful stream processing. It calculates rolling windows (e.g., 5-min, 15-min moving averages) and handles late-arriving data via watermarking.
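The windowing and watermarking behaviour described above can be illustrated without Flink. The sketch below is plain Python, not the Flink API: the class name RollingMean, the 300-second window, and the 5-second allowed lateness are hypothetical choices for illustration. It keeps one ticker's events in event-time order, drops events that arrive behind the watermark, and evicts events that have fallen out of the window.

```python
import bisect


class RollingMean:
    """Event-time rolling mean for a single ticker with a simple watermark."""

    def __init__(self, window_s, allowed_lateness_s):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.events = []  # sorted list of (event_time, price)
        self.max_event_time = float("-inf")

    def update(self, event_time, price):
        """Add one tick; return the rolling mean, or None if the tick was too late."""
        watermark = self.max_event_time - self.allowed_lateness_s
        if event_time < watermark:
            return None  # behind the watermark: dropped
        # Out-of-order ticks within the allowed lateness are inserted in order.
        bisect.insort(self.events, (event_time, price))
        self.max_event_time = max(self.max_event_time, event_time)
        # Keep only events in the window (max_event_time - window_s, max_event_time].
        cutoff = self.max_event_time - self.window_s
        first_kept = bisect.bisect_right(self.events, (cutoff, float("inf")))
        del self.events[:first_kept]
        return sum(p for _, p in self.events) / len(self.events)


if __name__ == "__main__":
    window = RollingMean(window_s=300, allowed_lateness_s=5)
    for ts, px in [(0, 10.0), (60, 20.0), (58, 30.0), (10, 99.0), (301, 40.0)]:
        print(ts, window.update(ts, px))
```

In a real deployment one such state object would exist per ticker key, which is why partitioning the Kafka topic by Ticker ID matters: all ticks for a stock reach the same operator instance.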
Feature Pipeline
Feature Definition:
Momentum: RSI, Rate of Change.
Volatility: Bollinger Bands, ATR.
Volume: On-Balance Volume (OBV), VWAP.
Microstructure: Bid-Ask Spread, Order Book Imbalance.
Online vs. Offline: Flink computes features in real-time for the Online Feature Store (Redis). The same logic is applied to historical data for the Offline Feature Store (S3) to prevent Training-Serving skew.
Point-in-Time Joins: Crucial to prevent leakage. When creating the training set, we ensure we only use features available before the prediction timestamp.
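A point-in-time join can be sketched as an "as-of" lookup: for each prediction timestamp, take the most recent feature row that became available strictly before it. The example below is a plain-Python sketch; the function name, the tuple layout, and the integer timestamps are assumptions for illustration, and a production pipeline would typically use the feature store's or dataframe library's as-of join instead.

```python
import bisect


def point_in_time_join(prediction_times, feature_rows):
    """Attach to each prediction time the latest features available strictly before it.

    feature_rows: list of (available_at, features) sorted by available_at.
    Returns a list of (prediction_time, features or None).
    """
    available_times = [t for t, _ in feature_rows]
    joined = []
    for ts in prediction_times:
        # bisect_left gives the first row with available_at >= ts,
        # so the row just before it is the last one with available_at < ts.
        idx = bisect.bisect_left(available_times, ts) - 1
        joined.append((ts, feature_rows[idx][1] if idx >= 0 else None))
    return joined


if __name__ == "__main__":
    features = [(10, {"rsi": 41.0}), (20, {"rsi": 55.0}), (30, {"rsi": 62.0})]
    print(point_in_time_join([5, 20, 25], features))
```

Note that a prediction at timestamp 20 receives the features from timestamp 10, not 20: a feature stamped at the prediction time itself is treated as not yet available, which is the conservative choice against look-ahead bias.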
Model Architecture
Problem Formulation: Binary classification (y = 1 if the return exceeds 0.1% over the next 5 minutes).
Architecture: LightGBM.
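Inputs: Dense vector of technical indicators plus ticker and sector identifiers, passed as categorical features (or as precomputed entity embeddings for tickers, if those are trained separately).
Model: An ensemble of depth-limited trees (max depth 6-10) to limit overfitting.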
Rationale:
Speed: LightGBM's histogram-based, leaf-wise tree growth trains quickly on large tabular datasets and is often faster than XGBoost, although the gap depends on data and configuration.
Categorical Support: Built-in handling for Ticker IDs and Sector IDs.
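Optimization: Consider quantizing features (e.g., Int8) to reduce memory footprint. Whether this also speeds up inference should be benchmarked, since LightGBM already bins feature values internally during training.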
Training Pipeline
Dataset Construction: Use the Triple Barrier Method. A label is 1 if the price hits the upper barrier first, and 0 if it hits the lower barrier or the time limit first.
Data Splitting: Time-series Cross-Validation (Walk-Forward Evaluation). Do not use random splits; always train on past, test on future.
Infrastructure: Distributed training using Ray to parallelize hyperparameter sweeps (Optuna).
Retraining: Scheduled weekly retraining to adapt to new market regimes, plus an automatic early retrain if the "Adversarial Validation" score between last week's and this week's data exceeds a threshold.
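The dataset construction and splitting steps above can be sketched in plain Python. This is an illustration under assumptions, not a reference implementation: prices are taken as evenly spaced bars, barriers are symmetric-style fractional moves from the entry price, and the function names and parameters are hypothetical.

```python
def triple_barrier_label(prices, start, profit_take, stop_loss, max_holding):
    """Return 1 if the upper barrier is touched first, otherwise 0.

    profit_take / stop_loss: positive fractional moves (0.002 means 0.2%).
    max_holding: number of bars until the time barrier.
    """
    entry = prices[start]
    upper = entry * (1 + profit_take)
    lower = entry * (1 - stop_loss)
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        if prices[t] >= upper:
            return 1  # profit-take barrier hit first
        if prices[t] <= lower:
            return 0  # stop-loss barrier hit first
    return 0  # time barrier reached without touching either price barrier


def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    test_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        test_start = min_train + fold * test_size
        # Expanding window: train on everything before the test block.
        yield range(0, test_start), range(test_start, test_start + test_size)


if __name__ == "__main__":
    prices = [100.0, 100.1, 100.3, 99.0]
    print(triple_barrier_label(prices, 0, 0.002, 0.002, 3))
    for train, test in walk_forward_splits(10, 3, 4):
        print(list(train), list(test))
```

One caveat: a label computed near the end of a training block looks up to max_holding bars ahead, so its window can reach into the test block. That overlap is what the purging mentioned under Offline Evaluation is meant to remove.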
Serving Pipeline
Serving Pattern: Request-Response via gRPC for lowest latency.
Architecture: Containerized LightGBM models on K8s.
Latency Optimization:
Pre-fetch features from Redis in parallel with the request.
Use C++ prediction kernels for LightGBM.
Reliability: If the model service fails, fall back to a "Last Value" heuristic (momentum persistence).
Evaluation Pipeline
Offline Evaluation: Use Purged K-Fold Cross Validation to drop training samples whose label windows overlap the test fold. Primary metrics: Information Coefficient (IC) and Precision at K.
Online Evaluation: Paper Trading (Shadow Mode). Run the model on live data without executing trades to verify Sharpe Ratio and Drawdown in real-world conditions before committing capital.
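The purging step in the offline evaluation can be sketched as follows. The assumptions are stated in the code: each sample's label is taken to depend on the next label_horizon bars, test folds are contiguous blocks, and the optional embargo (an extra gap after each test fold) is an added illustration rather than something the design above specifies.

```python
def purged_kfold(n_samples, n_folds, label_horizon, embargo=0):
    """Yield (train_indices, test_indices) with overlapping samples purged.

    Assumes the label of sample i depends on observations i .. i + label_horizon.
    """
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        test_start = fold * fold_size
        # The last fold absorbs any remainder.
        test_end = n_samples if fold == n_folds - 1 else test_start + fold_size
        train = [
            i
            for i in range(n_samples)
            # Before the test fold: the label window must end before the fold starts.
            if i + label_horizon < test_start
            # After the test fold: start only once the test labels (and embargo) are over.
            or i >= test_end + label_horizon + embargo
        ]
        yield train, list(range(test_start, test_end))


if __name__ == "__main__":
    for train, test in purged_kfold(n_samples=20, n_folds=4, label_horizon=2, embargo=1):
        print(test[0], test[-1], train)
```

Compared with an ordinary K-fold split, each fold loses a few training samples on either side of the test block; that is the price of keeping label windows from straddling the boundary.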
Monitoring Pipeline
System Monitoring: Track feature_retrieval_latency and model_inference_latency.
Data Monitoring: Track Feature Drift (PSI). Financial features drift rapidly (e.g., volume spikes).
Model Monitoring: Monitor the Prediction Mean. If the model suddenly starts predicting "Buy" for 100% of stocks, a circuit breaker should trigger.
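The feature-drift check above relies on the Population Stability Index (PSI), which compares the binned distribution of a feature in a reference sample against a recent sample. Below is a minimal plain-Python sketch; the equal-width binning, the bin count of 10, and the small floor eps used to avoid log(0) are assumptions chosen for illustration.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (`expected`) and a recent sample (`actual`)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference sample

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


if __name__ == "__main__":
    reference = [float(v) for v in range(100)]
    shifted = [v + 50.0 for v in reference]
    print(population_stability_index(reference, reference))
    print(population_stability_index(reference, shifted))
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but for fast-moving features such as volume the alert threshold is better calibrated on historical data than taken from a rule of thumb.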
Wrap Up
Final Evaluation
Observability: Real-time dashboards for P&L, signal distribution, and feature importance.
Edge Cases:
Flash Crashes: Hard-coded volatility halts in the feature pipeline.
Dividends/Splits: Automated data cleaning to adjust historical prices.
Trade-offs:
Complexity vs. Latency: Adding more features (e.g., NLP sentiment) increases latency; MVP keeps it to structured data.
Recall vs. Precision: In trading, Precision (accuracy of the "Buy" signal) is often prioritized over Recall to minimize losing trades.
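For the Dividends/Splits edge case listed above, the core of the price adjustment is simple enough to sketch. The example handles stock splits only (dividend adjustment is omitted); the function name and the (price, volume) bar layout are hypothetical, and real pipelines usually take adjustment factors from the data vendor.

```python
def back_adjust_for_split(bars, split_index, ratio):
    """Rescale pre-split bars so the series is continuous across a split.

    bars: list of (price, volume); bars before split_index are pre-split.
    ratio: new shares per old share (2.0 for a 2-for-1 split).
    """
    adjusted = []
    for i, (price, volume) in enumerate(bars):
        if i < split_index:
            # Pre-split prices shrink and volumes grow by the split ratio.
            adjusted.append((price / ratio, volume * ratio))
        else:
            adjusted.append((price, volume))
    return adjusted


if __name__ == "__main__":
    bars = [(200.0, 1000), (210.0, 1200), (105.0, 2500)]
    print(back_adjust_for_split(bars, split_index=2, ratio=2.0))
```

Without this adjustment, the drop from 210 to 105 would look like a 50% crash to every momentum and volatility feature in the pipeline.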