The Question
ML DesignReal-time Financial Forecasting System
Design a large-scale machine learning system to provide real-time, high-frequency price movement signals for a global trading platform. The system should handle millions of active users, ingest diverse data sources like market ticks and news sentiment, and maintain extreme low latency while addressing the unique challenges of financial time-series data such as non-stationarity and backtesting integrity.
LSTM
Time-Series Transformer
XGBoost
ARIMA
Reinforcement Learning
Questions & Insights
Clarifying Questions
Business Goal: Is the goal to provide real-time signals for retail traders (high QPS) or to drive an automated high-frequency execution engine (ultra-low latency)? Assumption: A real-time signal generation platform for a high-scale retail brokerage.
Constraints & Scale:
Users: 10M DAU.
Assets: 100,000 symbols (Stocks, ETFs, Crypto).
Latency: Signal generation (Inference) < 20ms P99.
Throughput: Support 500k events/sec (market ticks).
Edge Cases: How to handle "Flash Crashes," halted symbols, and IPOs (cold start)? How do we deal with non-stationarity (market regimes changing)?
Assumptions: I assume we are predicting the Log-Return over a 5-minute horizon (regression) rather than the raw price, as returns are more stationary. I also assume we have access to Level-2 Order Book data and a news sentiment stream.
Thinking Process
Identify the Core Challenge: Financial data has a very low signal-to-noise ratio and is non-stationary. The system must prioritize feature freshness and robust backtesting over complex model depth.
Retrieval vs. Ranking: Unlike RecSys, this is a Time-Series Forecasting problem. The "retrieval" phase is essentially selecting which tickers a user is watching, and "ranking" is replaced by signal confidence.
Scaling Strategy: Use a Lambda architecture or a Kappa architecture for features. We need a high-performance Stream Processing engine (Flink) to calculate technical indicators (RSI, Moving Averages) in real-time.
Validation: Traditional K-fold cross-validation is illegal here due to temporal leakage. We must use Sliding Window Time-Series Validation.
Elite Bonus Points
Handling Non-Stationarity: Implement Online Learning or frequent "Champion-Challenger" model updates to adapt to market regime shifts (e.g., Bull to Bear).
Fractional Differencing: Use fractional calculus to make features stationary while preserving "memory" of the price series (better than simple integer differencing).
Adversarial Training: Use a GAN or targeted noise injection to make the model robust against market manipulation or sudden volatility spikes.
Feature Neutralization: Linearize the relationship between features and known risk factors (Market Cap, Sector) to ensure the model finds "Alpha" rather than just betting on the Beta of the market.
Design Breakdown
Functional Reqs
Real-time Prediction: Users receive a price movement signal (Up/Down/Neutral) and a confidence score for any symbol in their watchlist.
Alerting: Trigger notifications when the predicted volatility or price movement exceeds a threshold.
Explainability: Provide a brief "Why" (e.g., "High Order Book Imbalance" or "Positive News Sentiment").
Non-Functional Reqs
Latency: End-to-end (Market Event → Prediction) must be < 50ms.
Availability: 99.99% (Financial markets require high uptime during trading hours).
Freshness: Features must reflect data from the last 100ms.
Consistency: Training data (offline) and Serving data (online) must be identical to avoid "Look-ahead bias."
ML Problem Framing
ML Objective: Predict the k-step forward log-return: r_{t+k} = \ln(P_{t+k} / P_t).
ML Category: Regression (for return) or Multi-class Classification (Up, Down, Flat).
Input: Historical price ticks, Order Book depth, News sentiment, Macro-economic indicators.
Output: \hat{y} \in \mathbb{R} (Predicted return) and \sigma (Confidence/Volatility).
Label: The realized return 5 minutes after the prediction, adjusted for transaction costs (spread).
Data Prep & Features
Data Pipeline:
Market Data: UDP/Multicast feeds or WebSockets (Protobuf/Binary).
Alternative Data: Scraped news, Social media (Twitter/X), SEC filings.
Feature Engineering:
Technical: RSI, MACD, Bollinger Bands, VWAP (Volume Weighted Average Price).
Microstructure: Bid-Ask Spread, Order Book Imbalance (Symmetry of buy/sell orders).
Sequence Features: 1-minute, 5-minute, and 1-hour price windows.
Embeddings: Symbol embeddings (learned via Word2Vec style skip-grams based on co-movement).
Feature Store: Use an online-offline synchronized store (e.g., Tecton or Feast) to ensure the feature
moving_avg_5m is calculated the same way in training and production.Model Architecture
Baseline: Gradient Boosted Decision Trees (XGBoost/LightGBM) are excellent for tabular financial data due to their ability to handle non-linearities and missing values.
Advanced Model: Temporal Fusion Transformer (TFT).
Why?: It handles multi-horizon forecasting, uses Gated Residual Networks to skip noisy inputs, and includes attention mechanisms to focus on specific historical events (e.g., previous day's close).
Loss Function: Huber Loss (robust to outliers/outlier price spikes) or Sharpe-Ratio Maximization (Directly optimizing for risk-adjusted return).
Training & Serving
Optimization: Use a rolling window. Train on months 1-6, validate on month 7, test on month 8. Shift window by 1 month and repeat.
Serving: Export model to ONNX or TensorRT for low-latency C++ inference.
Position Bias/Feedback Loop: In trading, our own predictions might influence our trades, which influences the market. We must model the Market Impact to avoid "over-fitting to our own influence."
System Architecture
Pipeline Deep Dive
Data Pipeline
Ingestion: We use Kafka as the backbone. Market ticks (Trade/Quote) are high-velocity. We use a "Schema Registry" to ensure the Protobuf messages are versioned.
Storage: Raw data is stored in Parquet/Iceberg on S3, partitioned by
Date and Symbol for efficient time-series retrieval.Feature Pipeline
Streaming Features: Flink maintains a 24-hour state window to calculate technical indicators. For news, we use a small LLM (e.g., DistilBERT) to convert text into sentiment scores within the stream.
Training/Serving Skew: We use a "point-in-time" join in the Feature Store. This ensures that when we train on data from 10:00 AM yesterday, we only use features available at 09:59 AM.
Training Pipeline
Distributed Training: Given the volume of global market data, we use Horovod or PyTorch Distributed Data Parallel (DDP).
Objective Strategy: Instead of just predicting price, we train multi-task models: Task A (5m return), Task B (next 1h volatility). This regularizes the model and improves generalizability.
Serving Pipeline
Retrieval: When a user opens their app, the "Retrieval" step identifies symbols in their portfolio and watchlists.
Inference: The Prediction Service pulls pre-computed technical features from Redis and runs the TFT model.
Calibration: We use Platt Scaling or Isotonic Regression to ensure that a "70% confidence" prediction actually happens 70% of the time.
Evaluation Pipeline
Backtesting: This is the "Offline Evaluation." We simulate trades based on model signals. We track Information Ratio and Maximum Drawdown, not just RMSE.
Interleaved Testing: Instead of a 50/50 A/B split (which is hard in time-series), we use interleaved signals to see which model's ranking of "top movers" is more accurate.
Monitoring Pipeline
Feature Drift: We monitor the distribution of inputs (e.g., if market volatility doubles, our features shift). We use the Kolmogorov-Smirnov test to detect this.
Model Staleness: If the correlation between \hat{y} and y drops below a threshold, the system triggers an automated retraining job on the latest data.
Wrap Up
Advanced Topics
Offline Metrics: Spearman’s Rank Correlation (IC - Information Coefficient), Root Mean Squared Error (RMSE), and Precision at top K (for movement direction).
Online Metrics (North Star): Realized Alpha (excess return over benchmark) and User Engagement (do users trade more based on these signals?).
Failure Modes & Risk Mitigation:
Flash Crashes: Hard-coded circuit breakers to stop predictions if volatility exceeds 10 standard deviations.
Selection Bias: We only have data for symbols that survived. We must include delisted stocks in the training set to avoid Survivorship Bias.
Scalability: The system scales horizontally. Flink partitions by
SymbolID, and the Inference Service is a stateless k8s deployment. A 10x traffic spike is handled by auto-scaling the inference pods and increasing Kafka partitions.