The Question

Real-Time Financial Signal Generation System

Design a high-scale machine learning system to predict short-term equity price movements (e.g., 1-5 minute horizon) for a universe of 5,000 stocks. The system must process real-time market data feeds, perform feature engineering with minimal latency, and provide directional signals with a P99 latency of less than 50ms. Address specific challenges such as non-stationary data, prevention of look-ahead bias in the data pipeline, backtesting strategies, and the design of a robust online/offline feature store to maintain consistency.

LightGBM

Apache Flink

Kafka

Redis

Optuna

Ray

Parquet

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the goal to predict the exact price (Regression) or the direction of movement (Classification)? Assumption: We aim to predict the probability of a positive return over a specific horizon (e.g., next 5 minutes) to support automated trading.

Constraints & Scale: What is the universe of stocks? Assumption: 5,000 liquid equities (e.g., US markets) with a prediction frequency of 1 minute.

Latency Budget: What is the end-to-end latency requirement? Assumption: P99 < 50ms for feature engineering + inference.

Data Freshness: How quickly must new market data be incorporated? Assumption: Near real-time (seconds) via websocket feeds.

Assumptions:

We are designing for a mid-frequency trading signal (minutes, not microseconds).

We assume access to L1 market data (Price, Volume) and basic fundamental data.

We assume a P99 latency of 50ms is acceptable for the serving layer.

Thinking Process

Identify the Core Difficulty: Financial data is non-stationary and has a low signal-to-noise ratio. The primary challenge isn't just the model, but feature engineering and avoiding data leakage (look-ahead bias).

Retrieval vs. Ranking: Unlike a recommender system (RecSys), there is no retrieval or ranking stage here. This is a pure time-series forecasting/classification problem, so the pipeline must respect sequential dependencies.

Scaling Strategy: Horizontal scaling for inference (per stock) and a high-throughput stream processing engine (Flink/Kafka) for real-time feature computation.

Simplification (YAGNI): Start with a Gradient Boosted Decision Tree (LightGBM) using technical indicators. Deep learning (LSTMs/Transformers) adds complexity that may not beat a well-featured tree model in the MVP phase.

Elite Bonus Points

Stationarity & Fractional Differentiation: Instead of simple differencing (P_t - P_{t-1}), which loses memory, use Fractional Differentiation to make the series stationary while preserving long-term memory.
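A minimal sketch of fixed-width-window fractional differentiation, assuming a differencing order d between 0 and 1 and a weight cutoff chosen for illustration. The function names and the `threshold` default are hypothetical, not taken from a specific library.

```python
from typing import List, Optional


def frac_diff_weights(d: float, threshold: float = 1e-4, max_len: int = 1000) -> List[float]:
    """Binomial-expansion weights of (1 - B)^d, truncated below `threshold`."""
    weights = [1.0]
    for k in range(1, max_len):
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
    return weights


def frac_diff(series: List[float], d: float, threshold: float = 1e-4) -> List[Optional[float]]:
    """Apply the truncated weights as a backward-looking filter.

    Only current and past values are used, so no look-ahead is introduced.
    Positions without enough history are returned as None.
    """
    w = frac_diff_weights(d, threshold)
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t < len(w) - 1:
            out.append(None)
        else:
            out.append(sum(w[k] * series[t - k] for k in range(len(w))))
    return out
```

With d = 1 the weights collapse to [1, -1], which is ordinary first differencing; smaller d keeps a longer tail of past prices in each value.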

Triple Barrier Method: Move beyond fixed-horizon labeling. Use the Triple Barrier Method (Profit Take, Stop Loss, Time Out) to label data more realistically for trading environments.

Lead-Lag Relationships: Incorporate features from correlated assets (e.g., if SPY moves, Apple often follows) to capture cross-asset momentum.
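One way to check whether a candidate leader is worth including is a lagged correlation of returns. The sketch below assumes two aligned return series of equal length; `lagged_correlation` is an illustrative helper name.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def lagged_correlation(leader: Sequence[float], follower: Sequence[float], lag: int = 1) -> float:
    """Correlate the leader's return at t - lag with the follower's return at t.

    `lag` must be >= 1 so that only past leader values are paired with the
    follower, which keeps the resulting feature free of look-ahead.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return pearson(leader[:-lag], follower[lag:])
```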

Adversarial Validation: Train a classifier to distinguish training-period rows from test-period rows. An AUC well above 0.5 means the two distributions differ significantly, which flags a market "regime shift" before it degrades model performance.

Design Breakdown

Requirements

Product Goal: Generate a signal (probability of price increase) to inform buy/sell decisions.

Success Metrics:

Online Metrics: Sharpe Ratio, Maximum Drawdown, Precision of the directional signal (hit rate).

Offline Metrics: Log-Loss, F1-Score, PR-AUC.

Guardrail Metrics: Inference Latency (P99), Training-Serving Feature Drift.
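The two risk-adjusted metrics named above can be computed from a series of per-period strategy returns. This is a sketch that assumes a zero risk-free rate and simple (not log) returns; `periods_per_year` should match the bar frequency used.

```python
import math
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    std = math.sqrt(variance)
    if std == 0:
        return 0.0
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```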

System Constraints: 5k stocks, 1-minute updates, 50ms latency, 99.9% availability.

Data Availability: Historical OHLCV (Open, High, Low, Close, Volume), order book snapshots, and macroeconomic calendars.

ML Problem Framing

ML Task Type: Binary Classification (or Multi-class: Up, Down, Neutral).

Prediction Target:

P(\text{Return}_{t+h} > \text{threshold} | \text{MarketState}_t)

Inputs:

User Features: N/A (usually anonymous/market-wide).

Item Features (Stock): Ticker metadata, Sector, Market Cap.

Context Features: Time of day (market open/close volatility), VIX index (volatility context), interest rates.

Sequential Features: RSI (Relative Strength Index), MACD, Moving Averages, Order Book Imbalance.

Outputs: Probability score in [0, 1].

ML Challenges: Data Leakage (using future info in training), Non-stationarity (regime changes), and Overfitting to noise.

Design Summary & MVP

Concise Summary: A real-time streaming pipeline that extracts technical indicators from market feeds, serves predictions via a LightGBM model, and logs results for offline backtesting.

Model Architecture & Selection:

Baseline Model: Logistic Regression on 5-minute returns.

Target Model: LightGBM (Gradient Boosted Trees).

Choice Rationale: GBDTs handle non-linear relationships and tabular features (technical indicators) well, and they tend to be more robust to outliers and noise than deep learning models on tabular financial data.

ML Life Cycle Summary: Market data flows into Flink for real-time windowing -> Features stored in Redis (Online) and S3 (Offline) -> LightGBM trained on historical windows -> Served via high-speed API.

Simplicity Audit: Avoids complex LSTMs/Transformers initially; uses well-understood technical indicators and a high-performance tree model.

Architecture Decision Rationale:

Why?: Financial signals are weak; features (RSI, etc.) matter more than model depth. GBDTs provide the best ROI on complexity.

Requirement Satisfaction: Meets latency targets (<50ms) and scale (5k stocks) via distributed inference.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: L1/L2 Market feeds (Polygon.io, Alpaca), fundamental data (SEC filings), and alternative data (News/Twitter).

Data Ingestion: Kafka serves as the backbone. It handles high-throughput tick data with at-least-once semantics, so downstream consumers must tolerate duplicate ticks. We partition by Ticker ID so that each stock's ticks stay in order within a partition.
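To illustrate the two ideas in this step, the sketch below maps a ticker to a partition and drops redelivered ticks. It is not Kafka client code: real clients use their own key-hashing functions, and the per-ticker sequence number is an assumed field on the feed.

```python
import zlib
from typing import Dict


def partition_for(ticker: str, num_partitions: int) -> int:
    """Stable key-to-partition mapping (illustrative hash, not Kafka's own)."""
    return zlib.crc32(ticker.encode("utf-8")) % num_partitions


class TickDeduper:
    """Drops ticks that were already seen, as can happen under at-least-once delivery.

    Assumes each tick carries a per-ticker, monotonically increasing sequence number.
    """

    def __init__(self) -> None:
        self._last_seq: Dict[str, int] = {}

    def accept(self, ticker: str, seq: int) -> bool:
        last = self._last_seq.get(ticker)
        if last is not None and seq <= last:
            return False  # duplicate or replayed tick
        self._last_seq[ticker] = seq
        return True
```

Because all ticks for one ticker land in one partition, a single consumer sees them in order and the deduper only needs per-ticker state.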

Data Storage:

Data Lake (S3): Stores raw ticks in Parquet format for cost-effective historical backtesting.

Data Warehouse (Snowflake/BigQuery): For structured fundamental data and metadata.

Data Processing: Apache Flink performs stateful stream processing. It calculates rolling windows (e.g., 5-min, 15-min moving averages) and handles late-arriving data via watermarking.
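The windowing and watermark behavior described above can be simulated in a few lines. This is a plain-Python model of event-time tumbling windows with a bounded-out-of-orderness watermark, not Flink API code; the class name and parameters are illustrative.

```python
from typing import Dict, List, Tuple


class TumblingMeanWindow:
    """Event-time tumbling windows that emit the mean price once the watermark passes."""

    def __init__(self, window_ms: int, max_delay_ms: int) -> None:
        self.window_ms = window_ms
        self.max_delay_ms = max_delay_ms
        self.watermark = float("-inf")
        self._open: Dict[int, List[float]] = {}  # window start -> prices

    def on_event(self, ts_ms: int, price: float) -> List[Tuple[int, float]]:
        """Ingest one tick and return (window_start, mean_price) for closed windows."""
        start = ts_ms - ts_ms % self.window_ms
        if start + self.window_ms <= self.watermark:
            return []  # window already emitted: the tick is too late and is dropped
        self._open.setdefault(start, []).append(price)
        # The watermark trails the largest event time seen by max_delay_ms.
        self.watermark = max(self.watermark, ts_ms - self.max_delay_ms)
        emitted = []
        for s in sorted(self._open):
            if s + self.window_ms <= self.watermark:
                prices = self._open.pop(s)
                emitted.append((s, sum(prices) / len(prices)))
        return emitted
```

A tick that arrives out of order but before the watermark passes its window end is still counted; anything later is discarded, which is the trade-off between completeness and latency that the allowed delay controls.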

Feature Pipeline

Feature Definition:

Momentum: RSI, Rate of Change.

Volatility: Bollinger Bands, ATR.

Volume: On-Balance Volume (OBV), VWAP.

Microstructure: Bid-Ask Spread, Order Book Imbalance.
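A few of the indicators listed above, sketched in plain Python. The RSI here uses simple averages of gains and losses rather than Wilder's smoothing, and all function names are illustrative.

```python
from typing import Optional, Sequence


def rsi(closes: Sequence[float], period: int = 14) -> Optional[float]:
    """RSI over the last `period` price changes (simple-average variant)."""
    if len(closes) < period + 1:
        return None
    window = closes[-(period + 1):]
    gains = losses = 0.0
    for prev, cur in zip(window, window[1:]):
        change = cur - prev
        if change > 0:
            gains += change
        else:
            losses -= change
    if gains == 0 and losses == 0:
        return 50.0  # flat prices: treat as neutral
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)


def order_book_imbalance(bid_size: float, ask_size: float) -> float:
    """Top-of-book imbalance in [-1, 1]; positive means more resting bid size."""
    total = bid_size + ask_size
    return 0.0 if total == 0 else (bid_size - ask_size) / total


def vwap(prices: Sequence[float], volumes: Sequence[float]) -> Optional[float]:
    """Volume-weighted average price over the supplied trades or bars."""
    total_volume = sum(volumes)
    if total_volume == 0:
        return None
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume
```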

Online vs. Offline: Flink computes features in real-time for the Online Feature Store (Redis). The same logic is applied to historical data for the Offline Feature Store (S3) to prevent Training-Serving skew.

Point-in-Time Joins: Crucial to prevent leakage. When creating the training set, we ensure we only use features available before the prediction timestamp.
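A point-in-time lookup can be sketched with a binary search over sorted feature timestamps. The version below is deliberately strict and returns only values stamped before the prediction time; the function name is illustrative, and the timestamps are assumed to record when the feature became available, not when the underlying event occurred.

```python
import bisect
from typing import Any, Optional, Sequence


def point_in_time_lookup(
    feature_times: Sequence[int],
    feature_values: Sequence[Any],
    as_of: int,
) -> Optional[Any]:
    """Return the latest feature value with timestamp strictly before `as_of`.

    `feature_times` must be sorted ascending and aligned with `feature_values`.
    """
    idx = bisect.bisect_left(feature_times, as_of) - 1
    if idx < 0:
        return None  # nothing was known yet at prediction time
    return feature_values[idx]
```

Running this once per training row reproduces what the online store would have returned at that moment, which is what keeps the offline training set honest.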

Model Architecture

Problem Formulation: Binary classification (y = 1 if the 5-minute return is > 0.1%, else y = 0).
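The fixed-horizon label in this formulation can be written as follows, assuming one price per 1-minute bar so that a horizon of 5 bars equals 5 minutes. The function name and defaults are illustrative.

```python
from typing import List, Optional, Sequence


def fixed_horizon_labels(
    prices: Sequence[float],
    horizon: int = 5,
    threshold: float = 0.001,
) -> List[Optional[int]]:
    """Label each bar 1 if the forward return over `horizon` bars exceeds `threshold`.

    The last `horizon` bars have no known future price and are labeled None,
    so they must be dropped from training rather than filled.
    """
    labels: List[Optional[int]] = []
    for t in range(len(prices)):
        if t + horizon >= len(prices):
            labels.append(None)
        else:
            forward_return = prices[t + horizon] / prices[t] - 1.0
            labels.append(1 if forward_return > threshold else 0)
    return labels
```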

Architecture: LightGBM.

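Inputs: Dense vector of technical indicators plus a ticker representation. LightGBM has no embedding layer, so ticker embeddings would have to be pre-trained in a separate model and fed in as numeric features; the simpler option is to pass Ticker ID as a categorical feature.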

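Model: An ensemble of depth-limited trees (max depth 6-10) to prevent overfitting. Because LightGBM grows trees leaf-wise, the number of leaves should be capped as well.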

Rationale:

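Speed: LightGBM's histogram-based, leaf-wise tree growth is often faster to train than XGBoost on large, high-dimensional tabular data, though the gap depends on the dataset and configuration.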

Categorical Support: Built-in handling for Ticker IDs and Sector IDs.

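Optimization: Quantizing features (e.g., to Int8) can reduce memory footprint and payload size. LightGBM already bins each feature into at most max_bin (255 by default) histogram buckets during training, so the gain at inference time should be measured rather than assumed.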
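As a starting point, the choices above might translate into a parameter dictionary like the one below. The keys are standard LightGBM parameter names; the values are illustrative guesses, not tuned settings, and would normally come out of the hyperparameter search.

```python
# Illustrative LightGBM parameters for the binary "up move" classifier.
# Values are placeholders to be tuned, not recommendations.
lgbm_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "max_depth": 8,            # within the 6-10 range discussed above
    "num_leaves": 63,          # kept well below 2 ** max_depth to limit overfitting
    "learning_rate": 0.05,
    "min_data_in_leaf": 200,   # large leaves help with noisy labels
    "feature_fraction": 0.8,   # column subsampling per tree
    "bagging_fraction": 0.8,   # row subsampling
    "bagging_freq": 1,
    "lambda_l2": 1.0,
    "max_bin": 255,            # histogram bins per feature (LightGBM default)
}

# Columns to declare as categorical when building the training dataset.
# The column names are hypothetical.
categorical_columns = ["ticker_id", "sector_id"]
```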

Training Pipeline

Dataset Construction: Use the Triple Barrier Method. A label is 1 if price hits the upper barrier first, 0 if it hits the lower barrier or time limit.
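A minimal sketch of this labeling rule, assuming one price per bar and symmetric handling of the lower barrier and the timeout (both map to 0, as stated above). Barrier widths are passed as fractions of the entry price; the function name is illustrative.

```python
from typing import Sequence


def triple_barrier_label(
    prices: Sequence[float],
    start: int,
    profit_take: float,
    stop_loss: float,
    max_holding: int,
) -> int:
    """Return 1 if the upper barrier is touched first, else 0.

    Bars are scanned in time order, so whichever barrier is reached first
    decides the label. With bar data the order of touches inside a single
    bar is unknown; this sketch only checks one price per bar.
    """
    entry = prices[start]
    upper = entry * (1.0 + profit_take)
    lower = entry * (1.0 - stop_loss)
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        if prices[t] >= upper:
            return 1
        if prices[t] <= lower:
            return 0
    return 0  # time barrier reached without touching the upper barrier
```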

Data Splitting: Time-series Cross-Validation (Walk-Forward Evaluation). Do not use random splits; always train on past, test on future.
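Walk-forward splitting can be expressed as a small generator over row indices, assuming rows are already sorted by time. This sketch uses a rolling training window of fixed size; an expanding window is an equally valid variant.

```python
from typing import Iterator, Optional, Tuple


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    step: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) with every test block after its train block."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step
```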

Infrastructure: Distributed training using Ray to parallelize hyperparameter sweeps (Optuna).

Retraining: Retrain weekly on a schedule to adapt to new market regimes, and trigger an additional retrain automatically if the "Adversarial Validation" score between last week's and this week's data exceeds a threshold.

Serving Pipeline

Serving Pattern: Request-Response via gRPC for lowest latency.

Architecture: Containerized LightGBM models on K8s.

Latency Optimization:

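Fetch features from Redis in a single batched or pipelined call, and overlap that fetch with the rest of request handling.

Run inference through LightGBM's native C++ predictor (for example via its C API) rather than through per-row Python calls.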

Reliability: If the model service fails, fall back to a "Last Value" heuristic (momentum persistence).
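The fallback can be sketched as a timeout wrapper around the model call. This assumes the last known signal for the ticker is available to the caller; the function name, the thread-pool size, and the 30 ms default are illustrative choices within the 50ms budget.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Shared pool so that a slow model call does not block the caller on shutdown.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def predict_with_fallback(
    model_fn: Callable[[Sequence[float]], float],
    features: Sequence[float],
    last_signal: float,
    timeout_s: float = 0.03,
) -> float:
    """Return the model's score, or the last known signal on error or timeout."""
    future = _EXECUTOR.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an exception raised inside the model call.
        future.cancel()
        return last_signal
```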

Evaluation Pipeline

Offline Evaluation: Use Purged K-Fold Cross Validation, which drops training samples whose label windows overlap the test fold. Primary metrics: Information Coefficient (IC) and Precision at K.
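A simplified index-based sketch of purged K-fold, assuming time-ordered rows and that each label looks `label_horizon` bars ahead. The optional `embargo` drops a few extra rows after each test fold; the function name and parameters are illustrative.

```python
from typing import Iterator, List, Tuple


def purged_kfold(
    n_samples: int,
    n_splits: int,
    label_horizon: int,
    embargo: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists with overlapping-label rows removed from train."""
    for fold in range(n_splits):
        test_start = fold * n_samples // n_splits
        test_end = (fold + 1) * n_samples // n_splits  # exclusive
        test = list(range(test_start, test_end))
        train = [
            i
            for i in range(n_samples)
            # Before the fold: the label window [i, i + horizon] must end before it starts.
            if i + label_horizon < test_start
            # After the fold: skip rows covered by test labels, plus the embargo.
            or i >= test_end + label_horizon + embargo
        ]
        yield train, test
```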

Online Evaluation: Paper Trading (Shadow Mode). Run the model on live data without executing trades to verify Sharpe Ratio and Drawdown in real-world conditions before committing capital.

Monitoring Pipeline

System Monitoring: Track feature_retrieval_latency and model_inference_latency.

Data Monitoring: Track Feature Drift (PSI). Financial features drift rapidly (e.g., volume spikes).
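PSI compares the binned distribution of a feature in a reference window against a recent window. The sketch below uses equal-width bins derived from the reference data and a small floor to avoid taking the log of zero; both are common but not the only choices.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant reference feature: everything lands in one bin

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but for fast-moving features such as volume the alert threshold is better calibrated on historical data.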

Model Monitoring: Monitor the Prediction Mean. If the model suddenly starts predicting "Buy" for 100% of stocks, a circuit breaker should trigger.
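A circuit-breaker check along these lines could run on each cross-section of predictions before signals are released. The thresholds are illustrative assumptions and would need tuning against normal market days.

```python
from typing import Sequence, Tuple


def should_trip(
    probabilities: Sequence[float],
    buy_threshold: float = 0.5,
    max_one_sided: float = 0.95,
    mean_band: Tuple[float, float] = (0.2, 0.8),
) -> bool:
    """Return True if the prediction batch looks degenerate and signals should be halted."""
    n = len(probabilities)
    if n == 0:
        return True  # no predictions at all is itself a failure
    mean = sum(probabilities) / n
    buy_fraction = sum(1 for p in probabilities if p > buy_threshold) / n
    one_sided = max(buy_fraction, 1.0 - buy_fraction)
    return one_sided > max_one_sided or not (mean_band[0] <= mean <= mean_band[1])
```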

Wrap Up

Final Evaluation

Observability: Real-time dashboards for P&L, signal distribution, and feature importance.

Edge Cases:

Flash Crashes: Hard-coded volatility halts in the feature pipeline.

Dividends/Splits: Automated data cleaning to adjust historical prices.
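For the split case, back-adjustment divides pre-split prices by the split ratio and multiplies pre-split volumes by it, so the series is continuous across the event. The sketch assumes a known split index and ratio; dividend adjustment follows the same pattern with a different factor and is not shown.

```python
from typing import List, Sequence, Tuple


def adjust_for_split(
    prices: Sequence[float],
    volumes: Sequence[float],
    split_index: int,
    ratio: float,
) -> Tuple[List[float], List[float]]:
    """Back-adjust bars before `split_index` for a `ratio`-for-1 split."""
    adj_prices = [p / ratio if i < split_index else p for i, p in enumerate(prices)]
    adj_volumes = [v * ratio if i < split_index else v for i, v in enumerate(volumes)]
    return adj_prices, adj_volumes
```

Without this step a 4-for-1 split would appear as a 75% one-bar crash and contaminate both returns and labels.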

Trade-offs:

Complexity vs. Latency: Adding more features (e.g., NLP sentiment) increases latency; MVP keeps it to structured data.

Recall vs. Precision: In trading, Precision (accuracy of the "Buy" signal) is often prioritized over Recall to minimize losing trades.