The Question
ML Design

Real-Time Financial Signal Generation System

Design a high-scale machine learning system to predict short-term equity price movements (e.g., 1-5 minute horizon) for a universe of 5,000 stocks. The system must process real-time market data feeds, perform feature engineering with minimal latency, and provide directional signals with a P99 latency of less than 50ms. Address specific challenges such as non-stationary data, prevention of look-ahead bias in the data pipeline, backtesting strategies, and the design of a robust online/offline feature store to maintain consistency.
LightGBM
Apache Flink
Kafka
Redis
Optuna
Ray
Parquet
Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:
Business Goal: Is the goal to predict the exact price (Regression) or the direction of movement (Classification)? Assumption: We aim to predict the probability of a positive return over a specific horizon (e.g., next 5 minutes) to support automated trading.
Constraints & Scale: What is the universe of stocks? Assumption: 5,000 liquid equities (e.g., US markets) with a prediction frequency of 1 minute.
Latency Budget: What is the end-to-end latency requirement? Assumption: P99 < 50ms for feature engineering + inference.
Data Freshness: How quickly must new market data be incorporated? Assumption: Near real-time (within seconds) via WebSocket feeds.
Assumptions:
We are designing for a mid-frequency trading signal (minutes, not microseconds).
We assume access to L1 market data (Price, Volume) and basic fundamental data.
We assume a P99 latency of 50ms is acceptable for the serving layer.

Thinking Process

Identify the Core Difficulty: Financial data is non-stationary and has a low signal-to-noise ratio. The primary challenge isn't just the model, but feature engineering and avoiding data leakage (look-ahead bias).
Retrieval vs. Ranking: Unlike RecSys, this is a pure Time-Series Forecasting / Classification problem. We need a pipeline that handles sequential dependencies.
Scaling Strategy: Horizontal scaling for inference (per stock) and a high-throughput stream processing engine (Flink/Kafka) for real-time feature computation.
Simplification (YAGNI): Start with a Gradient Boosted Decision Tree (LightGBM) using technical indicators. Deep learning (LSTMs/Transformers) adds complexity that may not beat a well-featured tree model in the MVP phase.

Elite Bonus Points

Stationarity & Fractional Differentiation: Instead of simple differencing (P_t - P_{t-1}) which loses memory, use Fractional Differentiation to make the series stationary while preserving long-term memory.
Triple Barrier Method: Move beyond fixed-horizon labeling. Use the Triple Barrier Method (Profit Take, Stop Loss, Time Out) to label data more realistically for trading environments.
Lead-Lag Relationships: Incorporate features from correlated assets (e.g., when SPY moves, AAPL often follows) to capture cross-asset momentum.
Adversarial Validation: Check if the training and test distributions are significantly different to detect "regime shifts" in the market before they degrade model performance.
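The fractional differentiation idea above can be illustrated with a minimal sketch. It assumes a fixed-width window and the binomial-series weights of (1 - B)^d, where B is the backshift operator; the function names and the window length are illustrative choices, not part of the original design.

```python
def frac_diff_weights(d: float, size: int) -> list[float]:
    """Weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, size):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series: list[float], d: float, window: int) -> list[float]:
    """Fixed-window fractional differencing.

    Output element j corresponds to input index j + window - 1, because the
    first window - 1 points do not have enough history.
    """
    w = frac_diff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        # w[0] multiplies the current value, w[k] the value k steps back.
        out.append(sum(w[k] * series[t - k] for k in range(window)))
    return out
```

With d = 1 the weights reduce to [1, -1, 0, ...], which is ordinary first differencing; a fractional d between 0 and 1 keeps slowly decaying weights on older prices, which is where the preserved memory comes from.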
Design Breakdown

Requirements

Product Goal: Generate a signal (probability of price increase) to inform buy/sell decisions.
Success Metrics:
Online Metrics: Sharpe Ratio, Maximum Drawdown, Precision (Directional Accuracy).
Offline Metrics: Log-Loss, F1-Score, PR-AUC.
Guardrail Metrics: Inference Latency (P99), Training-Serving Feature Drift.
System Constraints: 5k stocks, 1-minute updates, 50ms latency, 99.9% availability.
Data Availability: Historical OHLCV (Open, High, Low, Close, Volume), order book snapshots, and macroeconomic calendars.
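As a rough illustration of the two portfolio-level metrics named above, the following sketch computes an annualized Sharpe Ratio and the Maximum Drawdown from a list of per-period returns. The zero risk-free rate and the default of 252 periods per year are assumptions made for the example, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe Ratio, assuming a risk-free rate of zero."""
    sd = stdev(returns)
    if sd == 0:
        return 0.0  # undefined for constant returns; report 0 by convention here
    return mean(returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns: list[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

For a signal evaluated on 1-minute bars, periods_per_year would need to reflect the number of bars per year rather than trading days.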

ML Problem Framing

ML Task Type: Binary Classification (or Multi-class: Up, Down, Neutral).
Prediction Target: $P(\text{Return}_{t+h} > \text{threshold} \mid \text{MarketState}_t)$.
Inputs:
User Features: N/A (usually anonymous/market-wide).
Item Features (Stock): Ticker metadata, Sector, Market Cap.
Context Features: Time of day (market open/close volatility), VIX index (volatility context), interest rates.
Sequential Features: RSI (Relative Strength Index), MACD, Moving Averages, Order Book Imbalance.
Outputs: Probability score [0, 1].
ML Challenges: Data Leakage (using future info in training), Non-stationarity (regime changes), and Overfitting to noise.
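Two of the sequential features listed above can be sketched in a few lines. This is an illustrative version only: the RSI here uses plain sums of gains and losses over the lookback (the simple-average variant, not Wilder's smoothed version), and the order book imbalance uses top-of-book sizes; both choices are assumptions for the example.

```python
def rsi(closes: list[float], period: int = 14) -> float:
    """Simple-average RSI over the last `period` price changes (0 to 100)."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    recent = closes[-(period + 1):]
    changes = [b - a for a, b in zip(recent[:-1], recent[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains == 0 and losses == 0:
        return 50.0  # flat prices: neutral reading
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)


def order_book_imbalance(bid_size: float, ask_size: float) -> float:
    """(bid - ask) / (bid + ask), in [-1, 1]; positive means more resting bids."""
    total = bid_size + ask_size
    return 0.0 if total == 0 else (bid_size - ask_size) / total
```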

Design Summary & MVP

Concise Summary: A real-time streaming pipeline that extracts technical indicators from market feeds, serves predictions via a LightGBM model, and logs results for offline backtesting.
Model Architecture & Selection:
Baseline Model: Logistic Regression on 5-minute returns.
Target Model: LightGBM (Gradient Boosted Trees).
Choice Rationale: GBDTs handle non-linear relationships and tabular features (technical indicators) well, and on noisy tabular data they are often more robust to outliers than deep learning models, which need more data and tuning to avoid fitting noise.
ML Life Cycle Summary: Market data flows into Flink for real-time windowing -> Features stored in Redis (Online) and S3 (Offline) -> LightGBM trained on historical windows -> Served via high-speed API.
Simplicity Audit: Avoids complex LSTMs/Transformers initially; uses well-understood technical indicators and a high-performance tree model.
Architecture Decision Rationale:
Why?: Financial signals are weak; features (RSI, etc.) matter more than model depth. GBDTs provide the best ROI on complexity.
Requirement Satisfaction: Meets latency targets (<50ms) and scale (5k stocks) via distributed inference.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: L1/L2 Market feeds (Polygon.io, Alpaca), fundamental data (SEC filings), and alternative data (News/Twitter).
Data Ingestion: Kafka serves as the backbone. It handles high-throughput tick data with at-least-once semantics. We partition by Ticker ID to ensure sequential ordering per stock.
Data Storage:
Data Lake (S3): Stores raw ticks in Parquet format for cost-effective historical backtesting.
Data Warehouse (Snowflake/BigQuery): For structured fundamental data and metadata.
Data Processing: Apache Flink performs stateful stream processing. It calculates rolling windows (e.g., 5-min, 15-min moving averages) and handles late-arriving data via watermarking.
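To make the windowing and watermarking behavior concrete, here is a plain-Python sketch of an event-time tumbling window for a single ticker. It is not Flink code; it only mimics the idea that a window is emitted once the watermark (largest event time seen minus an allowed out-of-order delay) passes the window end, and that events arriving after that are dropped. The class name, window length, and delay are hypothetical.

```python
from collections import defaultdict


class TumblingWindowMean:
    """Event-time tumbling mean for one key (one ticker)."""

    def __init__(self, window_s: int = 60, max_out_of_order_s: int = 5):
        self.window_s = window_s
        self.max_out_of_order_s = max_out_of_order_s
        self.max_event_ts = float("-inf")
        self.open_windows = defaultdict(list)  # window start -> prices

    def on_event(self, ts: int, price: float) -> list[tuple[int, float]]:
        """Ingest one tick; return (window_start, mean_price) for windows it closes."""
        self.max_event_ts = max(self.max_event_ts, ts)
        watermark = self.max_event_ts - self.max_out_of_order_s
        start = ts - ts % self.window_s
        if start + self.window_s <= watermark:
            return []  # too late: this window was already emitted, drop the tick
        self.open_windows[start].append(price)
        closed = []
        for s in sorted(self.open_windows):
            if s + self.window_s <= watermark:
                prices = self.open_windows.pop(s)
                closed.append((s, sum(prices) / len(prices)))
        return closed
```

Because Kafka is partitioned by Ticker ID, one such state object per ticker sees that ticker's events in log order, so the out-of-order delay only has to cover disorder introduced upstream of Kafka.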

Feature Pipeline

Feature Definition:
Momentum: RSI, Rate of Change.
Volatility: Bollinger Bands, ATR.
Volume: On-Balance Volume (OBV), VWAP.
Microstructure: Bid-Ask Spread, Order Book Imbalance.
Online vs. Offline: Flink computes features in real-time for the Online Feature Store (Redis). The same logic is applied to historical data for the Offline Feature Store (S3) to prevent Training-Serving skew.
Point-in-Time Joins: Crucial to prevent leakage. When creating the training set, we ensure we only use features available before the prediction timestamp.
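A point-in-time join can be sketched as an "as of" lookup: for each prediction timestamp, take the most recent feature value recorded strictly before it. The helper below is a minimal illustration that assumes the feature timestamps for one ticker are already sorted; the function name is hypothetical.

```python
from bisect import bisect_left


def point_in_time_lookup(feature_ts: list[int], feature_values: list, prediction_ts: int):
    """Latest feature value with timestamp strictly before prediction_ts, else None."""
    # bisect_left returns the index of the first timestamp >= prediction_ts,
    # so the entry just before it is the last one that was already known.
    i = bisect_left(feature_ts, prediction_ts)
    return feature_values[i - 1] if i > 0 else None


# Example: a feature written at t=20 must not be visible to a prediction made at t=20.
ts = [10, 20, 30]
values = ["a", "b", "c"]
row_at_20 = point_in_time_lookup(ts, values, 20)
```

The strict inequality is the important design choice: a feature stamped at exactly the prediction time may include the bar being predicted from, so it is excluded.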

Model Architecture

Problem Formulation: Binary classification (y = 1 if the return exceeds 0.1% within 5 minutes).
Architecture: LightGBM.
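Input Layer: Dense vector of technical indicators plus ticker and sector identifiers as categorical features. LightGBM does not learn embeddings itself, so any entity embeddings for tickers would have to be trained separately and fed in as numeric columns.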
Model: Multiple shallow trees (depth 6-10) to prevent overfitting.
Rationale:
Speed: LightGBM's histogram-based, leaf-wise tree growth typically trains faster than XGBoost's exact split finding on large, high-dimensional datasets.
Categorical Support: Built-in handling for Ticker IDs and Sector IDs.
Optimization: Use Quantization (Int8) on features to reduce memory footprint and speed up inference.

Training Pipeline

Dataset Construction: Use the Triple Barrier Method. A label is 1 if price hits the upper barrier first, 0 if it hits the lower barrier or time limit.
Data Splitting: Time-series Cross-Validation (Walk-Forward Evaluation). Do not use random splits; always train on past, test on future.
Infrastructure: Distributed training using Ray to parallelize hyperparameter sweeps (Optuna).
Retraining: Scheduled weekly retraining to adapt to new market regimes, plus an out-of-cycle retrain triggered automatically when the adversarial validation score (how well a classifier can tell last week's data from this week's) exceeds a threshold.
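The Triple Barrier labeling rule described in this section can be sketched as follows. This is a simplified illustration that checks only bar closing prices, so a bar whose intrabar range touches both barriers is not modeled; the parameter names and the symmetric 1% barriers in the example are assumptions.

```python
def triple_barrier_label(
    prices: list[float],
    entry_idx: int,
    profit_take: float,
    stop_loss: float,
    max_holding: int,
) -> int:
    """Return 1 if the upper barrier is hit first, else 0 (lower barrier or timeout)."""
    entry = prices[entry_idx]
    upper = entry * (1.0 + profit_take)
    lower = entry * (1.0 - stop_loss)
    last = min(entry_idx + max_holding, len(prices) - 1)
    for price in prices[entry_idx + 1 : last + 1]:
        if price >= upper:
            return 1  # profit-take barrier touched first
        if price <= lower:
            return 0  # stop-loss barrier touched first
    return 0  # vertical (time) barrier reached without touching either level


# Example: 1% profit-take and stop-loss, at most 3 bars of holding time.
label = triple_barrier_label([100.0, 100.5, 101.2, 99.0], 0, 0.01, 0.01, 3)
```

Each label depends on prices up to max_holding bars after the entry, which is why the evaluation splits need to account for overlapping label windows.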

Serving Pipeline

Serving Pattern: Request-Response via gRPC for lowest latency.
Architecture: Containerized LightGBM models on K8s.
Latency Optimization:
Start the Redis feature fetch as soon as the request arrives, so it overlaps with request parsing and validation instead of running after them.
Use C++ prediction kernels for LightGBM.
Reliability: If the model service fails, fall back to a "Last Value" heuristic (momentum persistence).
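The fallback behavior can be sketched with a timeout around the model call. This example assumes an asyncio-based client, interprets the "Last Value" heuristic as re-serving the most recent signal for that ticker, and uses a 50ms timeout taken from the latency budget; `predict_with_fallback` and its parameters are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def predict_with_fallback(
    model_call: Callable[[], Awaitable[float]],
    last_signal: float,
    timeout_s: float = 0.05,
) -> tuple[float, str]:
    """Return (signal, source); fall back to the last signal on timeout or failure."""
    try:
        signal = await asyncio.wait_for(model_call(), timeout=timeout_s)
        return signal, "model"
    except (asyncio.TimeoutError, ConnectionError):
        # Model service too slow or unreachable: persist the previous signal.
        return last_signal, "fallback"
```

Returning the source alongside the signal lets monitoring count how often the fallback path is taken.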

Evaluation Pipeline

Offline Evaluation: Use Purged K-Fold Cross-Validation, which drops training samples whose label windows overlap the test fold. Primary metrics: Information Coefficient (IC) and Precision at K.
Online Evaluation: Paper Trading (Shadow Mode). Run the model on live data without executing trades to verify Sharpe Ratio and Drawdown in real-world conditions before committing capital.
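A minimal sketch of the purging step used in offline evaluation is shown below. It assumes samples are indexed by bar, that the label of sample i depends on bars i through i + horizon, and that an optional embargo of extra bars is skipped after each test block; the function name and these conventions are illustrative.

```python
from typing import Iterator


def purged_kfold(
    n_samples: int, n_splits: int, horizon: int, embargo: int = 0
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) with contiguous test blocks and purged training sets."""
    fold = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold
        stop = n_samples if k == n_splits - 1 else start + fold
        test = list(range(start, stop))
        test_label_end = stop - 1 + horizon  # last bar any test label depends on
        train = [
            i
            for i in range(n_samples)
            if i + horizon < start  # label resolves before the test block begins
            or i > test_label_end + embargo  # starts after test labels have resolved
        ]
        yield train, test
```

For example, with 10 samples, 2 splits, and a horizon of 1, the first fold tests on indices 0 to 4 and trains only on 6 to 9, because sample 5 overlaps the label window of sample 4.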

Monitoring Pipeline

System Monitoring: Track feature_retrieval_latency and model_inference_latency.
Data Monitoring: Track feature drift with the Population Stability Index (PSI). Financial features drift rapidly (e.g., volume spikes).
Model Monitoring: Monitor the Prediction Mean. If the model suddenly starts predicting "Buy" for 100% of stocks, a circuit breaker should trigger.
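Both monitors can be illustrated briefly. The sketch below computes PSI between a reference sample and a live sample over fixed bin edges, and checks the share of "Buy" predictions against a cap. The bin edges, the 0.5 buy threshold, and the 95% cap are hypothetical values chosen for the example.

```python
import math


def psi(expected: list[float], actual: list[float], edges: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""

    def shares(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bin index = number of edges below v
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_trip(probabilities: list[float], buy_threshold: float = 0.5, max_buy_share: float = 0.95) -> bool:
    """True if the share of 'Buy' predictions across the universe exceeds the cap."""
    buys = sum(p > buy_threshold for p in probabilities)
    return buys / len(probabilities) > max_buy_share
```

Identical distributions give a PSI of exactly 0, and the value grows as probability mass moves between bins, so an alert threshold can be tuned per feature.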
Wrap Up

Final Evaluation

Observability: Real-time dashboards for P&L, signal distribution, and feature importance.
Edge Cases:
Flash Crashes: Hard-coded volatility thresholds in the feature pipeline that halt signal generation when breached.
Dividends/Splits: Automated data cleaning to adjust historical prices.
Trade-offs:
Complexity vs. Latency: Adding more features (e.g., NLP sentiment) increases latency; MVP keeps it to structured data.
Recall vs. Precision: In trading, Precision (accuracy of the "Buy" signal) is often prioritized over Recall to minimize losing trades.