The Question

Stock Price Forecasting System

Design a high-scale machine learning system to predict next-day adjusted closing prices for a universe of 10,000+ global equities. Your design should emphasize the prevention of temporal data leakage, handle non-stationary financial time-series data, and describe a robust backtesting framework. Detail the end-to-end architecture from data ingestion of OHLCV feeds to batch inference, including strategies for handling market regime shifts and monitoring model performance against financial benchmarks like the Sharpe Ratio.

XGBoost

LightGBM

Kafka

Spark

Delta Lake

Feast

Redis

Airflow

Huber Loss

Questions & Insights

Clarifying Questions

Business Goal: Is the objective to provide a point-in-time price forecast for retail users, or is this for an automated trading execution system?

Assumption: We are building a price forecasting service for a retail investment platform to predict the "Adjusted Close" price for the next market day.

Constraints & Scale: How many assets are we covering, and what is the required inference latency?

Assumption: We cover 10,000 tickers (Stocks, ETFs) with a P99 latency of <200ms.

Data Freshness: How quickly must the model incorporate new market data?

Assumption: Daily retraining or incremental updates are sufficient for an MVP, but feature updates must happen as soon as the market closes.

Prediction Horizon: Are we predicting the price in 1 minute, 1 hour, or 1 day?

Assumption: Next-day closing price (T+1).

Thinking Process

Identify the Core Challenge: Financial time-series data is notoriously non-stationary and noisy. The signal-to-noise ratio is extremely low compared to domains such as recommender systems (RecSys) or computer vision (CV).

Avoid Overengineering: While LSTMs or Transformers are trendy, for an MVP, Gradient Boosted Decision Trees (GBDTs) like XGBoost often outperform deep learning on tabular technical indicators due to their robustness to outliers and ease of feature importance analysis.

Prevent Leakage: The biggest pitfall in FinML is "look-ahead bias." I must ensure the training pipeline uses a strict temporal split (walk-forward validation) rather than random K-fold.

Feature Focus: The value lies in feature engineering (momentum, volatility, sentiment) rather than just raw price inputs.

Elite Bonus Points

Stationarity & Log-Returns: Instead of predicting the raw price $P_t$ (which is non-stationary), I will predict the log-return $\ln(P_t / P_{t-1})$. This stabilizes variance and makes the target more amenable to ML models.

Fractional Differencing: To preserve memory in the time series while achieving stationarity, I could mention using fractional differentiation instead of integer differencing (the standard $I(1)$ approach).
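A minimal sketch of fractional differencing, assuming a fixed-width window of weights taken from the binomial expansion of $(1 - B)^d$. The function names and the `n_weights` truncation are illustrative choices, not part of the original design.

```python
def frac_diff_weights(d, n_weights):
    """Weights of (1 - B)^d truncated to n_weights terms: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, n_weights):
    """Apply the truncated weights to a trailing window, so each output uses only past values."""
    w = frac_diff_weights(d, n_weights)
    out = []
    for t in range(n_weights - 1, len(series)):
        out.append(sum(w[k] * series[t - k] for k in range(n_weights)))
    return out


# d = 1 reduces to ordinary first differencing; 0 < d < 1 keeps longer memory.
print(frac_diff_weights(0.5, 4))
print(frac_diff([1.0, 3.0, 6.0, 10.0], d=1.0, n_weights=2))
```

With `d = 1.0` the weights collapse to `[1, -1, 0, ...]`, which is the integer differencing the text contrasts against. In practice `d` would be chosen as the smallest value that passes a stationarity test.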

Adversarial Validation: Train a classifier to distinguish rows from the "train" period from rows of the "test" day. If it separates them easily, the two distributions differ significantly, which signals a market regime change.

Feature Neutralization: Techniques to ensure the model isn't just learning a "Sector" bias or "Market" bias, but actual idiosyncratic stock movements.
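One simple way to neutralize a feature (or a prediction) is to regress it on an exposure, such as market beta or a sector return, and keep only the residual. The sketch below assumes a single exposure and ordinary least squares with an intercept. The `neutralize` helper is hypothetical, and production setups usually neutralize against several exposures at once.

```python
def neutralize(values, exposure):
    """Return the residual of `values` after removing its linear dependence on `exposure`."""
    n = len(values)
    mean_v = sum(values) / n
    mean_e = sum(exposure) / n
    var_e = sum((e - mean_e) ** 2 for e in exposure)
    if var_e == 0:
        # Constant exposure: nothing to regress on, just demean.
        return [v - mean_v for v in values]
    beta = sum((e - mean_e) * (v - mean_v) for v, e in zip(values, exposure)) / var_e
    return [(v - mean_v) - beta * (e - mean_e) for v, e in zip(values, exposure)]


# Cross-section of four stocks on one day: raw signal vs. market beta (made-up numbers).
signal = [1.0, 3.0, 2.0, 5.0]
market_beta = [0.8, 1.0, 1.2, 1.5]
print(neutralize(signal, market_beta))
```

The residual has zero covariance with the exposure by construction, so whatever the model learns from it cannot be a pure market or sector effect.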

Design Breakdown

Requirements

Product Goal: Provide users with a predicted price range and direction for the next trading day.

Success Metrics:

Online Metrics: User Engagement (CTR on predictions), Portfolio Alpha (simulated).

Offline Metrics: Mean Absolute Percentage Error (MAPE) on the reconstructed price (MAPE on returns is unstable because returns sit near zero), Directional Accuracy (Did it go up when we said it would?), Sharpe Ratio of a strategy based on model output.

Guardrail Metrics: Prediction Volatility (we don't want the model jumping 20% daily for blue chips).

System Constraints: 10k assets, Daily predictions, 99.9% Availability.

Data Availability: OHLCV (Open, High, Low, Close, Volume) data, Corporate actions (splits/dividends), fundamental data (P/E ratios).

ML Problem Framing

ML Task Type: Regression (Time-series forecasting).

Prediction Target: Next-day log-return, $y_t = \ln(P_{t+1} / P_t)$, where $P_t$ is the adjusted close on day $t$.
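As a small illustration of the target, the sketch below builds next-day log-return labels from a list of adjusted closes and converts a predicted return back into a price. The function names are illustrative only.

```python
import math


def next_day_log_returns(adj_close):
    """Label for day t is ln(P_{t+1} / P_t); the last day has no label yet."""
    return [math.log(adj_close[t + 1] / adj_close[t]) for t in range(len(adj_close) - 1)]


def price_from_log_return(last_price, predicted_log_return):
    """Invert the transform to present a price forecast to the user."""
    return last_price * math.exp(predicted_log_return)


closes = [100.0, 110.0, 99.0]
labels = next_day_log_returns(closes)
print(labels)
print(price_from_log_return(closes[0], labels[0]))
```

Note that the most recent row never has a label at training time, which is exactly the row the model is asked to predict.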

Inputs:

User Context: (N/A for stock-only prediction, but relevant if personalized).

Item (Stock) Features: Historical OHLCV, RSI (Relative Strength Index), MACD, Bollinger Bands, Moving Averages (5d, 20d, 200d).

Context Features: Market volatility (VIX), Interest rates (10Y Treasury), Sector-level performance.

ML Challenges: High noise, regime shifts (bull vs. bear markets), and extreme outliers (black swan events).
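Two of the stock-level inputs listed above, moving averages and RSI, can be sketched with trailing windows so that the value for a given day never touches later prices. This version of RSI uses plain averages of gains and losses (sometimes called Cutler's RSI) rather than Wilder's smoothing, which is an assumption made for brevity.

```python
def sma(prices, window):
    """Simple moving average over a trailing window; first output aligns with index window-1."""
    return [sum(prices[i - window + 1:i + 1]) / window for i in range(window - 1, len(prices))]


def rsi_simple(prices, window=14):
    """RSI from plain averages of gains and losses over the trailing `window` price changes."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    out = []
    for i in range(window, len(changes) + 1):
        chunk = changes[i - window:i]
        avg_gain = sum(c for c in chunk if c > 0) / window
        avg_loss = sum(-c for c in chunk if c < 0) / window
        if avg_loss == 0:
            out.append(100.0)  # no down moves in the window
        else:
            out.append(100.0 - 100.0 / (1.0 + avg_gain / avg_loss))
    return out


print(sma([1, 2, 3, 4, 5], 3))
print(rsi_simple([1, 2, 1, 2, 1], window=4))
```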

Design Summary & MVP

Concise Summary: A batch-trained XGBoost regressor that ingests daily market aggregates and outputs a predicted log-return for the next trading day (T+1).

Model Architecture & Selection:

Baseline Model: Last-value persistence (predict $P_{t+1} = P_t$, i.e. a zero return) or Linear Regression.

Target Model: XGBoost / LightGBM.

Choice Rationale: GBDTs handle non-linear relationships and missing values well, are faster to iterate on than LSTMs, and provide feature importance which is crucial for financial auditing.

Simplicity Audit: We avoid LSTMs/Transformers for the MVP because they require massive amounts of data to beat a well-tuned XGBoost on daily intervals.

Architecture Decision Rationale: Use a Lambda architecture for features: Batch for historical indicators, Stream for the latest market close data.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Ingest from providers like Bloomberg, Polygon.io, or Alpaca.

Data Ingestion: Use Kafka for real-time price updates. For the MVP (daily predictions), a batch Airflow job triggered after each exchange's close (4:00 PM ET for US markets) is sufficient.

Data Storage: S3 for raw JSON/Parquet files. Use Delta Lake for ACID transactions to handle corrections (stocks often have price corrections post-market).

Data Quality: CRITICAL. Check for 0 volumes, negative prices, and sudden 10x jumps (usually unadjusted stock splits).
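The three checks named above can be sketched as a per-bar validator. The bar layout (a dict with `open`, `high`, `low`, `close`, `volume`) and the issue labels are assumptions for illustration; a real feed would define its own schema.

```python
def validate_bar(bar, prev_close=None, max_jump=10.0):
    """Return a list of data-quality issues for one daily OHLCV bar (empty list means clean)."""
    issues = []
    if bar["volume"] <= 0:
        issues.append("zero_volume")
    prices = [bar["open"], bar["high"], bar["low"], bar["close"]]
    if min(prices) <= 0:
        issues.append("non_positive_price")
    elif prev_close is not None and prev_close > 0:
        ratio = bar["close"] / prev_close
        # A 10x move up or down overnight is far more often an unadjusted split than a real move.
        if ratio >= max_jump or ratio <= 1.0 / max_jump:
            issues.append("possible_unadjusted_split")
    return issues


clean = {"open": 100.0, "high": 102.0, "low": 99.0, "close": 101.0, "volume": 1_000_000}
print(validate_bar(clean, prev_close=100.5))
print(validate_bar({**clean, "close": 10.0, "low": 9.5}, prev_close=100.0))
```

Flagged bars would typically be quarantined and reconciled against the corporate actions feed rather than silently dropped.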

Feature Pipeline

Feature Definition:

Technical: Moving averages, RSI, ATR (Volatility).

Fundamental: Market Cap, Sector (One-hot encoded).

Temporal: Day of week, Month (January effect), Days until Earnings.

Feature Store: Use a system like Feast. It’s vital to prevent Temporal Leakage (using tomorrow's data to predict today). The feature store ensures that when we generate training data for 2023-01-01, we only see features available before that timestamp.
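The point-in-time guarantee can be illustrated without any feature store: given a history of feature values keyed by when they became available, return the latest value strictly before the requested timestamp. This is a hypothetical stand-in for the idea, not Feast's API, and it assumes ISO-formatted date strings so that string order matches time order.

```python
from bisect import bisect_left


def feature_as_of(history, as_of):
    """history: list of (available_at, value) sorted by available_at.

    Returns the most recent value with available_at strictly before `as_of`,
    or None if nothing was known yet.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_left(timestamps, as_of)
    return None if idx == 0 else history[idx - 1][1]


rsi_history = [
    ("2022-12-29", 41.0),
    ("2022-12-30", 44.5),
    ("2023-01-01", 47.2),
    ("2023-01-03", 52.1),
]
# Training row dated 2023-01-01 may only see the value published on 2022-12-30.
print(feature_as_of(rsi_history, "2023-01-01"))
```

Keying on the time a value became available, rather than the market date it describes, is what keeps late corrections and restated fundamentals from leaking into older training rows.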

Model Architecture

Core Architecture: XGBoost Regressor.

Loss Function: Mean Squared Error (MSE) on log-returns as the simplest starting point.

Handling Outliers: Financial data has fat tails, and MSE lets a handful of extreme days dominate the fit. Switching to Huber Loss or MAE makes the model less sensitive to extreme single-day moves (e.g., a stock crashing 50% on a fraud scandal).

Optimization: Use Early Stopping based on a validation set that is chronologically after the training set.
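To make the outlier argument concrete, here is a plain-Python sketch of the Huber loss next to MSE. The threshold `delta=0.02` (a 2% return) is an arbitrary illustrative value. In practice one would likely rely on a built-in objective such as XGBoost's `reg:pseudohubererror` or LightGBM's `huber` rather than hand-rolling the loss.

```python
def huber_loss(y_true, y_pred, delta=0.02):
    """Quadratic for residuals up to delta, linear beyond it, averaged over samples."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        r = abs(yt - yp)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)


def mse_loss(y_true, y_pred):
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Four ordinary days and one -50% crash, all predicted as a flat 0% return.
realized = [0.01, -0.005, 0.002, 0.008, -0.5]
predicted = [0.0] * len(realized)
print(mse_loss(realized, predicted), huber_loss(realized, predicted))
```

Under MSE the crash day contributes almost all of the loss, while under Huber its contribution grows only linearly with the size of the miss.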

Training Pipeline

Dataset Construction: Use a sliding window. Train on 2 years, validate on 3 months, test on the next month. Slide the window forward by 1 month and repeat (Walk-forward validation).

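Label Construction: $y_t = \text{clip}(\ln(P_{t+1} / P_t), -0.2, 0.2)$, computed on adjusted closes so that the label matches the log-return target defined above. We clip extreme moves so that a few outlier days do not dominate the loss and its gradients.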

Retraining: Weekly retraining is likely enough for a daily prediction system; retrain more often (e.g., daily) when volatility spikes.
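The walk-forward scheme described under Dataset Construction (24 months train, 3 months validation, 1 month test, sliding by 1 month) can be sketched as an index generator. Months are represented as consecutive integers here, which is a simplification; the function name is illustrative.

```python
def walk_forward_splits(n_months, train=24, val=3, test=1, step=1):
    """Yield (train, val, test) month-index ranges in strict chronological order."""
    splits = []
    start = 0
    while start + train + val + test <= n_months:
        train_idx = range(start, start + train)
        val_idx = range(start + train, start + train + val)
        test_idx = range(start + train + val, start + train + val + test)
        splits.append((train_idx, val_idx, test_idx))
        start += step
    return splits


for tr, va, te in walk_forward_splits(n_months=30):
    print(f"train {tr[0]}-{tr[-1]}  val {va[0]}-{va[-1]}  test {te[0]}-{te[-1]}")
```

Every fold keeps training data strictly before validation data and validation strictly before test, which is the property random K-fold would violate.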

Serving Pipeline

Pattern: Batch Inference. Since we only predict once per day after market close, we pre-calculate all 10k predictions and store them in a Key-Value store (Redis/DynamoDB).

SLA: Even though it's batch, the "computation window" between market close (4:00 PM ET) and the next market open (9:30 AM ET) is our deadline.

Fallback: If the model fails, return the 5-day Moving Average (Heuristic).
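The serving path with its fallback can be sketched as a lookup that prefers the precomputed model output and otherwise returns the 5-day moving average. A plain dict stands in for Redis/DynamoDB, and the response shape with a `source` field is an assumption added so that clients and monitoring can tell the two cases apart.

```python
def lookup_prediction(ticker, prediction_store, recent_closes, window=5):
    """Return the batch prediction if present, else the trailing moving average as a heuristic."""
    predicted = prediction_store.get(ticker)
    if predicted is not None:
        return {"ticker": ticker, "price": predicted, "source": "model"}
    closes = recent_closes[ticker][-window:]
    return {"ticker": ticker, "price": sum(closes) / len(closes), "source": "fallback_ma"}


store = {"AAPL": 190.0}  # written by the nightly batch job
closes = {"AAPL": [185.0, 186.0, 188.0, 187.0, 189.0], "MSFT": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
print(lookup_prediction("AAPL", store, closes))
print(lookup_prediction("MSFT", store, closes))
```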

Evaluation Pipeline

Offline: Backtesting. This is more than just RMSE. We simulate a portfolio: if the model predicts >2% gain, we "buy". We measure the resulting Sharpe Ratio and Maximum Drawdown.

Online: A/B test the model against a "Top Gainer" heuristic.
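The offline backtest described above can be sketched as three small functions: a threshold rule that is long only when the predicted return exceeds 2%, an annualized Sharpe Ratio, and Maximum Drawdown. The sketch assumes a zero risk-free rate, 252 trading days per year, and no transaction costs, all of which a real backtest would need to revisit.

```python
import math
import statistics


def strategy_returns(predicted, realized, threshold=0.02):
    """Earn the realized next-day return only on days the model predicted more than `threshold`."""
    return [r if p > threshold else 0.0 for p, r in zip(predicted, realized)]


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized mean over standard deviation of daily returns (risk-free rate assumed zero)."""
    sd = statistics.stdev(daily_returns)
    if sd == 0:
        return 0.0
    return statistics.mean(daily_returns) / sd * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


pnl = strategy_returns([0.03, 0.01, 0.05], [0.01, -0.02, -0.10])
print(pnl, sharpe_ratio(pnl), max_drawdown(pnl))
```

Run inside each walk-forward test window, these numbers show whether a lower RMSE actually translates into a better risk-adjusted strategy.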

Monitoring Pipeline

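Drift Detection: Use Population Stability Index (PSI) on the input features. Markets change regimes (e.g., high vs. low interest rates). If a feature distribution shifts (e.g., VIX stays above 30), alert the team. PSI captures shifts in the inputs only; concept drift, where the relationship between features and returns changes, shows up in the rolling prediction error tracked below.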

Backfill/Delayed Feedback: We get the "ground truth" label at the next market close, one trading day after the prediction (longer across weekends and holidays). The monitoring system automatically calculates the error for the previous day's batch.
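A minimal PSI sketch for one feature, comparing a reference (training-period) sample with a recent sample. It assumes equal-width bins derived from the reference sample and a small `eps` floor for empty bins; quantile bins are a common alternative. The often-quoted alert levels of roughly 0.1 and 0.25 are rules of thumb, not values from the original design.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference sample

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values to edge bins
        return [max(c / len(sample), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


calm_vix = [float(v) for v in range(10, 30)]      # made-up reference period
stressed_vix = [float(v) for v in range(30, 50)]  # made-up recent window
print(psi(calm_vix, calm_vix), psi(calm_vix, stressed_vix))
```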

Wrap Up

Final Evaluation

Trade-offs:

Complexity vs. Returns: A simple XGBoost is easier to maintain than a DeepAR / Transformer model and, at a daily horizon, likely captures most of the achievable performance.

Freshness: We prioritize "Market Close" data. Intraday "Real-time" prediction is a massive leap in complexity (requires Flink, low-latency C++, etc.) and is out of scope for an MVP.

Failure Modes: Market holidays, Flash crashes (where historical patterns break), and API outages.