The Question
ML DesignUnified ML Feature Store & Data Platform
Design a high-scale data platform capable of orchestrating both batch and streaming pipelines for a real-time recommendation system. The system must handle over 1M events per second, provide a unified interface for feature engineering to eliminate training-serving skew, and support low-latency (<10ms) feature retrieval for online inference. Detail the architecture for point-in-time correct training set generation, the integration of a feature store, and how you ensure data quality and reliability across the offline and online paths.
Spark
Flink
Kafka
Iceberg
Redis
Debezium
Great Expectations
Feature Store
CDC
Point-in-time Joins
Questions & Insights
Clarifying Questions
Business Goal: Is this platform intended for a specific application (e.g., Real-time Recommendations) or a general-purpose ML platform? (Assumption: Support for a high-scale Recommendation System).
Constraints & Scale: What is the scale of events per second and total features? (Assumption: 1M events/sec, 10k+ features, 100TB+ historical data).
Latency & Freshness: What are the P99 latency requirements for feature retrieval and the data freshness (SLA) for streaming? (Assumption: <10ms retrieval, <1s freshness).
Consistency: Is strict online-offline consistency required? (Assumption: Yes, to prevent training-serving skew).
Assumptions: I assume a hybrid architecture (Lambda/Kappa) using a Feature Store to abstract the complexity of data movement for Data Scientists.
Thinking Process
Identify the Core Conflict: The fundamental challenge is the "Online-Offline Divergence"—where features used for training (batch) differ from those used for inference (streaming), leading to model degradation.
The Bottleneck: Point-in-time joins for historical training data are computationally expensive. Real-time aggregations (e.g., "clicks in the last 5 minutes") require stateful stream processing.
Strategy: I will design a Unified Feature Store architecture. This centralizes the logic so that a single feature definition produces both batch data for training and low-latency lookups for serving.
Scaling: Use decoupled compute (Spark/Flink) and storage (Data Lake/NoSQL) to scale independently.
Elite Bonus Points
Point-in-Time Correctness (Time Travel): Implementing a "pit-join" mechanism to prevent data leakage by ensuring labels are only joined with features that existed at the exact timestamp of the event.
Change Data Capture (CDC): Using CDC from production databases (Postgres/MySQL) via Debezium into Kafka to turn static state into a stream, enabling real-time feature updates without polling.
Feature Versioning & Warm-start: Treating feature logic as code (GitOps). When logic changes, trigger "backfills" to re-calculate historical values and "warm-start" the online store.
Cost-Optimized Tiering: Moving cold features (rarely accessed) to cheaper storage (S3) while keeping hot features in memory (Redis), managed by an automated TTL policy.
Design Breakdown
Requirements
Product Goal: Enable Data Scientists to define features once and deploy them to both training (batch) and serving (streaming) environments.
Success Metrics:
Online Metrics: Feature retrieval latency (P99), Data freshness (End-to-end latency from event to feature availability).
Offline Metrics: Feature consistency score (online vs. offline value match), Training set generation time.
Guardrail Metrics: System uptime (SLAs), Cost per feature, Storage growth.
System Constraints: 1M+ QPS at peak, 100ms end-to-end freshness, multi-region availability.
Data Availability: Interaction logs (Kafka), User/Item metadata (S3/Snowflake), Production DBs (CDC).
ML Problem Framing
ML Task Type: This is a Meta-System Design for Feature Engineering and Data Orchestration.
Prediction Target: The system serves f(x) where x is a vector of features [u, i, c] (User, Item, Context).
Inputs:
Batch: Historical logs in S3.
Streaming: Real-time event streams in Kafka.
Outputs:
Offline: Parquet files for model training.
Online: Low-latency key-value pairs for model inference.
ML Challenges: Training-serving skew, data leakage, and handle massive write-heavy workloads for real-time feature updates.
Design Summary & MVP
Concise Summary: A dual-path architecture using a Unified Feature Store. Batch paths (Spark) handle historical aggregations, while streaming paths (Flink) handle real-time windowed features, both governed by a shared Metadata Layer.
Model Architecture & Selection:
Baseline: Manual SQL scripts and CSV exports.
Target: Feature Store (e.g., Feast or Tecton-like) with Flink for stream-processing and Redis for online serving.
Simplicity Audit: We avoid custom-built engines. We use industry-standard Kafka for messaging, Spark for batch, and Redis for serving. This follows YAGNI by not over-building custom state-management logic.
Architecture Decision Rationale: A Feature Store is the best choice here because it explicitly solves the training-serving skew and provides a registry for discovery and reuse, which is critical at scale.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: High-volume clickstream (Protobuf format in Kafka) and dimensional data (S3/Snowflake).
Data Ingestion: Kafka acts as the buffer. For reliability, we use "At-least-once" semantics with idempotent writes to the Data Lake to ensure no data loss during spikes.
Data Storage: Apache Iceberg on S3. Iceberg allows for schema evolution and ACID transactions, which is crucial when multiple Spark jobs write to the same feature tables.
Data Quality: Deployed at the ingestion gate. Using Great Expectations to validate schemas, null counts, and range checks before data hits the silver/gold layers.
Feature Pipeline
Offline Feature Pipeline: Spark SQL jobs run nightly to compute "Long-term" features (e.g., user's 30-day purchase history). Results are stored in Parquet for high-throughput reading.
Online Feature Pipeline: Flink jobs compute "Real-time" features (e.g., clicks in the last 10 seconds). Flink maintains state in RocksDB to handle windowed aggregations efficiently.
Feature Store: The Registry (YAML/Python-based definitions) ensures that
user_avg_spend is calculated the same way in both Spark (SQL) and Flink (DataStream API).Model Architecture
Task: While the platform is general, we optimize for Ranking (Pointwise).
Architecture: Two-Tower Models for retrieval (User/Item embeddings) and DeepFM/DCN for ranking.
Complexity: Embeddings are updated via the streaming pipeline and pushed to a Vector DB (e.g., Milvus/Pinecone) for low-latency similarity search.
Training Pipeline
Dataset Construction: The Point-in-time Joiner is the "Hero" component. It takes a list of (entity_id, event_timestamp, label) and joins it with the Feature Store history by finding the latest feature value v where v.timestamp \le event.timestamp.
Retraining: Automated via Airflow when "Feature Drift" exceeds a threshold (measured by KS-test on input distributions).
Serving Pipeline
Serving Pattern: Hybrid. Fetch pre-computed features from Redis, compute context-only features (e.g., current device) on the fly.
Latency Optimization: Pipelined Redis requests (MGET) to fetch 100+ features in a single RTT.
Fallback: If the Online Feature Store times out, the service falls back to "Popularity-based" or "Global Average" features to ensure availability.
Evaluation Pipeline
Offline: Metric-store for NDCG and AUC.
Online: Interleaving or A/B testing framework. We log "Feature Snapshots" during inference to the Data Lake to allow for Counterfactual Evaluation later.
Monitoring Pipeline
System: Prometheus/Grafana for QPS, latency, and cache hit rates.
ML Specific: Population Stability Index (PSI) monitoring to detect if the distribution of incoming features at inference time has shifted significantly from the training distribution.
Wrap Up
Final Evaluation
Trade-offs: We chose Eventual Consistency for real-time features (it's okay if a click takes 500ms to show up in the count) to gain massive write scalability.
Failure Modes: If Kafka lags, features become stale. We implement "Freshness Monitoring" that alerts if the gap between
event_time and processing_time grows beyond 5 seconds.Distinguishing Insights: For a Principal level, emphasize the Unification of Feature Logic. Instead of two separate codebases (SQL for Spark, Java for Flink), use a DSL (like Tecton's) or a unified framework (like Apache Beam) that compiles to both engines. This eliminates the #1 cause of model performance mismatch in production.