DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Unified ML Feature Store & Data Platform

Design a high-scale data platform capable of orchestrating both batch and streaming pipelines for a real-time recommendation system. The system must handle over 1M events per second, provide a unified interface for feature engineering to eliminate training-serving skew, and support low-latency (<10ms) feature retrieval for online inference. Detail the architecture for point-in-time correct training set generation, the integration of a feature store, and how you ensure data quality and reliability across the offline and online paths.
Spark
Flink
Kafka
Iceberg
Redis
Debezium
Great Expectations
Feature Store
CDC
Point-in-time Joins
Questions & Insights

Clarifying Questions

Business Goal: Is this platform intended for a specific application (e.g., Real-time Recommendations) or a general-purpose ML platform? (Assumption: Support for a high-scale Recommendation System).
Constraints & Scale: What is the scale of events per second and total features? (Assumption: 1M events/sec, 10k+ features, 100TB+ historical data).
Latency & Freshness: What are the P99 latency requirements for feature retrieval and the data freshness (SLA) for streaming? (Assumption: <10ms retrieval, <1s freshness).
Consistency: Is strict online-offline consistency required? (Assumption: Yes, to prevent training-serving skew).
Assumptions: I assume a hybrid architecture (Lambda/Kappa) using a Feature Store to abstract the complexity of data movement for Data Scientists.

Thinking Process

Identify the Core Conflict: The fundamental challenge is the "Online-Offline Divergence"—where features used for training (batch) differ from those used for inference (streaming), leading to model degradation.
The Bottleneck: Point-in-time joins for historical training data are computationally expensive. Real-time aggregations (e.g., "clicks in the last 5 minutes") require stateful stream processing.
Strategy: I will design a Unified Feature Store architecture. This centralizes the logic so that a single feature definition produces both batch data for training and low-latency lookups for serving.
Scaling: Use decoupled compute (Spark/Flink) and storage (Data Lake/NoSQL) to scale independently.

Elite Bonus Points

Point-in-Time Correctness (Time Travel): Implementing a "pit-join" mechanism to prevent data leakage by ensuring labels are only joined with features that existed at the exact timestamp of the event.
Change Data Capture (CDC): Using CDC from production databases (Postgres/MySQL) via Debezium into Kafka to turn static state into a stream, enabling real-time feature updates without polling.
Feature Versioning & Warm-start: Treating feature logic as code (GitOps). When logic changes, trigger "backfills" to re-calculate historical values and "warm-start" the online store.
Cost-Optimized Tiering: Moving cold features (rarely accessed) to cheaper storage (S3) while keeping hot features in memory (Redis), managed by an automated TTL policy.
Design Breakdown

Requirements

Product Goal: Enable Data Scientists to define features once and deploy them to both training (batch) and serving (streaming) environments.
Success Metrics:
Online Metrics: Feature retrieval latency (P99), Data freshness (End-to-end latency from event to feature availability).
Offline Metrics: Feature consistency score (online vs. offline value match), Training set generation time.
Guardrail Metrics: System uptime (SLAs), Cost per feature, Storage growth.
System Constraints: 1M+ QPS at peak, 100ms end-to-end freshness, multi-region availability.
Data Availability: Interaction logs (Kafka), User/Item metadata (S3/Snowflake), Production DBs (CDC).

ML Problem Framing

ML Task Type: This is a Meta-System Design for Feature Engineering and Data Orchestration.
Prediction Target: The system serves f(x) where x is a vector of features [u, i, c] (User, Item, Context).
Inputs:
Batch: Historical logs in S3.
Streaming: Real-time event streams in Kafka.
Outputs:
Offline: Parquet files for model training.
Online: Low-latency key-value pairs for model inference.
ML Challenges: Training-serving skew, data leakage, and handle massive write-heavy workloads for real-time feature updates.

Design Summary & MVP

Concise Summary: A dual-path architecture using a Unified Feature Store. Batch paths (Spark) handle historical aggregations, while streaming paths (Flink) handle real-time windowed features, both governed by a shared Metadata Layer.
Model Architecture & Selection:
Baseline: Manual SQL scripts and CSV exports.
Target: Feature Store (e.g., Feast or Tecton-like) with Flink for stream-processing and Redis for online serving.
Simplicity Audit: We avoid custom-built engines. We use industry-standard Kafka for messaging, Spark for batch, and Redis for serving. This follows YAGNI by not over-building custom state-management logic.
Architecture Decision Rationale: A Feature Store is the best choice here because it explicitly solves the training-serving skew and provides a registry for discovery and reuse, which is critical at scale.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: High-volume clickstream (Protobuf format in Kafka) and dimensional data (S3/Snowflake).
Data Ingestion: Kafka acts as the buffer. For reliability, we use "At-least-once" semantics with idempotent writes to the Data Lake to ensure no data loss during spikes.
Data Storage: Apache Iceberg on S3. Iceberg allows for schema evolution and ACID transactions, which is crucial when multiple Spark jobs write to the same feature tables.
Data Quality: Deployed at the ingestion gate. Using Great Expectations to validate schemas, null counts, and range checks before data hits the silver/gold layers.

Feature Pipeline

Offline Feature Pipeline: Spark SQL jobs run nightly to compute "Long-term" features (e.g., user's 30-day purchase history). Results are stored in Parquet for high-throughput reading.
Online Feature Pipeline: Flink jobs compute "Real-time" features (e.g., clicks in the last 10 seconds). Flink maintains state in RocksDB to handle windowed aggregations efficiently.
Feature Store: The Registry (YAML/Python-based definitions) ensures that user_avg_spend is calculated the same way in both Spark (SQL) and Flink (DataStream API).

Model Architecture

Task: While the platform is general, we optimize for Ranking (Pointwise).
Architecture: Two-Tower Models for retrieval (User/Item embeddings) and DeepFM/DCN for ranking.
Complexity: Embeddings are updated via the streaming pipeline and pushed to a Vector DB (e.g., Milvus/Pinecone) for low-latency similarity search.

Training Pipeline

Dataset Construction: The Point-in-time Joiner is the "Hero" component. It takes a list of (entity_id, event_timestamp, label) and joins it with the Feature Store history by finding the latest feature value v where v.timestamp \le event.timestamp.
Retraining: Automated via Airflow when "Feature Drift" exceeds a threshold (measured by KS-test on input distributions).

Serving Pipeline

Serving Pattern: Hybrid. Fetch pre-computed features from Redis, compute context-only features (e.g., current device) on the fly.
Latency Optimization: Pipelined Redis requests (MGET) to fetch 100+ features in a single RTT.
Fallback: If the Online Feature Store times out, the service falls back to "Popularity-based" or "Global Average" features to ensure availability.

Evaluation Pipeline

Offline: Metric-store for NDCG and AUC.
Online: Interleaving or A/B testing framework. We log "Feature Snapshots" during inference to the Data Lake to allow for Counterfactual Evaluation later.

Monitoring Pipeline

System: Prometheus/Grafana for QPS, latency, and cache hit rates.
ML Specific: Population Stability Index (PSI) monitoring to detect if the distribution of incoming features at inference time has shifted significantly from the training distribution.
Wrap Up

Final Evaluation

Trade-offs: We chose Eventual Consistency for real-time features (it's okay if a click takes 500ms to show up in the count) to gain massive write scalability.
Failure Modes: If Kafka lags, features become stale. We implement "Freshness Monitoring" that alerts if the gap between event_time and processing_time grows beyond 5 seconds.
Distinguishing Insights: For a Principal level, emphasize the Unification of Feature Logic. Instead of two separate codebases (SQL for Spark, Java for Flink), use a DSL (like Tecton's) or a unified framework (like Apache Beam) that compiles to both engines. This eliminates the #1 cause of model performance mismatch in production.