The Question
DesignUnified Data Lakehouse Platform
Design a scalable data platform capable of processing both high-throughput real-time streams and large-scale batch workloads using a unified storage layer. The system must support ACID transactions, schema evolution, and provide a SQL-based interface for analytical queries while maintaining low-latency ingestion and cost-efficient historical storage.
Kafka
S3
Apache Iceberg
Flink
Spark
Trino
Kubernetes
AWS Glue
Parquet
ACID
CDC
Questions & Insights
Clarifying Questions
What is the expected scale of data? (Assumption: 10TB daily ingestion, 100k events/sec peak for streaming).
What are the latency requirements for the "streaming" path? (Assumption: Near real-time, i.e., < 10 seconds end-to-end latency).
Who are the primary consumers? (Assumption: Data Scientists using Python/Spark and Data Analysts using SQL).
What is the source diversity? (Assumption: RDBMS CDC, Application logs via Kafka, and third-party SaaS via S3).
How is data consistency handled between batch and stream? (Assumption: We will use a Lakehouse architecture to provide a unified view of data).
Thinking Process
Core Bottleneck: The "Lambda Architecture" complexity (maintaining two codebases for the same logic).
Step 1: Establish a Unified Storage Layer (Lakehouse) using Apache Iceberg to allow both batch and stream engines to read/write to the same tables with ACID guarantees.
Step 2: Implement an Event-Driven Ingestion layer using Kafka for streaming and a "Landing Zone" for batch files.
Step 3: Deploy Decoupled Compute using Flink for low-latency transformations and Spark for heavy-lift historical processing.
Step 4: Provide a Federated Query Layer (Trino) to allow users to query data without knowing the underlying storage format.
Bonus Points
Zero-Copy Data Sharing: Use of Apache Iceberg’s metadata-only branching and tagging for staging data without physical copies.
Cost Optimization: Implementing S3 Intelligent-Tiering and Z-Ordering on Iceberg tables to minimize scanning costs for large analytical queries.
Schema Evolution: Supporting full schema evolution (add, drop, rename) without rewriting historical data, handled at the Iceberg metadata layer.
Data Quality Gateways: Integrating Great Expectations or Deequ as part of the Spark/Flink pipelines to prevent "garbage-in" to the Gold layer.
Design Breakdown
Functional Requirements
Support real-time ingestion and processing (Streaming).
Support large-scale historical data processing (Batch).
Provide ACID transactions on the Data Lake (Update/Delete support).
Unified SQL interface for data discovery and querying.
Support for Schema Registry to enforce data contracts.
Non-Functional Requirements
Scalability: Horizontal scaling of compute and storage to handle PB-scale data.
Durability: 99.999999999% (11 nines) data durability.
Availability: 99.9% availability for the query and processing layers.
Consistency: Strong consistency for metadata; eventual consistency for global read replicas.
Estimation
Ingestion: 10 TB/day = ~115 MB/s average throughput.
Streaming Peak: 100k events/sec * 1 KB/event = 100 MB/s.
Storage: 10 TB/day * 365 days = 3.65 PB/year (pre-compression). With 3x compression (Parquet/Zstd), ~1.2 PB/year.
Compute: Assuming 1 Spark core processes 20MB/s, we need ~100-200 cores for peak batch processing.
Blueprint
Concise Summary: A Lakehouse-based platform using Kafka for ingestion, S3/Iceberg for storage, and Flink/Spark for compute.
Major Components:
Kafka: Acts as the unified message bus for all streaming inputs and CDC events.
Apache Iceberg (on S3): Provides the table format for ACID transactions and unified batch/stream storage.
Flink: Handles stateful stream processing and real-time aggregations.
Spark: Executes heavy batch ETL jobs, backfills, and machine learning workloads.
Trino: Serving engine for interactive SQL queries.
Simplicity Audit: This architecture avoids the complexity of a separate Data Warehouse (like Snowflake) by using Iceberg to turn S3 into a functional database, reducing data movement and costs.
Architecture Decision Rationale:
Why?: The Lakehouse pattern is the current industry standard for unifying batch and streaming without maintaining separate systems.
Functional: Meets all needs from ingestion to querying via a "Medallion" architecture (Bronze/Silver/Gold).
Non-functional: S3 provides infinite scale; Flink/Spark/Trino are horizontally scalable via Kubernetes.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Job Manager: A Kubernetes-based service (e.g., using Flink/Spark Operators) to manage job lifecycles.
Scaling: Flink uses Horizontal Pod Autoscaler (HPA) based on Kafka lag; Spark scales based on YARN/K8s resource requests.
API Schema Design:
Job Submission API:
POST /v1/jobs (REST).Request:
{ type: "FLINK", jar_url: "s3://...", parallelism: 10, config: {...} }.Idempotency: Provided via unique Job IDs and state tracking in an internal DB.
Resilience & Reliability:
Flink Checkpointing: Enabled every 1 minute to S3 for fault tolerance.
Spark Speculation: Enabled to handle straggler nodes in batch jobs.
Storage
Access Pattern:
Bronze: Append-only raw data (High write, low read).
Silver/Gold: Compacted Parquet files via Iceberg (High read, moderate write/updates).
Database Table Design (Iceberg):
event_id: UUID (Primary Key - logical).event_time: Timestamp (Partition Key).data_payload: Struct/JSON.ingestion_time: Timestamp (Used for freshness tracking).Technical Selection: Apache Iceberg.
Rationale: Supports snapshot isolation, partition evolution, and time-travel queries.
Distribution Logic:
Partitioning: Hidden partitioning by
days(event_time) to avoid small file problems and optimize scan range.Messaging
Purpose & Decoupling: Kafka decouples producers from the processing engines and provides a 7-day buffer for streaming data.
Throughput & Partitioning:
Partitioning: Key-based partitioning (e.g.,
user_id) to ensure in-order processing for specific entities.Failure Handling: Dead-letter queues (DLQ) for malformed JSON/Avro messages.
Technical Selection: Kafka. High-throughput, industry standard for streaming ingestion.
Data Processing
Processing Model:
Streaming (Flink): Uses "Exactly-once" processing via two-phase commit to Iceberg.
Batch (Spark): Uses "Overwrite" or "Merge Into" semantics for upserts.
Processing DAG:
Ingest -> Cleanse (Filter nulls) -> Enrich (Join with dimension tables) -> Aggregate (Windowed counts) -> Sink (Iceberg).
Technical Selection: Flink for <10s latency requirements; Spark for historical re-processing and complex joins.
Infrastructure (Optional)
Observability:
Prometheus/Grafana: Monitoring Kafka lag, Flink checkpoint duration, and Spark executor memory.
OpenLineage: For data lineage tracking to see how data flows from source to final report.
Platform Security:
IAM Roles: Fine-grained access to S3 buckets.
Apache Ranger: For column-level and row-level access control on Trino/Iceberg.
Wrap Up
Advanced Topics
Trade-offs: We choose Eventual Consistency for the query layer (Trino) to favor high availability and throughput during massive write operations.
Reliability: Multi-AZ deployment of Kafka and Kubernetes nodes ensures that the failure of a single data center doesn't stop data ingestion.
Bottleneck Analysis:
Small Files: Streaming writes create many small files. Optimization: Run a periodic Spark "Compaction" job to merge small Iceberg files into larger Parquet blocks.
Metadata Pressure: The Glue Catalog can become a bottleneck. Optimization: Use Iceberg's REST catalog if Glue latency exceeds thresholds.
Security: Data is encrypted at rest using KMS (S3-SSE) and in transit via TLS 1.3.