The Question

Scalable Unified Data Platform (Batch & Streaming)

Design a high-throughput data platform capable of processing 100TB of daily data via both batch and streaming jobs. The system must support 'Exactly-Once' processing semantics, unified metadata management for cross-job consistency, and a 'Lakehouse' storage model to handle ACID transactions on object storage. Address how the platform will manage the 'small files' problem inherent in streaming ingestion and ensure cost-efficient resource scaling during peak loads.

Apache Spark

Apache Kafka

Apache Iceberg

Kubernetes

Apache Airflow

Trino

Avro

Prometheus

OpenLineage

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected daily data volume (e.g., 10TB vs. 1PB) and peak events per second for streaming?

Latency Requirements: Does "streaming" imply sub-second real-time processing or is near-real-time (seconds to minutes) acceptable?

Data Variety: Are we handling structured (SQL), semi-structured (JSON/Logs), or unstructured (Images/Binary) data?

User Personas: Is the platform self-service for Data Scientists/Analysts to submit their own SQL/Python scripts, or is it a fixed pipeline for a specific application?

Assumptions for this design:

Scale: 100 TB of daily ingestion; ~1M events per second peak.

Latency: Streaming jobs require < 5-second end-to-end latency.

Architecture: A "Lakehouse" approach using Apache Iceberg to provide ACID transactions over S3.

Environment: Cloud-native (AWS/GCP/Azure) with Kubernetes-based compute.

Thinking Process

Core Bottleneck: The primary challenge is unifying the storage format and metadata management so that batch and streaming jobs can operate on the same data without duplication or consistency issues.

Key Questions for Design:

How do we ingest high-velocity streams while maintaining durability?

How do we achieve "Exactly-Once" processing semantics across both batch and stream?

How do we provide a unified table abstraction that supports both small-append streaming and large-overwrite batch updates?

How do we scale compute independently of storage to manage costs?

Bonus Points

Table Format (Iceberg): Using Apache Iceberg instead of raw Parquet files to solve the "small files" problem in streaming and provide snapshot isolation.

Compute Unification: Leveraging Spark Structured Streaming as a single engine for both batch and stream to reduce code-base fragmentation and operational overhead.

Spot Instance Orchestration: Implementing a "Graceful Decommissioning" strategy for Spark on K8s to utilize cheap Spot instances without losing intermediate shuffle data.

Data Governance: Integrating an automated Schema Registry and Lineage tracking (e.g., OpenLineage) from day one to ensure data quality.

Design Breakdown

Functional Requirements

Core Use Cases:

Ingest real-time events via a message bus.

Execute scheduled batch ETL (Extract, Transform, Load) jobs.

Provide a unified SQL interface for querying data.

Support "Time-travel" (querying historical versions of data).

Scope Control:

In-scope: Data ingestion, processing, and storage architecture.

Out-of-scope: Complex visualization/BI tool building, machine learning model training logic, or source-system database administration.

Non-Functional Requirements

Scale: Support horizontal scaling of compute (1000s of nodes) and storage (Exabyte-ready).

Latency: Streaming latency < 5s; Batch job throughput optimized for large-scale joins.

Availability & Reliability: 99.9% availability; data must be persisted in three availability zones (Object Storage).

Consistency: ACID transactions at the table level (Lakehouse model).

Fault Tolerance: Automatic retries for batch; Checkpointing for streaming to resume from the last offset.

Estimation

Traffic: 100 TB/day = ~1.15 GB/s average. Peak is ~3x (3.5 GB/s).

Events: Assuming 1KB avg size, ~1.15M events/sec.

Storage: 100 TB/day with 30-day retention for hot data = 3 PB. (Compressed 3:1 = 1 PB).

Compute: Assuming 1 core per 10MB/s processing, we need ~120-300 cores for ingestion and significantly more for batch transformations.

Blueprint

Concise Summary: A unified Lakehouse architecture where Apache Kafka handles ingestion, Apache Spark (on K8s) handles both batch and streaming compute, and Apache Iceberg on S3 provides the storage layer with ACID guarantees.

Major Components:

Message Bus (Kafka): Decouples producers from consumers and provides a 24-hour replay buffer.

Compute Engine (Spark): Unified engine for Structured Streaming (real-time) and Spark SQL (batch).

Lakehouse Storage (Iceberg + S3): High-performance table format that manages metadata and ensures consistency.

Orchestrator (Airflow): Manages dependencies and scheduling for batch pipelines.

Simplicity Audit: By using Spark for both Batch and Stream (Structured Streaming), we eliminate the need for two different technology stacks (like Flink + Spark), reducing operational complexity for the MVP.

Architecture Decision Rationale:

Why this architecture?: The Lakehouse model (Iceberg/S3) provides the reliability of a Data Warehouse with the cost-efficiency of a Data Lake.

Functional Satisfaction: Covers both streaming ingestion and batch processing using a unified API.

Non-functional Satisfaction: S3 scales infinitely; Spark on K8s allows for dynamic resource allocation and cost control.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Spark on Kubernetes (K8s): Deploying Spark drivers and executors as pods. This allows for fine-grained resource scaling (Horizontal Pod Autoscaler) based on CPU/Memory utilization.

API Schema Design:

Job Submission API (REST):

POST /v1/jobs: Submits a Spark JAR or Python file.

GET /v1/jobs/{id}/status: Monitors health.

Protocols: gRPC for high-performance internal communication; REST for external management.

Resilience & Reliability:

Checkpointing: Streaming jobs store their state in S3. If a pod fails, the new pod reads the checkpoint and resumes from the exact Kafka offset.

Retry Logic: Airflow handles retries for batch jobs with exponential backoff.

Storage

Access Pattern:

Write-heavy for streaming ingestion; Read-heavy for batch analytics and BI.

Database Table Design:

Apache Iceberg:

Snapshot management: Every write creates a new snapshot.

Partitioning: Hidden partitioning (e.g., by day/hour) to prune files during queries.

Compaction: A background process merges "small files" generated by streaming into large Parquet files for read optimization.

Technical Selection: S3 + Iceberg.

Rationale: Decouples compute and storage. S3 provides 99.999999999% durability. Iceberg provides ACID transactions which S3 lacks natively.

Distribution Logic:

Sharding is handled by S3 prefixing (using Iceberg's hash-based naming) to avoid S3 request throttling.

Messaging

Purpose & Decoupling: Kafka acts as the shock absorber. It buffers bursts of incoming data and allows multiple consumers (Streaming Job, Archival Job) to read the same data independently.

Event / Topic Schema:

Use Avro for the payload schema.

Partitioning: Keyed by user_id or device_id to ensure ordering within a partition.

Technical Selection: Kafka.

Rationale: Industry standard for high-throughput messaging with strong ordering guarantees.

Data Processing

Processing Model: Spark Structured Streaming.

Uses the same DataFrame API as batch, allowing developers to write logic once and toggle between read (batch) and readStream (streaming).

Correctness Guarantees:

Exactly-once: Achieved through the combination of Kafka offsets and Iceberg's atomic commit (idempotent writes).

Technical Selection: Spark.

Rationale: Best-in-class for batch processing; Structured Streaming is mature enough for most "near-real-time" needs.

Infrastructure (Optional)

Observability:

Prometheus/Grafana: Monitoring Spark executor metrics and Kafka lag.

OpenLineage: Capturing metadata about which job produced which table for data governance.

Wrap Up

Advanced Topics

Trade-offs (Latency vs. Cost): We use Spark Structured Streaming with "Micro-batching" (e.g., 1-minute intervals). This increases latency slightly compared to Flink's "Continuous" mode but significantly reduces costs by allowing more efficient file writes to S3.

Reliability: We implement a Dead Letter Queue (DLQ). If a record fails schema validation or processing, it is sent to a separate Kafka topic for manual inspection, preventing the entire pipeline from stalling.

Bottleneck Analysis:

Small Files: Streaming writes small files. Mitigation: Run an "Iceberg Maintenance" job every hour to compact files.

S3 Throttling: Mitigation: Use a tiered prefixing strategy in Iceberg metadata.

Distinguishing Insights:

Schema Evolution: Iceberg supports "Full Schema Evolution" (adding, renaming, dropping columns) without rewriting data files, which is critical for long-running streaming platforms where source schemas change often.

Cost Optimization: Using S3 Intelligent-Tiering to automatically move data that hasn't been accessed in 30 days to a cheaper storage tier (Infrequent Access).