The Question
DesignScalable Time-Series Metrics Infrastructure
Design a distributed metrics collection and aggregation system capable of handling 10 million data points per second. The system should support multi-dimensional tagging, real-time alerting queries, and long-term storage with automated downsampling, while ensuring high availability and low ingestion latency for client applications.
Kafka
Flink
TSDB
gRPC
Redis
Protobuf
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected volume of metrics? (e.g., 10 million metrics per second, or 100k?).
Data Retention and Resolution: How long do we need to keep raw data versus downsampled data (e.g., 7 days raw, 1 year aggregated)?
Query Patterns: Is the system primarily for real-time alerting (low latency) or historical trend analysis (high throughput)?
Cardinality: How many unique time-series (combinations of metric name + tags) are we expecting? This heavily influences the indexing strategy.
Reliability vs. Cost: Is 100% data durability required for every single metric, or is a small loss (e.g., < 0.1%) acceptable to reduce costs?
Assumptions for MVP:
Scale: 10 million metrics per second (Write-heavy).
Latency: Near real-time (End-to-end latency < 30 seconds).
Retention: 15 days raw data, indefinite for 1-hour rollups.
Cardinality: Up to 100 million active time-series.
Thinking Process
Core Bottleneck: The primary challenge is the "Write Heavy" nature of metrics and the "High Cardinality" problem which chokes traditional indexes.
Step 1: Ingestion Buffer: Use a message bus to decouple high-volume writes from the storage layer to prevent cascading failures during spikes.
Step 2: Stream Aggregation: Perform "In-flight" rollups (Sum, Avg, Max) to reduce the volume of data hitting the long-term storage.
Step 3: Time-Series Storage: Utilize a specialized TSDB that uses LSM-tree or similar structures optimized for time-ordered appends.
Step 4: Series Indexing: Implement a separate index for metadata (tags/labels) to allow fast filtering across millions of series.
Bonus Points
T-Digest for Percentiles: Use T-Digest or HdrHistogram in the aggregation layer to calculate accurate p99/p999 latencies without storing every raw value.
Zero-Copy Ingestion: Use gRPC with Protobuf and sidecar-based local aggregation (e.g., StatsD style) to minimize CPU overhead on application nodes.
Write-Ahead Logging (WAL) Optimization: Discussing how to bypass the OS page cache or use NVMe-optimized storage for the TSDB to handle high IOPS.
Downsampling Policy: Automated background jobs that transition data from high-resolution (1s) to low-resolution (1m, 1h) to optimize storage costs.
Design Breakdown
Functional Requirements
Ingest metrics from various services via HTTP/gRPC.
Support multi-dimensional data (tags/labels).
Provide a query API for retrieving time-series data over a specific range.
Support basic aggregations (Sum, Count, Avg, Min, Max, Percentiles).
Ensure data persistence for historical analysis.
Non-Functional Requirements
High Availability: The system must be available even if one zone fails (99.9% uptime).
Scalability: Must scale horizontally to handle metric bursts during deployments or outages.
Low Latency Ingestion: Ingestion should not block the calling application.
Query Efficiency: Typical dashboard queries should return in < 500ms.
Estimation
Ingestion Volume: 10M metrics/sec.
Data Size: 1 metric entry \approx 100 bytes (Name + Timestamp + Value + Tags).
Raw Bandwidth: 10M * 100 bytes = 1 GB/s ingress.
Daily Storage (Raw): 1 GB/s * 86,400s \approx 86 TB/day.
15-Day Retention: 86 TB * 15 \approx 1.3 PB.
Aggregation Impact: 1-minute rollups reduce data volume by ~60x compared to 1s resolution.
Blueprint
Concise Summary: A distributed pipeline where metrics are pushed to a collector, buffered in Kafka, processed by a stream engine for real-time rollups, and stored in a specialized TSDB.
Major Components:
Metric Collector: Stateless gRPC service that receives and validates incoming metrics.
Kafka: Acts as a high-throughput buffer to protect downstream components from spikes.
Stream Processor: Aggregates raw metrics into time-windowed buckets (rollups).
TSDB: Optimized storage engine for time-series data and tag-based indexing.
Query Service: API layer that handles read requests and performs final aggregations.
Simplicity Audit: This architecture uses the "Lambda" Lite approach—streaming for all data paths—avoiding complex batch-layer synchronization for the MVP.
Architecture Decision Rationale:
Why this architecture?: Decoupling via Kafka is essential for a 10M/s system. Without it, storage backpressure would crash the ingestion services.
Functional Satisfaction: Covers end-to-end flow from ingestion to query.
Non-functional Satisfaction: Kafka provides durability and horizontal scaling; TSDB provides query efficiency for time-range lookups.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Collectors: Stateless, deployed in Auto-scaling groups based on CPU/Network IO.
Query Service: Stateless, scaled based on QPS and latency.
API Schema Design:
Ingest:
POST /v1/metrics (Protobuf over gRPC).Request:
{ metric_name, timestamp, value, tags: {key: value} }Query:
GET /v1/query?query=<expr>&start=<ts>&end=<ts>Resilience & Reliability:
Retries: Collectors return 202 Accepted once data is in Kafka. Clients use exponential backoff for retries.
Circuit Breaker: If Kafka is down, collectors drop metrics (fail-fast) to protect the source applications.
Observability:
The system monitors itself (Meta-metrics) using a separate, smaller instance of the same design.
Storage
Access Pattern: 99% writes for ingestion; 1% reads for dashboards/alerts. Reads are usually for the last 1-6 hours.
Database Table Design:
Data Table:
(series_id, timestamp, value). Clustered by series_id and timestamp.Metadata Index:
(tag_key, tag_value, series_id). Inverted index to map tags to IDs.Technical Selection: M3DB or VictoriaMetrics.
Rationale: These are designed for high-cardinality and provide built-in clustering and downsampling.
Distribution Logic:
Sharding by
series_id to ensure all data for a single metric line lands on the same node for efficient range scans.Cache
Purpose & Justification: Redis is used for "Series Discovery."
Bottleneck: Checking if a
series_id exists in the TSDB for every write is slow.Key-Value Schema:
Key:
metadata:{hash(metric_name+tags)}Value:
series_idTTL: 24 hours (refreshed on hit).
Failure Handling: If Redis is down, the system defaults to writing to TSDB (cache-aside), which handles the "upsert" logic.
Messaging
Purpose & Decoupling: Kafka decouples the ingest spike from the processing speed.
Throughput & Partitioning:
Partitioned by
metric_name to ensure ordering for specific metrics during aggregation.Technical Selection: Kafka.
Rationale: Industry standard for 1GB/s+ throughput with high durability.
Data Processing
Processing Model: Flink (Streaming).
Processing DAG:
Consume from Kafka -> Windowing (1m) -> Aggregate (Sum/Count) -> Write to TSDB + Write to Redis (Series Index).
Correctness Guarantees: "At-least-once" is sufficient for metrics; "Exactly-once" is preferred for billing-related metrics but adds overhead.
Technical Selection: Apache Flink.
Rationale: Best-in-class support for windowing and state management.
Wrap Up
Advanced Topics
Trade-offs (Consistency vs Availability): Following the PACELC theorem, we prioritize Availability (AP). In a metrics system, missing a few data points is better than blocking the entire application's execution.
Bottleneck Analysis:
High Cardinality: If users add "user_id" as a tag, the index explodes.
Mitigation: Implement "Cardinality Guardrails" at the Collector level that reject metrics with too many unique tag values.
Optimization (Data Compression): Use Gorilla Compression (Delta-of-Delta encoding) for timestamps and values, which can compress 16 bytes of data into ~1.37 bits.
Security: Metrics often contain PII in tags. Implement a "PII Scrubber" in the Flink layer to mask sensitive tag values based on a regex registry.