The Question
Design

Scalable Time-Series Metrics Infrastructure

Design a distributed metrics collection and aggregation system capable of handling 10 million data points per second. The system should support multi-dimensional tagging, real-time alerting queries, and long-term storage with automated downsampling, while ensuring high availability and low ingestion latency for client applications.
Kafka
Flink
TSDB
gRPC
Redis
Protobuf
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected volume of metrics? (e.g., 10 million metrics per second, or 100k?).
Data Retention and Resolution: How long do we need to keep raw data versus downsampled data (e.g., 7 days raw, 1 year aggregated)?
Query Patterns: Is the system primarily for real-time alerting (low latency) or historical trend analysis (high throughput)?
Cardinality: How many unique time-series (combinations of metric name + tags) are we expecting? This heavily influences the indexing strategy.
Reliability vs. Cost: Is 100% data durability required for every single metric, or is a small loss (e.g., < 0.1%) acceptable to reduce costs?
Assumptions for MVP:
Scale: 10 million metrics per second (Write-heavy).
Latency: Near real-time (End-to-end latency < 30 seconds).
Retention: 15 days raw data, indefinite for 1-hour rollups.
Cardinality: Up to 100 million active time-series.

Thinking Process

Core Bottleneck: The primary challenge is the "Write Heavy" nature of metrics and the "High Cardinality" problem which chokes traditional indexes.
Step 1: Ingestion Buffer: Use a message bus to decouple high-volume writes from the storage layer to prevent cascading failures during spikes.
Step 2: Stream Aggregation: Perform "In-flight" rollups (Sum, Avg, Max) to reduce the volume of data hitting the long-term storage.
Step 3: Time-Series Storage: Utilize a specialized TSDB that uses LSM-tree or similar structures optimized for time-ordered appends.
Step 4: Series Indexing: Implement a separate index for metadata (tags/labels) to allow fast filtering across millions of series.

Bonus Points

T-Digest for Percentiles: Use T-Digest or HdrHistogram in the aggregation layer to calculate accurate p99/p999 latencies without storing every raw value.
Zero-Copy Ingestion: Use gRPC with Protobuf and sidecar-based local aggregation (e.g., StatsD style) to minimize CPU overhead on application nodes.
Write-Ahead Logging (WAL) Optimization: Discussing how to bypass the OS page cache or use NVMe-optimized storage for the TSDB to handle high IOPS.
Downsampling Policy: Automated background jobs that transition data from high-resolution (1s) to low-resolution (1m, 1h) to optimize storage costs.
Design Breakdown

Functional Requirements

Ingest metrics from various services via HTTP/gRPC.
Support multi-dimensional data (tags/labels).
Provide a query API for retrieving time-series data over a specific range.
Support basic aggregations (Sum, Count, Avg, Min, Max, Percentiles).
Ensure data persistence for historical analysis.

Non-Functional Requirements

High Availability: The system must be available even if one zone fails (99.9% uptime).
Scalability: Must scale horizontally to handle metric bursts during deployments or outages.
Low Latency Ingestion: Ingestion should not block the calling application.
Query Efficiency: Typical dashboard queries should return in < 500ms.

Estimation

Ingestion Volume: 10M metrics/sec.
Data Size: 1 metric entry \approx 100 bytes (Name + Timestamp + Value + Tags).
Raw Bandwidth: 10M * 100 bytes = 1 GB/s ingress.
Daily Storage (Raw): 1 GB/s * 86,400s \approx 86 TB/day.
15-Day Retention: 86 TB * 15 \approx 1.3 PB.
Aggregation Impact: 1-minute rollups reduce data volume by ~60x compared to 1s resolution.

Blueprint

Concise Summary: A distributed pipeline where metrics are pushed to a collector, buffered in Kafka, processed by a stream engine for real-time rollups, and stored in a specialized TSDB.
Major Components:
Metric Collector: Stateless gRPC service that receives and validates incoming metrics.
Kafka: Acts as a high-throughput buffer to protect downstream components from spikes.
Stream Processor: Aggregates raw metrics into time-windowed buckets (rollups).
TSDB: Optimized storage engine for time-series data and tag-based indexing.
Query Service: API layer that handles read requests and performs final aggregations.
Simplicity Audit: This architecture uses the "Lambda" Lite approach—streaming for all data paths—avoiding complex batch-layer synchronization for the MVP.
Architecture Decision Rationale:
Why this architecture?: Decoupling via Kafka is essential for a 10M/s system. Without it, storage backpressure would crash the ingestion services.
Functional Satisfaction: Covers end-to-end flow from ingestion to query.
Non-functional Satisfaction: Kafka provides durability and horizontal scaling; TSDB provides query efficiency for time-range lookups.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Collectors: Stateless, deployed in Auto-scaling groups based on CPU/Network IO.
Query Service: Stateless, scaled based on QPS and latency.
API Schema Design:
Ingest: POST /v1/metrics (Protobuf over gRPC).
Request: { metric_name, timestamp, value, tags: {key: value} }
Query: GET /v1/query?query=<expr>&start=<ts>&end=<ts>
Resilience & Reliability:
Retries: Collectors return 202 Accepted once data is in Kafka. Clients use exponential backoff for retries.
Circuit Breaker: If Kafka is down, collectors drop metrics (fail-fast) to protect the source applications.
Observability:
The system monitors itself (Meta-metrics) using a separate, smaller instance of the same design.

Storage

Access Pattern: 99% writes for ingestion; 1% reads for dashboards/alerts. Reads are usually for the last 1-6 hours.
Database Table Design:
Data Table: (series_id, timestamp, value). Clustered by series_id and timestamp.
Metadata Index: (tag_key, tag_value, series_id). Inverted index to map tags to IDs.
Technical Selection: M3DB or VictoriaMetrics.
Rationale: These are designed for high-cardinality and provide built-in clustering and downsampling.
Distribution Logic:
Sharding by series_id to ensure all data for a single metric line lands on the same node for efficient range scans.

Cache

Purpose & Justification: Redis is used for "Series Discovery."
Bottleneck: Checking if a series_id exists in the TSDB for every write is slow.
Key-Value Schema:
Key: metadata:{hash(metric_name+tags)}
Value: series_id
TTL: 24 hours (refreshed on hit).
Failure Handling: If Redis is down, the system defaults to writing to TSDB (cache-aside), which handles the "upsert" logic.

Messaging

Purpose & Decoupling: Kafka decouples the ingest spike from the processing speed.
Throughput & Partitioning:
Partitioned by metric_name to ensure ordering for specific metrics during aggregation.
Technical Selection: Kafka.
Rationale: Industry standard for 1GB/s+ throughput with high durability.

Data Processing

Processing Model: Flink (Streaming).
Processing DAG:
Consume from Kafka -> Windowing (1m) -> Aggregate (Sum/Count) -> Write to TSDB + Write to Redis (Series Index).
Correctness Guarantees: "At-least-once" is sufficient for metrics; "Exactly-once" is preferred for billing-related metrics but adds overhead.
Technical Selection: Apache Flink.
Rationale: Best-in-class support for windowing and state management.
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): Following the PACELC theorem, we prioritize Availability (AP). In a metrics system, missing a few data points is better than blocking the entire application's execution.
Bottleneck Analysis:
High Cardinality: If users add "user_id" as a tag, the index explodes.
Mitigation: Implement "Cardinality Guardrails" at the Collector level that reject metrics with too many unique tag values.
Optimization (Data Compression): Use Gorilla Compression (Delta-of-Delta encoding) for timestamps and values, which can compress 16 bytes of data into ~1.37 bits.
Security: Metrics often contain PII in tags. Implement a "PII Scrubber" in the Flink layer to mask sensitive tag values based on a regex registry.