The Question
DesignDistributed Log Collection System
Design a distributed log collection system that aggregates server logs from multiple global data centers. The system should ingest high-volume log streams in real time, support structured and unstructured log formats, and make logs searchable for debugging, monitoring, and alerting purposes.
Kafka
ClickHouse
S3
Fluent Bit
Redis
Questions & Insights
Thinking Process
The core challenge of global log collection is balancing ingestion durability against cross-region egress costs. A naive approach of sending every log line over the WAN will be prohibitively expensive and brittle.
How do we minimize the impact of global network latency and egress costs?
Use regional collectors to batch, compress, and deduplicate logs locally before transmitting them to the central hub.
How do we handle "noisy neighbors" or sudden log spikes in one DC?
Implement a persistent messaging buffer (Kafka/Pulsar) at the ingestion point to decouple log production from the storage indexing rate.
How do we ensure zero data loss during a 1-hour WAN partition?
Use local disk-buffered agents (e.g., Fluent Bit) and a regional messaging layer that can store data until the link is restored.
What is the most cost-effective storage for TBs of logs?
A tiered approach: Index only recent logs in a Columnar store (ClickHouse) or Search Engine (OpenSearch), while archiving everything to low-cost Object Storage (S3).
Bonus Points
Adaptive Sampling: Automatically drop
DEBUG level logs or sample INFO logs when the regional buffer exceeds a specific threshold (backpressure-driven shedding).Schema-on-Read vs. Schema-on-Write: Using Protobuf for internal transport to reduce payload size by up to 80% compared to JSON, while maintaining strict schema evolution.
Edge Processing: Implementing "Log Fingerprinting" at the agent level to group similar stack traces, sending only the pattern and a hit count rather than redundant raw strings.
Data Sovereignty Logic: Implementing metadata tagging to ensure logs from specific regions (e.g., EU-West-1 for GDPR) are routed to regional silos rather than a global pool.
Design Breakdown
Functional Requirements
Collect logs from 10,000+ servers across 10 global regions.
Centralized search and visualization interface.
Support for varied log formats (JSON, Syslog, Multi-line).
Retention policy: 7 days hot (searchable), 30 days cold (archived).
Non-Functional Requirements
Durability: 99.99% (Logs must not be lost once acknowledged by the regional collector).
Scalability: Handle 10 TB+ of log data per day.
Low Latency: Logs should be searchable within 60 seconds of generation.
Cost Efficiency: Minimize cross-region data transfer fees.
Estimation
Total Logs: 10 DCs 1,000 servers/DC * 1 GB/server/day = 10 TB/day**.
Average Throughput: 10 TB / 86,400s ≈ 115 MB/s global average.
Peak Load: 3x average ≈ 350 MB/s.
Storage (Hot): 10 TB * 7 days = 70 TB (before replication/indexing overhead).
Egress Savings: 5:1 compression ratio via Zstandard reduces 10 TB/day WAN traffic to 2 TB/day.
Blueprint
Concise Summary: A hub-and-spoke architecture where lightweight agents push logs to Regional Buffers, which are then drained by a Central Ingestion Pipeline into a Columnar Database.
Log Agent: Lightweight sidecar (Fluent Bit) that tails files and buffers to local disk.
Regional Messaging: Kafka cluster per DC to provide backpressure and local durability.
Central Aggregator: Consumer service that pulls from regional Kafkas, performs normalization, and writes to storage.
Columnar Store: ClickHouse for high-speed indexing and SQL-based log querying.
Object Storage: S3 for long-term, low-cost raw log archival.
Simplicity Audit: This design avoids complex stream processing (Flink) and heavy search engines (Elasticsearch) in favor of ClickHouse, which handles high-volume inserts on commodity hardware more efficiently for an MVP.
High Level Architecture
Sub-system Deep Dive
Service
Log Agents: Configured as a DaemonSet (K8s) or systemd service. Uses Disk-Assisted Queues; if the Regional Kafka is down, agents buffer up to 10GB locally.
Central Aggregator: A Go-based stateless service.
Batching: Pulls messages from 10 Regional Kafkas and batches them into 10MB chunks or 5-second windows before writing to ClickHouse to optimize IOPS.
Normalization: Converts various timestamps to UTC and injects
region_id and host_id metadata.Storage
Data Model (ClickHouse):
timestamp (DateTime64) - Primary Sort Key.service_name (LowCardinality String) - Secondary Sort Key.log_level (Enum).message (String).metadata (Map String, String).Database Logic: Uses the
MergeTree engine. Employs TTL moves to automatically shift data older than 7 days from SSD to S3-backed tables.Cache
Redis: Stores "Throttling Rules" and "Service Metadata".
TTLs: 5 minutes.
Logic: If a specific
service_id exceeds its log quota (defined in Metadata), the Aggregator signals the Regional Agents (via config update or backpressure) to drop non-critical logs.Messaging
Regional Kafka:
Topic Structure: Single
raw_logs topic with partitions based on host_id to ensure ordering per host.Retention: 24 hours (sufficient for resolving WAN outages).
Compression:
producer level compression (Zstd) enabled to save bandwidth.Wrap Up
Advanced Topics
Trade-offs:
Consistency vs. Availability: The system favors Availability. In the event of a storage delay, Kafka buffers data; we prioritize accepting logs over immediate searchability.
Search Complexity: ClickHouse is less flexible than Elasticsearch for full-text fuzzy searching but 10x more efficient for structured log aggregation and "grep-like" filtering.
Bottlenecks:
The Central Aggregator could become a bottleneck. It must be horizontally scaled based on the consumer lag in Regional Kafka clusters.
WAN Egress: 10 global DCs can incur massive costs. Compression is mandatory.
Failure Handling:
Kafka Partitioning: If a Kafka broker fails, the Log Agents use internal retries and round-robin to healthy brokers.
Regional Isolation: If the Central Hub is down, Regional Kafkas hold 24h of data.
Alternatives:
Loki: Could be used instead of ClickHouse. Loki is cheaper (only indexes metadata) but slower for deep-text searching across large volumes.
Vector: Could replace Fluent Bit for higher performance and better native transformation capabilities.