The Question
Design

Scalable Centralized Logging and Observability Platform

Design a high-scale centralized logging system capable of ingesting 100 TB of data per day from thousands of microservices. The system must support real-time full-text search for the most recent 7 days of logs and provide cost-effective archival for up to 30 days. Address how the system handles extreme traffic spikes, ensures minimal impact on application performance, and manages the operational complexity of large-scale indexing. Discuss trade-offs between ingestion latency, search performance, and storage costs.
Kafka
OpenSearch
S3
FluentBit
gRPC
Protobuf
Go
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected log volume per day and the peak ingestion rate (Events Per Second)?
Search and Query Requirements: Do we need full-text search capability, or is simple filtering by metadata (service ID, timestamp, level) sufficient?
Retention Policy: How long should logs be kept in "hot" (searchable) storage vs. "cold" (archival) storage?
Data Format: Are logs strictly structured (JSON) or can they be unstructured/plain text?
Assumptions for this design:
Scale: 100 TB of logs per day (~1.1 GB/sec avg). Peak ingestion of 2M events/sec.
Query: Full-text search and filtering are required for debugging.
Retention: 7 days searchable, 30 days archival.
Consistency: Eventual consistency for search is acceptable (ingestion-to-search lag < 10 seconds).

Thinking Process

To build a resilient and scalable logging system, we focus on decoupling ingestion from processing to handle spikes without data loss.
How do we ingest logs without impacting application performance? Use a local log agent (sidecar or daemon) to buffer logs and send them asynchronously to a central collector.
How do we handle massive write bursts? Implement a distributed message queue (Kafka) as a write buffer to decouple the high-velocity ingestion from the slower indexing process.
How do we ensure cost-effective storage? Use a tiered storage approach—indexed storage (OpenSearch) for recent logs and object storage (S3) for long-term archives.
How do we ensure search performance? Shard the search index by time (e.g., daily indices) to keep active indices small and fast.

Bonus Points

Log Sampling & Dynamic Throttling: Implement an "adaptive sampling" mechanism at the agent level to drop 90% of DEBUG logs during high-traffic spikes or when the queue is backing up.
Indexing Pre-aggregation: Use stream processing to pre-calculate metrics (e.g., error counts per service) before indexing, reducing the search load for common dashboard queries.
Zero-Copy Ingestion: Optimize the log agent to use sendfile() or similar zero-copy primitives to minimize CPU overhead on host machines.
Schema-on-Read vs. Schema-on-Write: Support flexible JSON indexing while enforcing a "Common Schema" (ECS) for critical fields like trace_id and timestamp to enable cross-service correlation.
Design Breakdown

Functional Requirements

Core Use Cases:
Centralized log ingestion from multiple services/hosts.
Full-text search and filtering via a UI/API.
Long-term archival for compliance.
Real-time tailing of log streams.
Scope Control:
In-scope: Ingestion, buffering, indexing, search, and archival.
Out-of-scope: Real-time alerting/anomaly detection (separate system), Tracing/APM (handled by Zipkin/Jaeger).

Non-Functional Requirements

Scale: Support 1M+ events per second.
Latency: Ingestion-to-search lag < 10s; Search queries < 2s for 24h range.
Availability & Reliability: 99.9% availability. Logs must not be lost once acknowledged by the ingestion API (Durability).
Consistency: Eventual consistency for search; Strong durability for archival.
Fault Tolerance: Handle regional or availability zone failures in the ingestion pipeline.

Estimation

Traffic:
1M events/sec * 86,400s \approx 86 Billion events/day.
Avg event size: 1 KB (including metadata).
Total Daily Data: ~86 TB/day.
Storage:
Searchable (7 days): 86 TB * 7 \approx 600 TB (uncompressed). With indexing overhead and replication, ~1.2 PB.
Archival (30 days): 86 TB * 30 \approx 2.6 PB. Compressed (3:1), ~900 TB.
Bandwidth:
Ingress: ~1 GB/sec (Continuous).
Egress: Variable based on search queries, likely much lower.

Blueprint

The design follows a "Collect-Buffer-Index-Store" architecture. It uses a lightweight agent on host machines to minimize resource usage, Kafka to protect the system from ingestion spikes, and OpenSearch for high-performance searching.
Log Agent: Lightweight sidecar (FluentBit) that gathers logs locally.
Ingestion Service: Stateless API that receives logs and writes to Kafka.
Message Bus (Kafka): High-throughput persistent buffer.
Log Indexer: Consumer group that transforms and pushes logs to the search engine.
Search Engine (OpenSearch): Distributed text search and storage.
Object Storage (S3): Low-cost cold storage for raw logs.
Simplicity Audit: This design avoids complex stream processing frameworks (like Flink) in the MVP stage, using a simple consumer group for indexing.
Architecture Decision Rationale:
Kafka is chosen because it allows independent scaling of the producers (applications) and consumers (indexers).
OpenSearch is the industry standard for high-volume log search due to its inverted index.
S3 provides virtually infinite durability and low cost for the 90% of logs that are never searched but must be kept.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Ingestion Service: Stateless Go/Rust service deployed in an Auto-Scaling Group (ASG) based on CPU/Request count.
Log Indexer: Containerized consumers scaling based on Kafka consumer lag.
API Schema Design:
POST /v1/logs
Protocol: gRPC (for high-efficiency ingestion) or REST/JSON.
Request: [{ "timestamp": long, "level": string, "service": string, "message": string, "metadata": map }]
Idempotency: Client-generated log_id for deduplication.
Resilience:
Backpressure: If Kafka is full, the Ingestion Service returns HTTP 429 (Too Many Requests), signaling the Log Agent to buffer on disk.
Retries: Log Agents use exponential backoff with jitter when sending to the Ingestion Service.

Storage

Access Pattern:
99% Write, 1% Read.
Search queries are mostly for the last 1-6 hours of data.
Database Table Design (OpenSearch Indices):
Index Pattern: logs-{service_id}-{yyyy-mm-dd}.
Mapping: timestamp (date), level (keyword), message (text), trace_id (keyword).
Technical Selection:
OpenSearch: Best for full-text search.
S3: Object storage for partitioned Parquet files (partitioned by dt=YYYY-MM-DD/hh=HH/service=...).
Distribution Logic:
Sharding: OpenSearch indices are sharded by trace_id or log_id to distribute load.
Hot/Warm Architecture: Recent logs on SSD nodes; older logs moved to HDD nodes before deletion.

Messaging

Purpose: Decouples ingestion from indexing and provides a 24-hour data buffer in case the search cluster is down.
Event Schema: JSON or Protobuf. Includes raw message and enriched metadata (e.g., host IP, cluster ID).
Throughput & Partitioning:
Partitioned by service_id or host_id to ensure log ordering per service instance.
500+ partitions to support high-parallelism indexers.
Failure Handling:
Dead-letter Queue (DLQ): Logs that fail indexing (e.g., due to schema mapping conflicts) are sent to a separate Kafka topic for manual inspection.

Data Processing

Processing Model: Micro-batching. The Indexer pulls a batch of 500-1000 messages from Kafka and performs a bulk write to OpenSearch.
Transformations:
Timestamp normalization.
PII Masking (e.g., scrubbing credit card numbers via Regex).
Geo-IP lookup for IP addresses.
Technical Selection: Custom Go-based worker for low memory footprint and high concurrency.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We prioritize Availability and Partition Tolerance (AP). It is better to have delayed logs in search than to block the application from writing them.
Reliability:
Disk Buffer: The Log Agent (FluentBit) is configured with a local disk buffer. If the network or Ingestion Service is down, logs are saved to the host disk.
Bottleneck Analysis:
OpenSearch Indexing: The primary bottleneck. We use "Index Lifecycle Management" (ILM) to roll over indices based on size (e.g., 50GB per shard) to keep them manageable.
Security:
mTLS for all traffic from Agent to Ingestion Service.
RBAC in OpenSearch to restrict developers to only see logs from their specific team's services.
Distinguishing Insights:
Secondary Indexing: For the S3 archival, we can generate a "Summary Index" (Bloom Filters or Min/Max timestamp files) to allow the Search API to quickly identify which S3 files to pull for historical deep-dives without scanning the whole bucket.