The Question

Scalable Centralized Logging and Observability Platform

Design a high-scale centralized logging system capable of ingesting 100 TB of data per day from thousands of microservices. The system must support real-time full-text search for the most recent 7 days of logs and provide cost-effective archival for up to 30 days. Address how the system handles extreme traffic spikes, ensures minimal impact on application performance, and manages the operational complexity of large-scale indexing. Discuss trade-offs between ingestion latency, search performance, and storage costs.

Kafka

OpenSearch

FluentBit

gRPC

Protobuf

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected log volume per day and the peak ingestion rate (Events Per Second)?

Search and Query Requirements: Do we need full-text search capability, or is simple filtering by metadata (service ID, timestamp, level) sufficient?

Retention Policy: How long should logs be kept in "hot" (searchable) storage vs. "cold" (archival) storage?

Data Format: Are logs strictly structured (JSON) or can they be unstructured/plain text?

Assumptions for this design:

Scale: 100 TB of logs per day (~1.1 GB/sec avg). Peak ingestion of 2M events/sec.

Query: Full-text search and filtering are required for debugging.

Retention: 7 days searchable, 30 days archival.

Consistency: Eventual consistency for search is acceptable (ingestion-to-search lag < 10 seconds).

Thinking Process

To build a resilient and scalable logging system, we focus on decoupling ingestion from processing to handle spikes without data loss.

How do we ingest logs without impacting application performance? Use a local log agent (sidecar or daemon) to buffer logs and send them asynchronously to a central collector.

How do we handle massive write bursts? Implement a distributed message queue (Kafka) as a write buffer to decouple the high-velocity ingestion from the slower indexing process.

How do we ensure cost-effective storage? Use a tiered storage approach—indexed storage (OpenSearch) for recent logs and object storage (S3) for long-term archives.

How do we ensure search performance? Shard the search index by time (e.g., daily indices) to keep active indices small and fast.

Bonus Points

Log Sampling & Dynamic Throttling: Implement an "adaptive sampling" mechanism at the agent level to drop 90% of DEBUG logs during high-traffic spikes or when the queue is backing up.

Indexing Pre-aggregation: Use stream processing to pre-calculate metrics (e.g., error counts per service) before indexing, reducing the search load for common dashboard queries.

Zero-Copy Ingestion: Optimize the log agent to use sendfile() or similar zero-copy primitives to minimize CPU overhead on host machines.

Schema-on-Read vs. Schema-on-Write: Support flexible JSON indexing while enforcing a "Common Schema" (ECS) for critical fields like trace_id and timestamp to enable cross-service correlation.

Design Breakdown

Functional Requirements

Core Use Cases:

Centralized log ingestion from multiple services/hosts.

Full-text search and filtering via a UI/API.

Long-term archival for compliance.

Real-time tailing of log streams.

Scope Control:

In-scope: Ingestion, buffering, indexing, search, and archival.

Out-of-scope: Real-time alerting/anomaly detection (separate system), Tracing/APM (handled by Zipkin/Jaeger).

Non-Functional Requirements

Scale: Support 1M+ events per second.

Latency: Ingestion-to-search lag < 10s; Search queries < 2s for 24h range.

Availability & Reliability: 99.9% availability. Logs must not be lost once acknowledged by the ingestion API (Durability).

Consistency: Eventual consistency for search; Strong durability for archival.

Fault Tolerance: Handle regional or availability zone failures in the ingestion pipeline.

Estimation

Traffic:

1M events/sec * 86,400s

\approx

86 Billion events/day.

Avg event size: 1 KB (including metadata).

Total Daily Data: ~86 TB/day.

Storage:

Searchable (7 days): 86 TB * 7

\approx

600 TB (uncompressed). With indexing overhead and replication, ~1.2 PB.

Archival (30 days): 86 TB * 30

\approx

2.6 PB. Compressed (3:1), ~900 TB.

Bandwidth:

Ingress: ~1 GB/sec (Continuous).

Egress: Variable based on search queries, likely much lower.

Blueprint

The design follows a "Collect-Buffer-Index-Store" architecture. It uses a lightweight agent on host machines to minimize resource usage, Kafka to protect the system from ingestion spikes, and OpenSearch for high-performance searching.

Log Agent: Lightweight sidecar (FluentBit) that gathers logs locally.

Ingestion Service: Stateless API that receives logs and writes to Kafka.

Message Bus (Kafka): High-throughput persistent buffer.

Log Indexer: Consumer group that transforms and pushes logs to the search engine.

Search Engine (OpenSearch): Distributed text search and storage.

Object Storage (S3): Low-cost cold storage for raw logs.

Simplicity Audit: This design avoids complex stream processing frameworks (like Flink) in the MVP stage, using a simple consumer group for indexing.

Architecture Decision Rationale:

Kafka is chosen because it allows independent scaling of the producers (applications) and consumers (indexers).

OpenSearch is the industry standard for high-volume log search due to its inverted index.

S3 provides virtually infinite durability and low cost for the 90% of logs that are never searched but must be kept.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Ingestion Service: Stateless Go/Rust service deployed in an Auto-Scaling Group (ASG) based on CPU/Request count.

Log Indexer: Containerized consumers scaling based on Kafka consumer lag.

API Schema Design:

POST /v1/logs

Protocol: gRPC (for high-efficiency ingestion) or REST/JSON.

Request: [{ "timestamp": long, "level": string, "service": string, "message": string, "metadata": map }]

Idempotency: Client-generated log_id for deduplication.

Resilience:

Backpressure: If Kafka is full, the Ingestion Service returns HTTP 429 (Too Many Requests), signaling the Log Agent to buffer on disk.

Retries: Log Agents use exponential backoff with jitter when sending to the Ingestion Service.

Storage

Access Pattern:

99% Write, 1% Read.

Search queries are mostly for the last 1-6 hours of data.

Database Table Design (OpenSearch Indices):

Index Pattern: logs-{service_id}-{yyyy-mm-dd}.

Mapping: timestamp (date), level (keyword), message (text), trace_id (keyword).

Technical Selection:

OpenSearch: Best for full-text search.

S3: Object storage for partitioned Parquet files (partitioned by dt=YYYY-MM-DD/hh=HH/service=...).

Distribution Logic:

Sharding: OpenSearch indices are sharded by trace_id or log_id to distribute load.

Hot/Warm Architecture: Recent logs on SSD nodes; older logs moved to HDD nodes before deletion.

Messaging

Purpose: Decouples ingestion from indexing and provides a 24-hour data buffer in case the search cluster is down.

Event Schema: JSON or Protobuf. Includes raw message and enriched metadata (e.g., host IP, cluster ID).

Throughput & Partitioning:

Partitioned by service_id or host_id to ensure log ordering per service instance.

500+ partitions to support high-parallelism indexers.

Failure Handling:

Dead-letter Queue (DLQ): Logs that fail indexing (e.g., due to schema mapping conflicts) are sent to a separate Kafka topic for manual inspection.

Data Processing

Processing Model: Micro-batching. The Indexer pulls a batch of 500-1000 messages from Kafka and performs a bulk write to OpenSearch.

Transformations:

Timestamp normalization.

PII Masking (e.g., scrubbing credit card numbers via Regex).

Geo-IP lookup for IP addresses.

Technical Selection: Custom Go-based worker for low memory footprint and high concurrency.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We prioritize Availability and Partition Tolerance (AP). It is better to have delayed logs in search than to block the application from writing them.

Reliability:

Disk Buffer: The Log Agent (FluentBit) is configured with a local disk buffer. If the network or Ingestion Service is down, logs are saved to the host disk.

Bottleneck Analysis:

OpenSearch Indexing: The primary bottleneck. We use "Index Lifecycle Management" (ILM) to roll over indices based on size (e.g., 50GB per shard) to keep them manageable.

Security:

mTLS for all traffic from Agent to Ingestion Service.

RBAC in OpenSearch to restrict developers to only see logs from their specific team's services.

Distinguishing Insights:

Secondary Indexing: For the S3 archival, we can generate a "Summary Index" (Bloom Filters or Min/Max timestamp files) to allow the Search API to quickly identify which S3 files to pull for historical deep-dives without scanning the whole bucket.