DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Centralized Logging and Observability Platform

Design a high-scale centralized logging system capable of ingesting 100 TB of data per day from thousands of microservices. The system must support real-time full-text search for the most recent 7 days of logs and provide cost-effective archival for up to 30 days. Address how the system handles extreme traffic spikes, ensures minimal impact on application performance, and manages the operational complexity of large-scale indexing. Discuss trade-offs between ingestion latency, search performance, and storage costs.
Kafka
OpenSearch
S3
FluentBit
gRPC
Protobuf
Go
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected log volume per day and the peak ingestion rate (Events Per Second)?
Search and Query Requirements: Do we need full-text search capability, or is simple filtering by metadata (service ID, timestamp, level) sufficient?
Retention Policy: How long should logs be kept in "hot" (searchable) storage vs. "cold" (archival) storage?
Data Format: Are logs strictly structured (JSON) or can they be unstructured/plain text?
Assumptions for this design:
Scale: 100 TB of logs per day (~1.1 GB/sec avg). Peak ingestion of 2M events/sec.
Query: Full-text search and filtering are required for debugging.
Retention: 7 days searchable, 30 days archival.
Consistency: Eventual consistency for search is acceptable (ingestion-to-search lag < 10 seconds).

Thinking Process

To build a resilient and scalable logging system, we focus on decoupling ingestion from processing to handle spikes without data loss.
How do we ingest logs without impacting application performance? Use a local log agent (sidecar or daemon) to buffer logs and send them asynchronously to a central collector.
How do we handle massive write bursts? Implement a distributed message queue (Kafka) as a write buffer to decouple the high-velocity ingestion from the slower indexing process.
How do we ensure cost-effective storage? Use a tiered storage approach—indexed storage (OpenSearch) for recent logs and object storage (S3) for long-term archives.
How do we ensure search performance? Shard the search index by time (e.g., daily indices) to keep active indices small and fast.

Bonus Points

Log Sampling & Dynamic Throttling: Implement an "adaptive sampling" mechanism at the agent level to drop 90% of DEBUG logs during high-traffic spikes or when the queue is backing up.
Indexing Pre-aggregation: Use stream processing to pre-calculate metrics (e.g., error counts per service) before indexing, reducing the search load for common dashboard queries.
Zero-Copy Ingestion: Optimize the log agent to use sendfile() or similar zero-copy primitives to minimize CPU overhead on host machines.
Schema-on-Read vs. Schema-on-Write: Support flexible JSON indexing while enforcing a "Common Schema" (ECS) for critical fields like trace_id and timestamp to enable cross-service correlation.
Design Breakdown

Functional Requirements

Core Use Cases:
Centralized log ingestion from multiple services/hosts.
Full-text search and filtering via a UI/API.
Long-term archival for compliance.
Real-time tailing of log streams.
Scope Control:
In-scope: Ingestion, buffering, indexing, search, and archival.
Out-of-scope: Real-time alerting/anomaly detection (separate system), Tracing/APM (handled by Zipkin/Jaeger).

Non-Functional Requirements

Scale: Support 1M+ events per second.
Latency: Ingestion-to-search lag < 10s; Search queries < 2s for 24h range.
Availability & Reliability: 99.9% availability. Logs must not be lost once acknowledged by the ingestion API (Durability).
Consistency: Eventual consistency for search; Strong durability for archival.
Fault Tolerance: Handle regional or availability zone failures in the ingestion pipeline.

Estimation

Traffic:
1M events/sec * 86,400s \approx 86 Billion events/day.
Avg event size: 1 KB (including metadata).
Total Daily Data: ~86 TB/day.
Storage:
Searchable (7 days): 86 TB * 7 \approx 600 TB (uncompressed). With indexing overhead and replication, ~1.2 PB.
Archival (30 days): 86 TB * 30 \approx 2.6 PB. Compressed (3:1), ~900 TB.
Bandwidth:
Ingress: ~1 GB/sec (Continuous).
Egress: Variable based on search queries, likely much lower.

Blueprint

The design follows a "Collect-Buffer-Index-Store" architecture. It uses a lightweight agent on host machines to minimize resource usage, Kafka to protect the system from ingestion spikes, and OpenSearch for high-performance searching.
Log Agent: Lightweight sidecar (FluentBit) that gathers logs locally.
Ingestion Service: Stateless API that receives logs and writes to Kafka.
Message Bus (Kafka): High-throughput persistent buffer.
Log Indexer: Consumer group that transforms and pushes logs to the search engine.
Search Engine (OpenSearch): Distributed text search and storage.
Object Storage (S3): Low-cost cold storage for raw logs.
Simplicity Audit: This design avoids complex stream processing frameworks (like Flink) in the MVP stage, using a simple consumer group for indexing.
Architecture Decision Rationale:
Kafka is chosen because it allows independent scaling of the producers (applications) and consumers (indexers).
OpenSearch is the industry standard for high-volume log search due to its inverted index.
S3 provides virtually infinite durability and low cost for the 90% of logs that are never searched but must be kept.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Ingestion Service: Stateless Go/Rust service deployed in an Auto-Scaling Group (ASG) based on CPU/Request count.
Log Indexer: Containerized consumers scaling based on Kafka consumer lag.
API Schema Design:
POST /v1/logs
Protocol: gRPC (for high-efficiency ingestion) or REST/JSON.
Request: [{ "timestamp": long, "level": string, "service": string, "message": string, "metadata": map }]
Idempotency: Client-generated log_id for deduplication.
Resilience:
Backpressure: If Kafka is full, the Ingestion Service returns HTTP 429 (Too Many Requests), signaling the Log Agent to buffer on disk.
Retries: Log Agents use exponential backoff with jitter when sending to the Ingestion Service.

Storage

Access Pattern:
99% Write, 1% Read.
Search queries are mostly for the last 1-6 hours of data.
Database Table Design (OpenSearch Indices):
Index Pattern: logs-{service_id}-{yyyy-mm-dd}.
Mapping: timestamp (date), level (keyword), message (text), trace_id (keyword).
Technical Selection:
OpenSearch: Best for full-text search.
S3: Object storage for partitioned Parquet files (partitioned by dt=YYYY-MM-DD/hh=HH/service=...).
Distribution Logic:
Sharding: OpenSearch indices are sharded by trace_id or log_id to distribute load.
Hot/Warm Architecture: Recent logs on SSD nodes; older logs moved to HDD nodes before deletion.

Messaging

Purpose: Decouples ingestion from indexing and provides a 24-hour data buffer in case the search cluster is down.
Event Schema: JSON or Protobuf. Includes raw message and enriched metadata (e.g., host IP, cluster ID).
Throughput & Partitioning:
Partitioned by service_id or host_id to ensure log ordering per service instance.
500+ partitions to support high-parallelism indexers.
Failure Handling:
Dead-letter Queue (DLQ): Logs that fail indexing (e.g., due to schema mapping conflicts) are sent to a separate Kafka topic for manual inspection.

Data Processing

Processing Model: Micro-batching. The Indexer pulls a batch of 500-1000 messages from Kafka and performs a bulk write to OpenSearch.
Transformations:
Timestamp normalization.
PII Masking (e.g., scrubbing credit card numbers via Regex).
Geo-IP lookup for IP addresses.
Technical Selection: Custom Go-based worker for low memory footprint and high concurrency.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We prioritize Availability and Partition Tolerance (AP). It is better to have delayed logs in search than to block the application from writing them.
Reliability:
Disk Buffer: The Log Agent (FluentBit) is configured with a local disk buffer. If the network or Ingestion Service is down, logs are saved to the host disk.
Bottleneck Analysis:
OpenSearch Indexing: The primary bottleneck. We use "Index Lifecycle Management" (ILM) to roll over indices based on size (e.g., 50GB per shard) to keep them manageable.
Security:
mTLS for all traffic from Agent to Ingestion Service.
RBAC in OpenSearch to restrict developers to only see logs from their specific team's services.
Distinguishing Insights:
Secondary Indexing: For the S3 archival, we can generate a "Summary Index" (Bloom Filters or Min/Max timestamp files) to allow the Search API to quickly identify which S3 files to pull for historical deep-dives without scanning the whole bucket.