The Question

Monitoring and Logging System

Design a high-scale, end-to-end monitoring and logging system for a distributed microservices environment. The system must support metrics collection for real-time alerting and log aggregation for forensic debugging. Ensure the design can handle at least 1M metrics per second and TB-scale daily log volumes. Address critical concerns regarding ingestion bottlenecks, storage cost optimization, data retention, and the isolation of the monitoring infrastructure from the production environment.

OpenTelemetry

Kafka

VictoriaMetrics

OpenSearch

Grafana

gRPC

mTLS

Protobuf

Questions & Insights

Clarifying Questions

What is the scale of the infrastructure? (Assumed: 10,000 servers, 1,000 microservices, producing ~1M metrics/sec and ~500GB logs/day).

What are the latency requirements for alerting? (Assumed: Critical alerts must be fired within 60 seconds of the event).

What is the required data retention policy? (Assumed: 15 days for raw logs, 30 days for high-resolution metrics, 1 year for down-sampled metrics).

How will data be ingested? (Assumed: Push-based model where agents on hosts send data to a central gateway to simplify firewall/security rules).

Who are the primary users? (Assumed: SREs and Developers using a unified dashboard like Grafana).

Thinking Process

Core Strategy: Separate the "Metrics Path" (structured, numerical, high-frequency) from the "Log Path" (unstructured, high-volume, text-searchable) to optimize for different storage engine strengths (TSDB vs. Search Engine).

Key Points:

Isolation: Monitoring must not fail when the system it monitors fails.

High Write Throughput: Use a messaging buffer (Kafka) to decouple ingestion from storage.

Cardinality Management: Prevent "Cardinality Explosions" in metrics that can crash a TSDB.

Query Efficiency: Implement down-sampling for long-term metric trends.

Bonus Points

Log Sampling & Dynamic Levels: Implement a mechanism to dynamically change log levels (e.g., INFO to DEBUG) across the fleet without restarts or to sample 1% of successful logs while keeping 100% of errors.

Backpressure Handling: If the storage layer lags, the ingestion gateway should drop non-critical logs first to protect metric delivery for alerting.

T-Digest for Percentiles: Use T-Digest or HdrHistogram algorithms for calculating accurate P99 latencies across distributed nodes rather than simple averaging.

Cost-Optimized Storage: Utilizing Object Storage (S3) with an indexing layer (like Loki or specialized TSDB blocks) to reduce the cost of long-term log retention by 10x compared to SSD-backed clusters.

Design Breakdown

Functional Requirements

Core Use Cases:

Collect and store system-level (CPU/RAM) and application-level (Request Count/Latency) metrics.

Collect, index, and search application logs.

Define and trigger alerts based on threshold violations or heartbeat failures.

Visualize data via interactive dashboards.

Scope Control:

In-scope: Real-time metrics, log aggregation, alerting, and dashboarding.

Out-of-scope: Distributed tracing (e.g., Jaeger/Zipkin), RUM (Real User Monitoring), and automated incident remediation.

Non-Functional Requirements

Scale: Must handle 1M+ data points per second and bursty log traffic.

Latency: Sub-second ingestion for metrics; < 5s for log searchability.

Availability & Reliability: 99.9% availability; the monitoring system should be deployed in a separate failure domain (management VPC).

Consistency: Eventual consistency is acceptable for logs; metrics should prioritize availability over strict consistency.

Security: Encrypted transit (TLS), RBAC for data access, and PII masking in logs.

Estimation

Metrics Traffic: 1M metrics/sec * 8 bytes/sample ≈ 8 MB/sec raw. With metadata/tags: ~40 MB/sec.

Log Traffic: 500 GB/day ≈ 6 MB/sec average. Peak (10x) ≈ 60 MB/sec.

Storage (Logs): 500 GB/day * 15 days = 7.5 TB.

Storage (Metrics): 40 MB/s * 86400s ≈ 3.4 TB/day (Raw). With compression (10:1) and down-sampling: ~10 TB for 30 days.

Blueprint

Concise Summary: A unified agent (OpenTelemetry) collects data and pushes it to an Ingestion Gateway. Metrics flow into a Time-Series Database (TSDB) for fast alerting, while logs are buffered in Kafka and indexed in a Search Engine (OpenSearch) for debugging.

Major Components:

Unified Agent: Lightweight sidecar or daemonset on every node to collect logs and metrics.

Ingestion Gateway: Stateless service for auth, rate-limiting, and routing traffic.

Kafka Buffer: Decouples the bursty log ingestion from the heavy indexing process.

TSDB (VictoriaMetrics/Prometheus): High-performance storage for numerical data.

Search Engine (OpenSearch): Distributed text indexer for log search.

Alert Manager: Evaluates rules against the TSDB and sends notifications.

Simplicity Audit: This design avoids complex stream processing (like Spark) for the MVP, relying on the native alerting capabilities of the TSDB and Search Engine.

Architecture Decision Rationale:

Why this architecture?: Decoupling metrics and logs allows for independent scaling. Logs are heavy on IO/Disk; metrics are heavy on Memory/CPU.

Functional Satisfaction: Meets the need for both "What is happening?" (Metrics) and "Why is it happening?" (Logs).

Non-functional Satisfaction: Kafka provides a safety buffer during traffic spikes, ensuring no data loss during storage maintenance.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Ingestion Gateway is a stateless Go-based service deployed in an Auto Scaling Group (ASG) across multiple Availability Zones.

API Schema Design:

Endpoint: POST /v1/ingest

Protocol: gRPC (for high-throughput agent-to-server communication).

Schema: Protobuf containing metadata (host, env), metrics[], and logs[].

Resilience: Agents use a local disk buffer (disk-assisted queues) to survive temporary network outages between the node and the gateway.

Storage

Access Pattern: Metrics are 95% writes; logs are 99% writes. Queries are infrequent but must be fast.

Technical Selection:

Metrics: VictoriaMetrics or Prometheus. VictoriaMetrics is preferred for MVP due to better vertical/horizontal scaling and high compression.

Logs: OpenSearch. Provides robust full-text indexing and is the industry standard for log analysis.

Distribution Logic:

TSDB: Sharded by metric name + tags (hash-based).

OpenSearch: Time-based indices (e.g., one index per day) to allow easy deletion of old data by dropping the index.

Reliability: Multi-AZ replication (3 replicas for OpenSearch, 2 for TSDB).

Messaging

Purpose & Decoupling: Kafka acts as a write-ahead log for logs. Indexing in OpenSearch is expensive; Kafka allows the "Log Indexer" to pull data at its own pace.

Throughput & Partitioning: Partitioned by ServiceID to ensure log ordering for a specific service while allowing horizontal scaling of indexer consumers.

Technical Selection: Kafka. High throughput and reliable retention.

Data Processing

Processing Model: A "Log Indexer" service (Multi-process) reads from Kafka.

Tasks:

Transformation: Parse JSON/Regex.

Enrichment: Add GeoIP or Service metadata.

Filtering: Drop "INFO" logs for non-production environments to save costs.

Technical Selection: Vector.dev or custom Go workers for high performance.

Infrastructure (Optional)

Observability: The monitoring system monitors itself (meta-monitoring) using a separate, smaller instance of the same stack or a managed service (SaaS) to avoid circular dependencies.

Wrap Up

Advanced Topics

Trade-offs: We chose a "Push" model for the agents. Alternative: A "Pull" model (like standard Prometheus) is harder to manage in dynamic cloud environments with many short-lived containers. Push is easier for the MVP.

Reliability: We use Dead Letter Queues (DLQ) in the Log Indexer. If a log message is malformed and cannot be indexed, it's moved to S3 for manual inspection rather than blocking the pipeline.

Bottleneck Analysis: The primary bottleneck is the OpenSearch indexing rate. We mitigate this by using "Bulk Ingest" APIs and tuning the refresh_interval.

Security: PII scrubbing is performed at the "Unified Agent" level (the edge) to ensure sensitive data (e.g., credit card numbers) never leaves the host.

Distinguishing Insights: To handle Cardinality Explosions (e.g., a developer puts a UserID in a metric tag), the Ingestion Gateway should have a "top-K" analyzer that automatically alerts SREs when a single metric starts producing an abnormal number of unique series, allowing them to block that metric before it crashes the TSDB.