DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Distributed Tracing System Design

Design a high-scale distributed tracing system for a global microservices architecture. The system must support tracing requests across thousands of services, handle billions of spans daily with minimal application-side overhead, and provide sub-minute visibility for root-cause analysis. Address how you would handle data ingestion bursts, efficient storage for search, and the trade-offs between different sampling strategies to manage cost and observability value.
OpenTelemetry
Kafka
ElasticSearch
gRPC
B3 Propagation
W3C Trace Context
Sidecar Pattern
NoSQL
Questions & Insights

Clarifying Questions

What is the expected scale (Spans per second or Requests per second)?
Assumption: The system should handle 1M requests/sec at peak, with an average of 10 spans per request (10M spans/sec).
What is the acceptable performance overhead on the instrumented services?
Assumption: Maximum 1-2ms overhead per request; spans must be reported asynchronously to avoid blocking the hot path.
What is the data retention policy for trace data?
Assumption: 7 days of full trace data for debugging, with aggregated metrics kept longer.
Do we need tail-based sampling (sampling after the trace is complete) or head-based sampling (sampling at the start)?
Assumption: Head-based sampling for the MVP to minimize complexity, with a 1-5% sampling rate.
Is multi-tenancy a requirement?
Assumption: Not for the MVP; the system will serve a single large-scale microservice ecosystem.

Thinking Process

Core Bottleneck: The sheer volume of telemetry data can overwhelm storage and network.
Key Strategy: Use asynchronous sidecar agents for collection, a distributed message bus for backpressure, and aggressive head-based sampling to reduce the "firehose" to a manageable stream.
Progressive Logic:
How do we uniquely identify a request across services? (Distributed Context Propagation via TraceID).
How do we collect data without killing application performance? (OTel Agent sidecars + Async UDP/gRPC).
How do we handle massive write bursts to the database? (Kafka as a buffer).
How do we make the data searchable for root-cause analysis? (ElasticSearch with Time-Series-Data-Stream).

Bonus Points

Tail-based Sampling: Implementing a buffer (e.g., in the OTel Collector) to keep all spans for a trace until it completes, then deciding to save it if it contains an error or high latency (>99th percentile).
Adaptive Sampling: Dynamically adjusting sampling rates based on service health or traffic volume to keep the ingestion rate constant.
PII Scrubbing at the Edge: Using OTel Collector processors to automatically redact sensitive information (e.g., credit card numbers in span attributes) before it leaves the service network.
Query Optimization: Using "Searchable Snapshots" in ElasticSearch to reduce costs for older data while keeping it queryable.
Design Breakdown

Functional Requirements

Core Use Cases:
Track a single request flow across multiple heterogeneous microservices.
Visualize trace spans in a timeline (Gantt chart).
Search traces by specific tags (e.g., user_id, http.status_code, service_name).
Correlate logs and metrics with specific Trace IDs.
Scope Control:
In-scope: Instrumentation, collection, ingestion pipeline, storage, and basic UI query.
Out-of-scope: Automatic anomaly detection, automated root cause remediation, and long-term cold storage (Glacier/S3).

Non-Functional Requirements

Scale: Support 10M spans/sec ingestion.
Latency: End-to-end trace visibility within 1 minute (near real-time). Minimal application-side latency (<1ms).
Availability: 99.9% availability. Tracing system failure must never cause the primary application to fail (fail-open).
Consistency: Eventual consistency is acceptable; spans might arrive out of order.
Security: Encrypted transit (TLS) and Role-Based Access Control (RBAC) for viewing traces.

Estimation

Traffic: 10M spans/sec.
Sampling (1%): 100k spans/sec recorded.
Storage:
1 Span \approx 1KB.
100k spans/sec \times 1KB = 100MB/s (800Mbps) ingestion.
Daily storage: 100MB/s \times 86,400s \approx 8.6 TB/day.
7-day retention: ~60 TB (uncompressed). With ES overhead/replicas: ~120 TB.
Bandwidth: 100MB/s incoming to the Kafka cluster.

Blueprint

Concise Summary: A sidecar-based collection system that uses OpenTelemetry for instrumentation, Kafka for ingestion buffering, and ElasticSearch for indexed searching of spans.
Major Components:
OpenTelemetry SDK: Embedded in microservices to generate spans and propagate context headers (B3/W3C).
OTel Agent (Sidecar): Local collector that receives spans from the app via localhost to minimize network overhead.
Kafka Cluster: Acts as a high-throughput buffer to decouple data collection from storage indexing.
Trace Ingester: A scalable consumer group that processes spans from Kafka and writes them to ElasticSearch.
ElasticSearch: The primary storage and search engine for span data.
Simplicity Audit: This design avoids complex tail-based sampling logic in the MVP, relying on simple head-based sampling to control volume. It uses industry-standard OTel components to reduce custom code.
Architecture Decision Rationale:
Kafka: Essential because ElasticSearch cannot handle the massive, spiky write volume of raw spans without a buffer.
ElasticSearch: Best-in-class for full-text search on span tags and attributes.
OpenTelemetry: Provides a vendor-agnostic standard, ensuring the system can support any programming language.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not majorly impactful for the tracing backend itself, but the Query UI should be behind a standard Load Balancer.
Security & Perimeter:
API Gateway: Must ensure traceparent headers are stripped or sanitized from external client requests to prevent "trace injection" attacks.
Authentication: OTel Collector should require an API Key or mTLS from the agents to prevent rogue data ingestion.

Service

Topology & Scaling:
OTel Agent: Deployed as a DaemonSet (Kubernetes) or Sidecar. Scales 1:1 with compute nodes.
OTel Collector: Stateless deployment in K8s, scaled horizontally based on CPU/Memory usage.
API Schema Design (Internal):
Span Export (gRPC):
Endpoint: Export(ExportTraceServiceRequest)
Protocol: OTLP (OpenTelemetry Line Protocol) over gRPC.
Idempotency: Not strictly required for telemetry.
Resilience:
Circuit Breaker: If the Agent/Collector is overloaded, the SDK should drop spans locally rather than slowing down the application.
Load Leveling: Kafka handles spikes in span generation (e.g., during a deployment or an incident).

Storage

Access Pattern: Write-heavy (100k writes/sec). Reads are infrequent but require low latency for debugging.
Database Table Design (ElasticSearch Index Mapping):
trace_id: Keyword (Index for lookup).
span_id: Keyword.
parent_span_id: Keyword.
operation_name: Keyword.
start_time: Date (Nanosecond precision).
duration_nano: Long.
tags: Flattened/Nested object for searchability (e.g., http.method, service.name).
Technical Selection: ElasticSearch/OpenSearch.
Rationale: Support for high-volume indexing and complex filtering on tags.
Distribution Logic:
Sharding: Index per day (e.g., traces-2023-10-27).
Rollover Policy: Use ILM (Index Lifecycle Management) to delete indices after 7 days.
Sharding Key: trace_id (ensures all spans for a trace are likely in the same shard if routing is used, though often time-based is preferred for logs).

Messaging

Purpose: Buffering and Decoupling.
Event Schema: OTLP Protobuf encoded spans.
Throughput & Partitioning:
Partition by trace_id to ensure ordering of spans within a trace for any downstream stateful processing (like tail-sampling in the future).
Technical Selection: Kafka.
Rationale: High throughput, durable retention, and excellent support for consumer groups.

Data Processing

Processing Model: The Trace Ingester is a simple stream processor.
Processing DAG: Kafka Source -> Batching -> ES Sink.
Correctness: "At-least-once" delivery. Duplicate spans are naturally handled by ES using the span_id as the document ID.
Technical Selection: Go-based custom consumer or OTel Collector Export.
Rationale: High performance and low memory footprint for simple IO-bound tasks.

Infrastructure (Optional)

Observability: The tracing system must trace itself (Meta-tracing). Monitor Kafka lag and ES indexing latency.
Distributed Coordination: Not required for simple ingestion, but Kafka handles its own partition balancing.
Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability/Partition Tolerance over Consistency. If a span is lost, it's better than the application crashing.
Optimization (Sampling):
Head-based: Sample at the entry point (API Gateway). The decision is propagated via the sampled flag in the trace header.
Cost Optimization: Store only "Summary" metrics (Aggregated duration/error counts) for 100% of traffic, but only full traces for 1%.
Bottleneck Analysis:
ES Indexing: If ES hits a limit, Kafka lag will increase. We scale ES by adding data nodes and increasing shard counts.
Hot Shards: Avoid using service_name as a routing key; stick to trace_id for even distribution.
Distinguishing Insights:
Context Propagation: Use the W3C Trace Context standard to ensure interoperability with third-party vendors or legacy systems.
Clock Skew: Implement logic in the UI to normalize span start times across services with drifting system clocks using parent-child relationship heuristics.