The Question
Distributed Tracing System Design
Design a high-scale distributed tracing system for a global microservices architecture. The system must support tracing requests across thousands of services, handle billions of spans daily with minimal application-side overhead, and provide sub-minute visibility for root-cause analysis. Address how you would handle data ingestion bursts, efficient storage for search, and the trade-offs between different sampling strategies to manage cost and observability value.
OpenTelemetry
Kafka
ElasticSearch
gRPC
B3 Propagation
W3C Trace Context
Sidecar Pattern
NoSQL
Questions & Insights
Clarifying Questions
What is the expected scale (Spans per second or Requests per second)?
Assumption: The system should handle 1M requests/sec at peak, with an average of 10 spans per request (10M spans/sec).
What is the acceptable performance overhead on the instrumented services?
Assumption: Maximum 1-2ms overhead per request; spans must be reported asynchronously to avoid blocking the hot path.
What is the data retention policy for trace data?
Assumption: 7 days of full trace data for debugging, with aggregated metrics kept longer.
Do we need tail-based sampling (sampling after the trace is complete) or head-based sampling (sampling at the start)?
Assumption: Head-based sampling for the MVP to minimize complexity, with a 1-5% sampling rate.
Is multi-tenancy a requirement?
Assumption: Not for the MVP; the system will serve a single large-scale microservice ecosystem.
Thinking Process
Core Bottleneck: The sheer volume of telemetry data can overwhelm storage and network.
Key Strategy: Use asynchronous sidecar agents for collection, a distributed message bus for backpressure, and aggressive head-based sampling to reduce the "firehose" to a manageable stream.
Progressive Logic:
How do we uniquely identify a request across services? (Distributed Context Propagation via TraceID).
How do we collect data without killing application performance? (OTel Agent sidecars + Async UDP/gRPC).
How do we handle massive write bursts to the database? (Kafka as a buffer).
How do we make the data searchable for root-cause analysis? (ElasticSearch with Time-Series-Data-Stream).
Bonus Points
Tail-based Sampling: Implementing a buffer (e.g., in the OTel Collector) to keep all spans for a trace until it completes, then deciding to save it if it contains an error or high latency (>99th percentile).
Adaptive Sampling: Dynamically adjusting sampling rates based on service health or traffic volume to keep the ingestion rate constant.
PII Scrubbing at the Edge: Using OTel Collector processors to automatically redact sensitive information (e.g., credit card numbers in span attributes) before it leaves the service network.
Query Optimization: Using "Searchable Snapshots" in ElasticSearch to reduce costs for older data while keeping it queryable.
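As a rough illustration of the tail-based sampling idea above, the sketch below buffers spans per trace and makes the keep/drop decision once the root span arrives. The Span and TailSampler types are hypothetical, the sketch assumes the root span finishes last, and a production collector would also need a timeout to evict traces that never complete.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    duration_ms: float
    is_error: bool = False
    is_root: bool = False


class TailSampler:
    """Keep a whole trace only if it contains an error or its root span is slow."""

    def __init__(self, latency_threshold_ms: float):
        self.latency_threshold_ms = latency_threshold_ms
        self._pending: dict[str, list[Span]] = defaultdict(list)

    def add(self, span: Span) -> Optional[list[Span]]:
        """Return None while the trace is incomplete, else the kept spans ([] if dropped)."""
        self._pending[span.trace_id].append(span)
        if not span.is_root:
            return None
        # Assumption: the root span arriving means the trace is complete.
        spans = self._pending.pop(span.trace_id)
        has_error = any(s.is_error for s in spans)
        is_slow = span.duration_ms > self.latency_threshold_ms
        return spans if (has_error or is_slow) else []
```

The memory cost is the reason this is deferred past the MVP: every in-flight trace must be held in full until the decision is made.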
Design Breakdown
Functional Requirements
Core Use Cases:
Track a single request flow across multiple heterogeneous microservices.
Visualize trace spans in a timeline (Gantt chart).
Search traces by specific tags (e.g., user_id, http.status_code, service_name).
Correlate logs and metrics with specific Trace IDs.
Scope Control:
In-scope: Instrumentation, collection, ingestion pipeline, storage, and basic UI query.
Out-of-scope: Automatic anomaly detection, automated root cause remediation, and long-term cold storage (Glacier/S3).
Non-Functional Requirements
Scale: Support 10M spans/sec ingestion.
Latency: End-to-end trace visibility within 1 minute (near real-time). Minimal application-side latency (<1ms).
Availability: 99.9% availability. Tracing system failure must never cause the primary application to fail (fail-open).
Consistency: Eventual consistency is acceptable; spans might arrive out of order.
Security: Encrypted transit (TLS) and Role-Based Access Control (RBAC) for viewing traces.
Estimation
Traffic: 10M spans/sec.
Sampling (1%): 100k spans/sec recorded.
Storage:
1 span ≈ 1 KB.
100k spans/sec × 1 KB = 100 MB/s (800 Mbps) ingestion.
Daily storage: 100 MB/s × 86,400 s ≈ 8.6 TB/day.
7-day retention: ~60 TB (uncompressed). With ES overhead/replicas: ~120 TB.
Bandwidth: 100MB/s incoming to the Kafka cluster.
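The arithmetic above can be checked with a short script. It uses decimal units (1 KB = 1,000 bytes) and treats the "ES overhead/replicas" multiplier as a flat factor of 2, which is an assumption read off the 60 TB to 120 TB step; the function and key names are illustrative.

```python
def estimate(spans_per_sec=10_000_000, sample_percent=1, span_bytes=1_000,
             retention_days=7, overhead_factor=2):
    """Back-of-the-envelope ingestion and storage numbers for sampled spans."""
    sampled = spans_per_sec * sample_percent // 100   # spans/sec actually recorded
    bytes_per_sec = sampled * span_bytes
    daily_bytes = bytes_per_sec * 86_400              # seconds per day
    retained_bytes = daily_bytes * retention_days
    return {
        "sampled_spans_per_sec": sampled,
        "ingest_mb_per_sec": bytes_per_sec / 1e6,
        "ingest_mbps": bytes_per_sec * 8 / 1e6,
        "daily_tb": daily_bytes / 1e12,
        "retained_tb": retained_bytes / 1e12,
        # Assumption: replicas plus index overhead roughly double the footprint.
        "provisioned_tb": retained_bytes * overhead_factor / 1e12,
    }
```

With the defaults this gives 100,000 spans/sec, 100 MB/s (800 Mbps), 8.64 TB/day, 60.48 TB retained, and about 121 TB provisioned, in line with the rounded figures above.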
Blueprint
Concise Summary: A sidecar-based collection system that uses OpenTelemetry for instrumentation, Kafka for ingestion buffering, and ElasticSearch for indexed searching of spans.
Major Components:
OpenTelemetry SDK: Embedded in microservices to generate spans and propagate context headers (B3/W3C).
OTel Agent (Sidecar): Local collector that receives spans from the app via localhost to minimize network overhead.
Kafka Cluster: Acts as a high-throughput buffer to decouple data collection from storage indexing.
Trace Ingester: A scalable consumer group that processes spans from Kafka and writes them to ElasticSearch.
ElasticSearch: The primary storage and search engine for span data.
Simplicity Audit: This design avoids complex tail-based sampling logic in the MVP, relying on simple head-based sampling to control volume. It uses industry-standard OTel components to reduce custom code.
Architecture Decision Rationale:
Kafka: Essential because ElasticSearch cannot handle the massive, spiky write volume of raw spans without a buffer.
ElasticSearch: Best-in-class for full-text search on span tags and attributes.
OpenTelemetry: Provides a vendor-agnostic standard, ensuring the system can support any programming language.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Not majorly impactful for the tracing backend itself, but the Query UI should be behind a standard Load Balancer.
Security & Perimeter:
API Gateway: Must ensure traceparent headers are stripped or sanitized from external client requests to prevent "trace injection" attacks.
Authentication: OTel Collector should require an API Key or mTLS from the agents to prevent rogue data ingestion.
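One way the gateway-side stripping could look is sketched below. The function name sanitize_inbound and the trusted flag are hypothetical; the header names are the standard W3C Trace Context and B3 ones. How a caller is classified as trusted is left out and would depend on the gateway in use.

```python
# Header names that carry trace context (W3C Trace Context and B3).
TRACE_HEADERS = frozenset({
    "traceparent", "tracestate",
    "b3", "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags",
})


def sanitize_inbound(headers: dict[str, str], trusted: bool) -> dict[str, str]:
    """Drop caller-supplied trace context unless the caller is trusted.

    HTTP header names are case-insensitive, so the comparison is done on the
    lower-cased name. The input dict is never mutated.
    """
    if trusted:
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in TRACE_HEADERS}
```

After stripping, the gateway would start a fresh trace itself, so external clients cannot force sampling on or attach spans to someone else's trace.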
Service
Topology & Scaling:
OTel Agent: Deployed as a DaemonSet (Kubernetes) or Sidecar. A DaemonSet scales 1:1 with compute nodes; a sidecar scales 1:1 with application pods.
OTel Collector: Stateless deployment in K8s, scaled horizontally based on CPU/Memory usage.
API Schema Design (Internal):
Span Export (gRPC):
Endpoint: Export(ExportTraceServiceRequest)
Protocol: OTLP (OpenTelemetry Protocol) over gRPC.
Idempotency: Not strictly required for telemetry.
Resilience:
Circuit Breaker: If the Agent/Collector is overloaded, the SDK should drop spans locally rather than slowing down the application.
Load Leveling: Kafka handles spikes in span generation (e.g., during a deployment or an incident).
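The fail-open behavior described under Resilience can be sketched as a bounded in-process buffer that drops spans instead of blocking the request thread. SpanBuffer is a hypothetical name; OpenTelemetry SDKs provide their own batching span processors with similar drop-on-full semantics, so this is only meant to show the principle.

```python
import queue


class SpanBuffer:
    """Bounded buffer between the request path and a background exporter."""

    def __init__(self, max_size: int):
        self._queue = queue.Queue(maxsize=max_size)
        self.dropped = 0  # exposed so the drop rate can be reported as a metric

    def offer(self, span) -> bool:
        """Enqueue without blocking; drop the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Called by the exporter thread: take up to max_batch spans."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

Because offer never blocks, a slow or unavailable agent costs the application at most a failed enqueue, and the dropped counter makes the data loss visible.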
Storage
Access Pattern: Write-heavy (100k writes/sec). Reads are infrequent but require low latency for debugging.
Database Table Design (ElasticSearch Index Mapping):
trace_id: Keyword (Index for lookup).
span_id: Keyword.
parent_span_id: Keyword.
operation_name: Keyword.
start_time: Date (Nanosecond precision).
duration_nano: Long.
tags: Flattened/Nested object for searchability (e.g., http.method, service.name).
Technical Selection: ElasticSearch/OpenSearch.
Rationale: Support for high-volume indexing and complex filtering on tags.
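The field list above could translate into an index mapping along these lines, written here as a Python dict that serializes to the JSON request body. The choice of date_nanos and flattened reflects Elasticsearch field types; this is one plausible reading of the design, and OpenSearch may need a different type for the tags field.

```python
import json

# Sketch of a mapping for the span index; field names follow the list above.
SPAN_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "trace_id": {"type": "keyword"},
            "span_id": {"type": "keyword"},
            "parent_span_id": {"type": "keyword"},
            "operation_name": {"type": "keyword"},
            "start_time": {"type": "date_nanos"},  # nanosecond-precision date
            "duration_nano": {"type": "long"},
            # "flattened" indexes arbitrary tag keys without a mapping explosion.
            "tags": {"type": "flattened"},
        }
    }
}


def mapping_json() -> str:
    """Serialize the mapping as it would be sent when creating the index."""
    return json.dumps(SPAN_INDEX_MAPPING, indent=2)
```

Keyword fields support exact-match filters and aggregations, which is what trace lookup by ID and tag search need; none of these fields require analyzed full-text search.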
Distribution Logic:
Sharding: Index per day (e.g., traces-2023-10-27).
Rollover Policy: Use ILM (Index Lifecycle Management) to delete indices after 7 days.
Sharding Key: trace_id (ensures all spans for a trace are likely in the same shard if routing is used, though often time-based is preferred for logs).
Messaging
Purpose: Buffering and Decoupling.
Event Schema: OTLP Protobuf encoded spans.
Throughput & Partitioning:
Partition by trace_id to ensure ordering of spans within a trace for any downstream stateful processing (like tail-sampling in the future).
Technical Selection: Kafka.
Rationale: High throughput, durable retention, and excellent support for consumer groups.
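To show why keying by trace_id keeps a trace together, here is a small sketch of key-based partition selection. It uses SHA-256 purely for illustration; Kafka's own default partitioner uses a different hash, and in practice the producer only needs to set the message key to the trace_id. The function name partition_for is hypothetical.

```python
import hashlib


def partition_for(trace_id: str, num_partitions: int) -> int:
    """Map a trace_id to a partition with a stable hash (illustrative only)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take it modulo the count.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every span of a trace hashes to the same partition, so a single consumer sees them in order. Note that changing the partition count remaps keys, which matters once stateful consumers such as a tail sampler depend on this locality.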
Data Processing
Processing Model: The Trace Ingester is a simple stream processor.
Processing DAG:
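Kafka Source -> Batching -> ES Sink.
Correctness: "At-least-once" delivery. Duplicate spans are handled naturally because ES overwrites a document that is re-indexed with the same ID; use the span_id (or trace_id plus span_id, since span IDs are only guaranteed unique within a trace) as the document ID.
Technical Selection: Go-based custom consumer or OTel Collector exporter.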
Rationale: High performance and low memory footprint for simple IO-bound tasks.
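The batching step and the idempotent write can be sketched as a function that turns a batch of spans into an Elasticsearch _bulk request body. It is written in Python for brevity even though the text suggests a Go consumer. The span dict layout and the start_time_ns key (epoch nanoseconds) are assumptions, and the document ID follows the text in using span_id alone.

```python
import json
from datetime import datetime, timezone


def index_for(start_time_ns: int) -> str:
    """Pick the daily index (e.g. traces-2023-10-27) from a span's start time."""
    day = datetime.fromtimestamp(start_time_ns // 1_000_000_000, tz=timezone.utc)
    return f"traces-{day:%Y-%m-%d}"


def bulk_body(spans: list[dict]) -> str:
    """Build an NDJSON _bulk body: one action line plus one source line per span."""
    if not spans:
        return ""
    lines = []
    for span in spans:
        action = {"index": {
            "_index": index_for(span["start_time_ns"]),
            # Same ID on redelivery means the document is overwritten, not duplicated.
            "_id": span["span_id"],
        }}
        lines.append(json.dumps(action))
        lines.append(json.dumps(span))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline
```

Since the index name is derived from the span's own start time, a redelivered span targets the same index and ID, which is what makes at-least-once delivery safe here.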
Infrastructure (Optional)
Observability: The tracing system must trace itself (Meta-tracing). Monitor Kafka lag and ES indexing latency.
Distributed Coordination: Not required for simple ingestion, but Kafka handles its own partition balancing.
Wrap Up
Advanced Topics
Trade-offs (PACELC): Under a partition we choose Availability over Consistency, and in normal operation we favor low latency over consistency. If a span is lost, it's better than the application crashing.
Optimization (Sampling):
Head-based: Sample at the entry point (API Gateway). The decision is propagated via the sampled flag in the trace header.
Cost Optimization: Store only "Summary" metrics (Aggregated duration/error counts) for 100% of traffic, but only full traces for 1%.
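A head-based decision at the entry point can be made deterministic by deriving it from the trace ID, as in the sketch below. This resembles ratio-based samplers found in tracing SDKs but is not a copy of any particular one; the function names are hypothetical, and the trace ID is assumed to be 32 random hex characters.

```python
def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide once per trace, using the low 64 bits of the trace ID as the dice roll."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    low64 = int(trace_id_hex[-16:], 16)
    # Keep the trace if its value falls below rate * 2^64.
    return low64 < int(rate * 2**64)


def trace_flags(trace_id_hex: str, rate: float) -> str:
    """Value for the flags field of traceparent: '01' = sampled, '00' = not sampled."""
    return "01" if should_sample(trace_id_hex, rate) else "00"
```

Because the result depends only on the trace ID and the rate, any component that re-evaluates it reaches the same answer, and downstream services can simply honor the propagated flag.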
Bottleneck Analysis:
ES Indexing: If ES hits a limit, Kafka lag will increase. We scale ES by adding data nodes and increasing shard counts.
Hot Shards: Avoid using service_name as a routing key; stick to trace_id for even distribution.
Distinguishing Insights:
Context Propagation: Use the W3C Trace Context standard to ensure interoperability with third-party vendors or legacy systems.
Clock Skew: Implement logic in the UI to normalize span start times across services with drifting system clocks using parent-child relationship heuristics.
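One possible parent-child heuristic for the clock-skew point is sketched below: if a child span falls outside its parent's time window, shift it so that it sits centered inside the parent. This is an illustrative heuristic under the assumption that a child normally runs within its parent; the function name and the centering rule are choices made for this sketch, and all times are integers in the same unit (for example microseconds).

```python
def skew_adjustment(parent_start: int, parent_duration: int,
                    child_start: int, child_duration: int) -> int:
    """Return the offset to add to the child's timestamps for display."""
    parent_end = parent_start + parent_duration
    child_end = child_start + child_duration
    if child_duration > parent_duration:
        # The child cannot fit inside the parent (e.g. async work); only make
        # sure it does not appear to start before the parent.
        return max(0, parent_start - child_start)
    if child_start >= parent_start and child_end <= parent_end:
        return 0  # already consistent with the parent, leave untouched
    # Assume equal network latency before and after the child: center it.
    one_way_latency = (parent_duration - child_duration) // 2
    return parent_start + one_way_latency - child_start
```

The adjustment is applied only at render time in the UI, so the stored timestamps remain the raw values reported by each service.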