The Question

Distributed Tracing System Design

Design a high-scale distributed tracing system for a global microservices architecture. The system must support tracing requests across thousands of services, handle billions of spans daily with minimal application-side overhead, and provide sub-minute visibility for root-cause analysis. Address how you would handle data ingestion bursts, efficient storage for search, and the trade-offs between different sampling strategies to manage cost and observability value.

OpenTelemetry

Kafka

ElasticSearch

gRPC

B3 Propagation

W3C Trace Context

Sidecar Pattern

NoSQL

Questions & Insights

Clarifying Questions

What is the expected scale (Spans per second or Requests per second)?

Assumption: The system should handle 1M requests/sec at peak, with an average of 10 spans per request (10M spans/sec).

What is the acceptable performance overhead on the instrumented services?

Assumption: Maximum 1-2ms overhead per request; spans must be reported asynchronously to avoid blocking the hot path.

What is the data retention policy for trace data?

Assumption: 7 days of full trace data for debugging, with aggregated metrics kept longer.

Do we need tail-based sampling (sampling after the trace is complete) or head-based sampling (sampling at the start)?

Assumption: Head-based sampling for the MVP to minimize complexity, with a 1-5% sampling rate.

Is multi-tenancy a requirement?

Assumption: Not for the MVP; the system will serve a single large-scale microservice ecosystem.

Thinking Process

Core Bottleneck: The sheer volume of telemetry data can overwhelm storage and network.

Key Strategy: Use asynchronous sidecar agents for collection, a distributed message bus for backpressure, and aggressive head-based sampling to reduce the "firehose" to a manageable stream.

Progressive Logic:

How do we uniquely identify a request across services? (Distributed Context Propagation via TraceID).

How do we collect data without killing application performance? (OTel Agent sidecars + Async UDP/gRPC).

How do we handle massive write bursts to the database? (Kafka as a buffer).

How do we make the data searchable for root-cause analysis? (ElasticSearch with Time-Series-Data-Stream).

Bonus Points

Tail-based Sampling: Implementing a buffer (e.g., in the OTel Collector) to keep all spans for a trace until it completes, then deciding to save it if it contains an error or high latency (>99th percentile).

Adaptive Sampling: Dynamically adjusting sampling rates based on service health or traffic volume to keep the ingestion rate constant.

PII Scrubbing at the Edge: Using OTel Collector processors to automatically redact sensitive information (e.g., credit card numbers in span attributes) before it leaves the service network.

Query Optimization: Using "Searchable Snapshots" in ElasticSearch to reduce costs for older data while keeping it queryable.

Design Breakdown

Functional Requirements

Core Use Cases:

Track a single request flow across multiple heterogeneous microservices.

Visualize trace spans in a timeline (Gantt chart).

Search traces by specific tags (e.g., user_id, http.status_code, service_name).

Correlate logs and metrics with specific Trace IDs.

Scope Control:

In-scope: Instrumentation, collection, ingestion pipeline, storage, and basic UI query.

Out-of-scope: Automatic anomaly detection, automated root cause remediation, and long-term cold storage (Glacier/S3).

Non-Functional Requirements

Scale: Support 10M spans/sec ingestion.

Latency: End-to-end trace visibility within 1 minute (near real-time). Minimal application-side latency (<1ms).

Availability: 99.9% availability. Tracing system failure must never cause the primary application to fail (fail-open).

Consistency: Eventual consistency is acceptable; spans might arrive out of order.

Security: Encrypted transit (TLS) and Role-Based Access Control (RBAC) for viewing traces.

Estimation

Traffic: 10M spans/sec.

Sampling (1%): 100k spans/sec recorded.

Storage:

1 Span

\approx

1KB.

100k spans/sec

\times

1KB = 100MB/s (800Mbps) ingestion.

Daily storage: 100MB/s

\times

86,400s

\approx

8.6 TB/day.

7-day retention: ~60 TB (uncompressed). With ES overhead/replicas: ~120 TB.

Bandwidth: 100MB/s incoming to the Kafka cluster.

Blueprint

Concise Summary: A sidecar-based collection system that uses OpenTelemetry for instrumentation, Kafka for ingestion buffering, and ElasticSearch for indexed searching of spans.

Major Components:

OpenTelemetry SDK: Embedded in microservices to generate spans and propagate context headers (B3/W3C).

OTel Agent (Sidecar): Local collector that receives spans from the app via localhost to minimize network overhead.

Kafka Cluster: Acts as a high-throughput buffer to decouple data collection from storage indexing.

Trace Ingester: A scalable consumer group that processes spans from Kafka and writes them to ElasticSearch.

ElasticSearch: The primary storage and search engine for span data.

Simplicity Audit: This design avoids complex tail-based sampling logic in the MVP, relying on simple head-based sampling to control volume. It uses industry-standard OTel components to reduce custom code.

Architecture Decision Rationale:

Kafka: Essential because ElasticSearch cannot handle the massive, spiky write volume of raw spans without a buffer.

ElasticSearch: Best-in-class for full-text search on span tags and attributes.

OpenTelemetry: Provides a vendor-agnostic standard, ensuring the system can support any programming language.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not majorly impactful for the tracing backend itself, but the Query UI should be behind a standard Load Balancer.

Security & Perimeter:

API Gateway: Must ensure traceparent headers are stripped or sanitized from external client requests to prevent "trace injection" attacks.

Authentication: OTel Collector should require an API Key or mTLS from the agents to prevent rogue data ingestion.

Service

Topology & Scaling:

OTel Agent: Deployed as a DaemonSet (Kubernetes) or Sidecar. Scales 1:1 with compute nodes.

OTel Collector: Stateless deployment in K8s, scaled horizontally based on CPU/Memory usage.

API Schema Design (Internal):

Span Export (gRPC):

Endpoint: Export(ExportTraceServiceRequest)

Protocol: OTLP (OpenTelemetry Line Protocol) over gRPC.

Idempotency: Not strictly required for telemetry.

Resilience:

Circuit Breaker: If the Agent/Collector is overloaded, the SDK should drop spans locally rather than slowing down the application.

Load Leveling: Kafka handles spikes in span generation (e.g., during a deployment or an incident).

Storage

Access Pattern: Write-heavy (100k writes/sec). Reads are infrequent but require low latency for debugging.

Database Table Design (ElasticSearch Index Mapping):

trace_id: Keyword (Index for lookup).

span_id: Keyword.

parent_span_id: Keyword.

operation_name: Keyword.

start_time: Date (Nanosecond precision).

duration_nano: Long.

tags: Flattened/Nested object for searchability (e.g., http.method, service.name).

Technical Selection: ElasticSearch/OpenSearch.

Rationale: Support for high-volume indexing and complex filtering on tags.

Distribution Logic:

Sharding: Index per day (e.g., traces-2023-10-27).

Rollover Policy: Use ILM (Index Lifecycle Management) to delete indices after 7 days.

Sharding Key: trace_id (ensures all spans for a trace are likely in the same shard if routing is used, though often time-based is preferred for logs).

Messaging

Purpose: Buffering and Decoupling.

Event Schema: OTLP Protobuf encoded spans.

Throughput & Partitioning:

Partition by trace_id to ensure ordering of spans within a trace for any downstream stateful processing (like tail-sampling in the future).

Technical Selection: Kafka.

Rationale: High throughput, durable retention, and excellent support for consumer groups.

Data Processing

Processing Model: The Trace Ingester is a simple stream processor.

Processing DAG: Kafka Source -> Batching -> ES Sink.

Correctness: "At-least-once" delivery. Duplicate spans are naturally handled by ES using the span_id as the document ID.

Technical Selection: Go-based custom consumer or OTel Collector Export.

Rationale: High performance and low memory footprint for simple IO-bound tasks.

Infrastructure (Optional)

Observability: The tracing system must trace itself (Meta-tracing). Monitor Kafka lag and ES indexing latency.

Distributed Coordination: Not required for simple ingestion, but Kafka handles its own partition balancing.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We choose Availability/Partition Tolerance over Consistency. If a span is lost, it's better than the application crashing.

Optimization (Sampling):

Head-based: Sample at the entry point (API Gateway). The decision is propagated via the sampled flag in the trace header.

Cost Optimization: Store only "Summary" metrics (Aggregated duration/error counts) for 100% of traffic, but only full traces for 1%.

Bottleneck Analysis:

ES Indexing: If ES hits a limit, Kafka lag will increase. We scale ES by adding data nodes and increasing shard counts.

Hot Shards: Avoid using service_name as a routing key; stick to trace_id for even distribution.

Distinguishing Insights:

Context Propagation: Use the W3C Trace Context standard to ensure interoperability with third-party vendors or legacy systems.

Clock Skew: Implement logic in the UI to normalize span start times across services with drifting system clocks using parent-child relationship heuristics.