DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Distributed Tracing System Design

Design a high-throughput distributed tracing system capable of capturing, storing, and visualizing service request flows across thousands of microservices. The system must handle over 1 million spans per second with minimal impact on application performance. Address the challenges of non-blocking data collection, efficient storage for long-term retention versus short-term search, and how to handle trace reconstruction across heterogeneous service environments. Discuss your strategy for sampling, search indexing, and ensuring the system remains reliable during massive traffic spikes.
gRPC
Kafka
Cassandra
ElasticSearch
W3C Trace Context
Sidecar Pattern
NoSQL
Questions & Insights

Clarifying Questions

Scale & Throughput: What is the expected volume of spans per second?
Assumption: 1M spans per second at peak.
Data Retention: How long do we need to store the traces?
Assumption: 7 days of searchable data; anything older is archived or dropped.
Sampling Strategy: Do we need to capture 100% of traces, or is sampling acceptable?
Assumption: Support for head-based sampling (at the start of a trace) for the MVP, with architectural hooks for tail-based sampling.
Latency Overhead: What is the maximum performance impact allowed on the instrumented services?
Assumption: The tracing SDK/Agent must introduce < 1ms of overhead to the application's critical path.
Search Requirements: Do we need to search by specific tags (e.g., user_id, http.status_code) or just Trace ID?
Assumption: Full-text and tag-based search are required for debugging.

Thinking Process

Low-Overhead Collection: How do we ensure tracing doesn't crash the production services? (Answer: Async batching in agents and UDP/gRPC out-of-process).
Write-Heavy Ingestion: How do we handle 1M+ writes per second without data loss? (Answer: Distributed message queue for load leveling).
Efficient Storage: How do we store traces for fast lookup by ID and flexible search by tags? (Answer: Dual-storage strategy: Wide-column for raw spans, Search Engine for indexing).
Context Propagation: How do we maintain trace identity across heterogeneous services? (Answer: Standardized headers like W3C Trace Context).

Bonus Points

Tail-based Sampling: Implementing a mechanism to keep 100% of "interesting" traces (errors, high latency) while sampling 1% of "healthy" traces to save storage costs.
Clock Skew Correction: Algorithmic adjustment of span timestamps when parent-child spans come from different servers with unsynchronized clocks.
Adaptive Sampling: Dynamically adjusting sampling rates based on service traffic patterns to maintain a constant ingest volume.
PII Scrubbing at Source: Edge-level regex-based masking of sensitive data (passwords, credit cards) within spans before they leave the application host.
Design Breakdown

Functional Requirements

Core Use Cases:
Generate and propagate unique Trace IDs across service boundaries.
Record Span data (start time, end time, tags, logs).
Search traces by Trace ID, Service Name, Operation, or custom tags.
Visualize a trace as a Gantt chart showing parent-child relationships.
Scope Control:
In-scope: Distributed context propagation, span ingestion, storage, and query API.
Out-of-scope: Metrics/Logging integration (though traces can link to them), service mesh sidecar injection logic.

Non-Functional Requirements

Scale: 1M spans/sec; ~100GB/hour of raw data.
Latency: Querying a Trace ID should take < 500ms. Ingestion must be non-blocking.
Availability: High availability for the ingestion pipeline (loss of spans is better than slowing down production traffic).
Consistency: Eventual consistency is acceptable; a trace might take 10-30 seconds to appear in search.
Fault Tolerance: Use of buffers (Kafka) to handle downstream storage outages.

Estimation

Traffic: 1M spans/sec * 1KB per span = 1 GB/sec ingestion.
Storage: 1 GB/sec 3600s 24h * 7 days ≈ 600 TB (pre-compression).
Bandwidth: ~8 Gbps incoming network traffic to the collector cluster.

Blueprint

Concise Summary: A sidecar-based collection system that pushes spans to a Kafka-buffered ingestion pipeline, storing raw spans in Cassandra for fast ID-based retrieval and indexing metadata in ElasticSearch for complex queries.
Major Components:
Tracing SDK/Agent: Library and sidecar process that collects spans asynchronously from application code.
Collector Service: Stateless gRPC servers that receive spans, validate them, and push to Kafka.
Kafka: Acts as a buffer to protect storage layers from traffic spikes.
Indexer Service: Consumes from Kafka, writes raw data to Cassandra, and sends searchable tags to ElasticSearch.
Query Service: API to fetch trace data by ID (from Cassandra) or search (from ElasticSearch).
Simplicity Audit: This architecture uses "Storage per Access Pattern." While running two databases (Cassandra/ES) adds complexity, it is the standard "Simplest" way to handle 1M+ writes/sec with full-text search capabilities without causing DB lockups.
Architecture Decision Rationale:
Why this?: Decoupling ingestion from storage via Kafka ensures that high-traffic bursts in production don't crash the tracing system. Cassandra provides linear write scaling for the bulk data.
Functional Satisfaction: Meets all requirements for collection, storage, and visualization.
Non-functional Satisfaction: High write-availability (AP system), horizontally scalable.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling
Collector: Stateless gRPC nodes. Scale based on CPU/Network throughput.
Indexer: Scaled based on Kafka consumer lag.
Query Service: Low-traffic, scaled for availability.
API Schema Design
Ingest API (Internal): PostSpans(stream SpanList) via gRPC for high performance.
Query API:
GET /trace/{trace_id}: Returns full span tree.
GET /search?service=X&tag=error:true: Returns list of Trace IDs.
Resilience & Reliability
UDP/gRPC Async: Agents use non-blocking calls. If the agent's internal buffer is full, it drops spans rather than slowing the app.
Circuit Breaker: Collectors stop accepting spans if Kafka is down.

Storage

Access Pattern
High write (1M/sec).
Trace ID lookup: Key-value pattern.
Search: Multi-dimensional filtering.
Database Table Design
Cassandra (spans):
trace_id (Partition Key)
span_id (Clustering Key)
parent_id, operation_name, start_time, duration, tags_blob
Technical Selection
Cassandra: Optimized for high-volume writes.
ElasticSearch: Optimized for inverted-index searches on tags like http.status or user_id.
Distribution Logic
Cassandra partitions by trace_id to ensure all spans for one request are co-located on the same shard for fast retrieval.

Messaging

Purpose & Decoupling: Kafka decouples the high-velocity ingestion from the slower indexing and storage operations.
Throughput & Partitioning: Partition by trace_id to ensure ordering of spans for a single trace within the indexer, though spans are generally independent.
Technical Selection: Kafka. Required for its high-throughput persistent log capabilities.

Data Processing

Processing Model: Indexer Service acts as a stream processor.
Processing DAG: Source (Kafka) -> Extract Searchable Tags -> Sink (ES) -> Save Raw Span (Cassandra).
Correctness Guarantees: At-least-once delivery. Duplicate spans (if any) are naturally handled by Cassandra's UPSERT behavior based on trace_id + span_id.
Wrap Up

Advanced Topics

Trade-offs: Consistency vs. Availability. We prioritize Availability. If the Indexer lags, developers see "Trace Not Found" for a few seconds. This is acceptable for a debugging tool.
Reliability: If ElasticSearch fails, users can still look up traces by ID directly from Cassandra, allowing for "degraded but functional" debugging.
Bottleneck Analysis:
Hot Shards: A single extremely large trace (e.g., a batch job with 100k spans) can create a hot partition in Cassandra. Mitigation: Limit max spans per trace ID.
ES Index Bloat: Too many unique tags can explode the ES index. Mitigation: Implement a whitelist of searchable tags.
Security: Context Propagation Safety. Ensure that Trace-IDs are stripped/validated at the API Gateway to prevent external users from injecting/spoofing trace headers to probe internal architecture.