The Question
Design

Scalable Distributed Block Storage System

Design a high-performance distributed block storage service similar to Amazon EBS. The system must support thousands of virtual machines, providing raw block device interfaces with sub-millisecond latency and strong read-after-write consistency. Key challenges include managing volume-to-node mappings at scale, ensuring high durability through replication across failure domains, and implementing efficient, non-blocking point-in-time snapshots. Discuss your strategy for handling storage node failures, minimizing the impact of 'noisy neighbors' in a multi-tenant environment, and optimizing the data path for high-IOPS workloads.
NVMe
RDMA
Etcd
S3
gRPC
SPDK
Chain Replication
CRC32
LSM-Tree
HMAC
Questions & Insights

Clarifying Questions

What is the typical block size and I/O pattern? (Assumption: 4KB - 256KB blocks; mostly random I/O for database workloads, sequential for logs).
What are the target latency SLAs? (Assumption: < 1ms for sub-4KB writes, < 5ms for 99th percentile).
Do we support multi-attach (multiple VMs to one volume)? (Assumption: No, single-writer model to simplify consistency and locking for the MVP).
What is the scale of the system? (Assumption: 100,000+ volumes, 100PB+ total storage, 1M+ aggregate IOPS).
How is durability defined? (Assumption: 99.999999999% (11 nines) durability, achieved via 3-way replication across different racks/failure domains).

Thinking Process

Core Bottleneck: How to achieve sub-millisecond write latency while ensuring strong consistency across 3 replicas?
Step 1: The Write Path: How does a block move from the VM guest OS to the physical disks? We use a Chain Replication or Primary-Backup model with synchronous writes to ensure "Read-after-Write" consistency.
Step 2: Metadata Management: How do we map VolumeID + Offset to a physical NodeID + DiskAddress? We use a partitioned, highly available Metadata Service (likely backed by Etcd/Zookeeper) and client-side caching to avoid the metadata lookup bottleneck.
Step 3: Fault Tolerance: How does the system react when a Storage Node dies? We use a Placement Service and Health Monitor to detect failures, invalidate old leases, and trigger background re-replication onto healthy nodes to restore the replication factor.
Step 4: Snapshotting: How to take a point-in-time backup without stopping I/O? We use Redirect-on-Write (RoW) or Copy-on-Write (CoW) at the block level, pushing frozen data to Object Storage (S3).
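To make Step 4 concrete, here is a minimal in-memory sketch of Redirect-on-Write at the block level. The class RowVolume and its method names are hypothetical, chosen only for illustration; a real storage node would persist the block map and ship frozen blocks to object storage in the background.

```python
class RowVolume:
    """Toy Redirect-on-Write volume (hypothetical names, in-memory only).

    Block payloads live in an append-only store. A snapshot freezes a copy of
    the block map, not the data, so taking one does not stop I/O on the data.
    """

    def __init__(self):
        self.store = []        # append-only list of block payloads
        self.block_map = {}    # logical block number -> index into store
        self.snapshots = {}    # snapshot id -> frozen copy of block_map

    def write(self, lbn, data):
        # New data is redirected to a fresh location. The old location is
        # never overwritten, so snapshots that point at it remain valid.
        self.store.append(data)
        self.block_map[lbn] = len(self.store) - 1

    def read(self, lbn, snapshot_id=None):
        mapping = self.block_map if snapshot_id is None else self.snapshots[snapshot_id]
        idx = mapping.get(lbn)
        return None if idx is None else self.store[idx]

    def snapshot(self, snapshot_id):
        # Only the map is copied; block payloads are shared with the live volume.
        self.snapshots[snapshot_id] = dict(self.block_map)
```

Writing block 0, taking a snapshot, and then overwriting block 0 leaves the snapshot reading the old payload while the live volume reads the new one.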

Bonus Points

Checksumming & Scrubbing: Implement end-to-end CRC32 checksums to detect "bit rot" and "silent data corruption" during transit and at rest.
Log-Structured Storage Engine: Use an Append-only storage format on the storage nodes (similar to LSM-trees) to turn random writes into sequential writes, maximizing SSD/NVMe lifespan and performance.
NVMe-over-Fabrics (NVMe-oF): Mention utilizing RDMA (Remote Direct Memory Access) to bypass the kernel stack and reduce network latency for high-performance volume classes.
Admission Control: Implement token-bucket rate limiting at the Client Driver level to prevent "noisy neighbor" effects and ensure IOPS/Throughput isolation.
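As an illustration of the Admission Control point, the following is a minimal token-bucket sketch. The class name, the parameters rate_iops and burst, and the choice to pass the clock in explicitly are all assumptions made to keep the example deterministic; a real Client Driver would read a monotonic clock and likely keep separate buckets for IOPS and throughput.

```python
class TokenBucket:
    """Token bucket for per-volume IOPS limiting (illustrative sketch)."""

    def __init__(self, rate_iops, burst, now=0.0):
        self.rate = float(rate_iops)   # tokens added per second
        self.capacity = float(burst)   # maximum tokens that can accumulate
        self.tokens = float(burst)     # start full so short bursts are allowed
        self.last = now                # time of the last refill

    def try_acquire(self, now, cost=1.0):
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True    # admit the I/O
        return False       # caller should queue or delay the I/O
```

A volume provisioned for 100 IOPS with a burst of 2 admits two back-to-back requests, rejects the third, and admits again once enough time has passed to refill.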
Design Breakdown

Functional Requirements

Core Use Cases:
Provision/Delete volumes of variable sizes.
Attach/Detach volumes to VM instances.
Read/Write fixed-size blocks (Strong Consistency).
Create snapshots and restore volumes.
Scope Control:
In-Scope: Control plane (management), Data plane (I/O), Replication, and Snapshotting.
Out-of-Scope: Filesystem-level logic (handled by Guest OS), Multi-region synchronous replication (too high latency for block storage).

Non-Functional Requirements

Scale: 100k+ active volumes; petabyte-scale storage.
Latency: Sub-millisecond latency for small (sub-4KB) writes; single-digit millisecond latency at the 99th percentile.
Availability & Reliability: 99.99% availability; 11 nines durability.
Consistency: Strict Read-after-Write (Linearizability).
Fault Tolerance: Automatic recovery from disk, node, and rack failures.
Security: Data encryption at rest and in transit (AES-256).

Estimation

Traffic: 100,000 volumes * 1,000 IOPS average = 100M IOPS cluster-wide.
Storage: 100,000 volumes * 1TB average = 100PB logical capacity (300PB raw with 3x replication).
Bandwidth: 100M IOPS * 4KB/block = 400 GB/s aggregate throughput.
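Metadata: Each 4KB block mapping = 32 bytes. For 100PB, that is about 25 trillion mappings, or roughly 800TB of metadata, which is impractical for a central store. Optimization: Manage "Extents" (e.g., 1GB chunks) in the central metadata (about 100 million entries for 100PB), and local offsets within Storage Nodes.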
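The estimates above can be checked with a few lines of arithmetic. This sketch uses decimal units (1KB = 1,000 bytes) and counts mapping entries against the 100PB logical capacity; the variable names are illustrative only.

```python
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15

volumes = 100_000
avg_iops_per_volume = 1_000
avg_volume_bytes = 1 * TB
block_bytes = 4 * KB
extent_bytes = 1 * GB
block_mapping_bytes = 32
replicas = 3

total_iops = volumes * avg_iops_per_volume            # cluster-wide IOPS
logical_bytes = volumes * avg_volume_bytes            # capacity seen by users
raw_bytes = logical_bytes * replicas                  # capacity on disk
throughput_bytes = total_iops * block_bytes           # bytes per second

block_entries = logical_bytes // block_bytes          # per-block mappings
per_block_meta_bytes = block_entries * block_mapping_bytes
extent_entries = logical_bytes // extent_bytes        # per-extent mappings
reduction = block_entries // extent_entries           # entry-count reduction

print(f"IOPS: {total_iops:,}")
print(f"Logical: {logical_bytes // PB} PB, raw: {raw_bytes // PB} PB")
print(f"Throughput: {throughput_bytes // GB} GB/s")
print(f"Per-block metadata: {per_block_meta_bytes // TB} TB")
print(f"Extent entries: {extent_entries:,} ({reduction:,}x fewer)")
```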

Blueprint

Concise Summary: A distributed system consisting of a Control Plane for orchestration and a Data Plane for high-speed I/O. The system uses a Client Driver (on the VM Host) to route I/O to a Primary Storage Node, which replicates to Secondary Nodes via Chain Replication.
Major Components:
Control Plane: Manages volume lifecycle, placement logic, and health monitoring.
Metadata Store: A consistent KV store (Etcd) holding volume-to-node mappings.
Storage Nodes: High-performance servers managing local NVMe/SSD disks and block persistence.
Client Driver: A kernel-level or user-space driver on the VM host that exposes the block device.
Task Queue: Handles asynchronous operations like volume creation and snapshots.
Simplicity Audit: This design avoids complex distributed lock managers by using a single-writer lease model per volume, ensuring consistency without the overhead of multi-node coordination for every I/O.
Architecture Decision Rationale:
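Chain Replication: Chosen over Paxos for the data plane to maximize throughput (each node forwards to only one successor, so no single node's NIC carries the whole replication fan-out) and simplify the write path.
Extent-based Mapping: Reduces the number of mapping entries by roughly 250,000x (1GB / 4KB) compared to per-block mapping.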
S3 for Snapshots: Leverages cheap, durable object storage for cold data rather than keeping snapshots on expensive NVMe.
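The chain replication write path can be sketched as follows. This is a toy, single-process model with hypothetical names (ChainNode, build_chain); it assumes the standard chain replication rule that a write is acknowledged only after the tail has applied it, and it does not model chain repair.

```python
class ChainNode:
    """One replica in a replication chain (illustrative, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}        # logical block number -> payload
        self.successor = None   # next node in the chain, None for the tail
        self.alive = True

    def write(self, lbn, data):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[lbn] = data
        if self.successor is not None:
            # Forward down the chain. If this raises, the write is NOT
            # acknowledged; a real system would keep it pending until the
            # chain is repaired.
            self.successor.write(lbn, data)
        # Returning here means every node from this one to the tail applied it.
        return True


def build_chain(names):
    """Link nodes head -> ... -> tail in the order given."""
    nodes = [ChainNode(n) for n in names]
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.successor = downstream
    return nodes
```

With a three-node chain, a successful write at the head is visible at the tail, and a write issued while the middle node is down raises instead of being acknowledged.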

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Control Plane: Stateless microservices scaled horizontally.
Data Plane: Storage nodes grouped into "Cells" (e.g., 1000 nodes) to limit blast radius.
API Schema Design:
POST /volumes: Create volume (Size, IOPS, Zone).
POST /attachments: Attach volume to InstanceID.
POST /snapshots: Trigger async snapshot.
Protocol: gRPC for internal control plane communication; custom binary protocol for data plane I/O to minimize overhead.
Resilience & Reliability:
Leases: Storage nodes hold a lease from the Metadata Store. If a heartbeat fails, the lease expires, and the Control Plane promotes a new Primary.
Circuit Breaker: The Client Driver will stop sending I/O to a suspected dead Primary and request a new mapping from the Control Plane.
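The lease behavior described above can be sketched as a small table keyed by volume. LeaseTable and its methods are hypothetical names, and this in-memory version only illustrates the rule that a new Primary may take over only after the old lease has expired; in the design above, the Metadata Store would hold this state.

```python
class LeaseTable:
    """Single-writer lease per volume (illustrative, in-memory)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.leases = {}   # volume_id -> (holder node, expiry time)

    def acquire(self, volume_id, node, now):
        holder, expiry = self.leases.get(volume_id, (None, 0.0))
        # Refuse if another node still holds an unexpired lease.
        if holder is not None and holder != node and now < expiry:
            return False
        self.leases[volume_id] = (node, now + self.ttl)
        return True

    def renew(self, volume_id, node, now):
        # A heartbeat extends the lease only while it is still valid.
        holder, expiry = self.leases.get(volume_id, (None, 0.0))
        if holder != node or now >= expiry:
            return False
        self.leases[volume_id] = (node, now + self.ttl)
        return True

    def holder(self, volume_id, now):
        holder, expiry = self.leases.get(volume_id, (None, 0.0))
        return holder if now < expiry else None
```

If node A stops renewing, node B's acquire calls fail until A's lease expires, after which B becomes the holder and A's late renewals are rejected.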

Storage

Access Pattern: 70/30 Read/Write ratio (typical server workload); high burstiness.
Database Table Design (Metadata DB):
Volumes: VolumeID (PK), UserID, Size, State (Creating, Available, In-Use), CreatedAt.
Mappings: VolumeID, PartitionID, NodeList (Primary, Secondary, Tertiary), LeaseExpiry.
Technical Selection:
Metadata: Etcd (Strong consistency needed for lease management and mapping).
Data Nodes: Custom C++/Rust engine using SPDK (Storage Performance Development Kit) to access NVMe devices directly from user space with zero-copy I/O.
Distribution Logic:
Volumes are split into Extents (e.g., 1GB).
Extents are distributed across nodes based on rack awareness (ensuring no two replicas share the same Top-of-Rack switch).
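A minimal sketch of the distribution logic follows. It assumes a 1GB extent means 2**30 bytes and that each rack maps to one Top-of-Rack switch; the function names and the simple round-robin placement are illustrative assumptions, not the placement policy of any real system.

```python
EXTENT_BYTES = 1 << 30   # assumption: 1GB extent taken as 2**30 bytes


def locate(offset):
    """Map a byte offset in a volume to (extent index, offset within extent)."""
    return divmod(offset, EXTENT_BYTES)


def place_replicas(nodes_by_rack, extent_id, replicas=3):
    """Pick one node from each of `replicas` distinct racks.

    nodes_by_rack: dict of rack name -> non-empty list of node names.
    Placement is deterministic round-robin keyed on extent_id, so different
    extents spread across different racks and nodes.
    """
    racks = sorted(nodes_by_rack)
    if len(racks) < replicas:
        raise ValueError("not enough racks to keep replicas in separate failure domains")
    start = extent_id % len(racks)
    chosen = []
    for i in range(replicas):
        rack = racks[(start + i) % len(racks)]
        nodes = nodes_by_rack[rack]
        chosen.append((rack, nodes[extent_id % len(nodes)]))
    return chosen
```

For example, an offset of 1GB + 4KB falls at offset 4096 in extent 1, and every placement returned contains three different racks.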

Cache

Purpose & Justification: Reduce latency for metadata lookups.
Key-Value Schema:
Client Side Cache: Maps VolumeID -> [NodeIP_Primary, NodeIP_Secondary...].
TTL: Short (e.g., 30s) or invalidated via push notification from Control Plane if a failure occurs.
Failure Handling: On cache miss or "Connection Refused" from a storage node, the driver bypasses the cache and hits the Control Plane.
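The client-side cache behavior can be sketched as below. MappingCache and the fetch_from_control_plane callable are hypothetical stand-ins for the driver's cache and the Control Plane lookup; the clock is passed in explicitly to keep the example deterministic.

```python
class MappingCache:
    """Client-side cache of VolumeID -> node list with a short TTL (sketch)."""

    def __init__(self, fetch_from_control_plane, ttl_seconds=30.0):
        self.fetch = fetch_from_control_plane   # callable: volume_id -> list of node IPs
        self.ttl = ttl_seconds
        self.entries = {}                       # volume_id -> (nodes, expiry time)

    def lookup(self, volume_id, now):
        entry = self.entries.get(volume_id)
        if entry is not None and now < entry[1]:
            return entry[0]                     # fresh hit, no Control Plane call
        return self.refresh(volume_id, now)     # miss or expired entry

    def refresh(self, volume_id, now):
        # Also called directly when a storage node refuses the connection.
        nodes = self.fetch(volume_id)
        self.entries[volume_id] = (nodes, now + self.ttl)
        return nodes

    def invalidate(self, volume_id):
        # Hook for a push notification from the Control Plane after a failure.
        self.entries.pop(volume_id, None)
```

Repeated lookups within the TTL reach the Control Plane once; an expired entry or an explicit invalidation triggers a new fetch.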

Messaging

Purpose & Decoupling: Asynchronous execution of long-running tasks like volume formatting, zeroing, and snapshot uploading.
Event / Topic Schema:
Topic: storage-tasks.
Payload: {task_type: "SNAPSHOT", volume_id: "vol-123", destination: "s3://bucket/snap-1"}.
Technical Selection: SQS or RabbitMQ. High throughput isn't required here; reliability of task delivery is key.

Data Processing

Processing Model: Snapshot Worker (Background process).
Processing Logic: Reads blocks from the Storage Node, compresses them, and uploads them to S3.
Technical Selection: Custom Go/Rust microservice.
Correctness: Uses "Dirty Bitmaps" on the storage node to track which blocks changed since the last snapshot (Incremental Snapshots).
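The incremental snapshot flow can be sketched as follows. DirtyBitmap, incremental_snapshot, and the upload callable (standing in for an S3 upload) are hypothetical names. Note one simplification: this sketch clears the bitmap before uploads finish, whereas a real worker would keep the old bitmap until the upload succeeds.

```python
import zlib


class DirtyBitmap:
    """One bit per block, set when the block changes after the last snapshot."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def mark(self, lbn):
        self.bits[lbn // 8] |= 1 << (lbn % 8)

    def is_dirty(self, lbn):
        return bool(self.bits[lbn // 8] & (1 << (lbn % 8)))

    def drain(self):
        """Return the dirty block numbers in order and reset the bitmap."""
        dirty = [lbn for lbn in range(self.num_blocks) if self.is_dirty(lbn)]
        self.bits = bytearray(len(self.bits))
        return dirty


def incremental_snapshot(blocks, bitmap, upload):
    """Compress and upload only the blocks changed since the last snapshot.

    blocks: dict of block number -> bytes; upload: callable(lbn, compressed_bytes).
    """
    for lbn in bitmap.drain():
        upload(lbn, zlib.compress(blocks[lbn]))
```

Marking blocks 1 and 9 and running one snapshot uploads exactly those two blocks; a second run with no new writes uploads nothing.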
Wrap Up

Advanced Topics

Trade-offs (PACELC): In the face of a network partition, we choose Consistency (C) over Availability (A); in normal operation, synchronous replication also trades some latency for consistency, making this a PC/EC design. A block storage system cannot allow "split-brain" (two writers updating the same block differently), as it would corrupt the filesystem.
Reliability & Failure Handling:
Primary Failure: Detected by heartbeat timeout. The Placement Engine marks the node "Down", elects the first Secondary as the new Primary, and updates the Metadata Store.
Data Scrubbing: Background process reads all blocks periodically, recalculates checksums, and compares them with replicas to fix "bit rot."
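A minimal sketch of the scrubbing check, using CRC32 from Python's zlib module. The function names and the in-memory replica dictionary are illustrative assumptions; the sketch also assumes the stored checksum itself is trustworthy, which a real system would need to protect separately.

```python
import zlib


def find_corrupt(replicas, stored_crc):
    """Return the nodes whose copy of the block fails the stored CRC32.

    replicas: dict of node name -> block bytes.
    """
    return [node for node, data in replicas.items() if zlib.crc32(data) != stored_crc]


def scrub_and_repair(replicas, stored_crc):
    """Overwrite corrupt copies with an intact one; return the repaired nodes."""
    bad = find_corrupt(replicas, stored_crc)
    good = [node for node in replicas if node not in bad]
    if not good:
        # No replica matches the checksum, so there is nothing safe to copy from.
        raise RuntimeError("no intact replica available for repair")
    for node in bad:
        replicas[node] = replicas[good[0]]
    return bad
```

If one of three replicas has a flipped bit, only that node is reported and its copy is replaced from a replica that still matches the checksum.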
Bottleneck Analysis:
Hot Partitions: If one volume is extremely busy, it can saturate a node's 10Gbps/25Gbps NIC. Mitigation: Shard a single large volume across multiple Storage Node sets.
Security:
Isolation: Each VM's Client Driver only has credentials to access its specific VolumeIDs via HMAC-signed requests to Storage Nodes.
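HMAC-signed requests could look like the sketch below, built on Python's standard hmac and hashlib modules. The message layout, the field names, the use of SHA-256, and the 30-second skew window are all assumptions for illustration; the source does not specify a wire format.

```python
import hashlib
import hmac


def sign_request(key, volume_id, op, offset, length, timestamp):
    """Sign the request fields with a per-volume secret key (bytes)."""
    message = f"{volume_id}|{op}|{offset}|{length}|{timestamp}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_request(key, volume_id, op, offset, length, timestamp, signature,
                   now, max_skew=30):
    """Storage-node side check: reject stale or tampered requests."""
    if abs(now - timestamp) > max_skew:
        return False   # limits how long a captured request can be replayed
    expected = sign_request(key, volume_id, op, offset, length, timestamp)
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

A signature made for vol-123 verifies for that volume within the time window, and fails if the volume ID is changed or the request is too old.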