The Question
Design

Scalable Distributed Block Storage System

Design a high-performance distributed block storage service similar to Amazon EBS. The system must support thousands of virtual machines, providing raw block device interfaces with sub-millisecond latency and strong read-after-write consistency. Key challenges include managing volume-to-node mappings at scale, ensuring high durability through replication across failure domains, and implementing efficient, non-blocking point-in-time snapshots. Discuss your strategy for handling storage node failures, minimizing the impact of 'noisy neighbors' in a multi-tenant environment, and optimizing the data path for high-IOPS workloads.
NVMe
RDMA
Etcd
S3
gRPC
SPDK
Chain Replication
CRC32
LSM-Tree
HMAC
Questions & Insights

Clarifying Questions

What is the typical block size and I/O pattern? (Assumption: 4KB - 256KB blocks; mostly random I/O for database workloads, sequential for logs).
What are the target latency SLAs? (Assumption: < 1ms for sub-4KB writes, < 5ms for 99th percentile).
Do we support multi-attach (multiple VMs to one volume)? (Assumption: No, single-writer model to simplify consistency and locking for the MVP).
What is the scale of the system? (Assumption: 100,000+ volumes, 100PB+ total storage, 1M+ aggregate IOPS).
How is durability defined? (Assumption: 99.999999999% (11 nines) durability, achieved via 3-way replication across different racks/failure domains).

Thinking Process

Core Bottleneck: How to achieve sub-millisecond write latency while ensuring strong consistency across 3 replicas?
Step 1: The Write Path: How does a block move from the VM guest OS to the physical disks? We use a Chain Replication or Primary-Backup model with synchronous writes to ensure "Read-after-Write" consistency.
Step 2: Metadata Management: How do we map VolumeID + Offset to a physical NodeID + DiskAddress? We use a partitioned, highly available Metadata Service (likely backed by Etcd/Zookeeper) and client-side caching to avoid the metadata lookup bottleneck.
Step 3: Fault Tolerance: How does the system react when a Storage Node dies? We use a Placement Service and Health Monitor to detect failures, invalidate old leases, and trigger background re-replication (re-sharding) to restore the durability factor.
Step 4: Snapshotting: How to take a point-in-time backup without stopping I/O? We use Redirect-on-Write (RoW) or Copy-on-Write (CoW) at the block level, pushing frozen data to Object Storage (S3).

Bonus Points

Checksumming & Scrubbing: Implement end-to-end CRC32 checksums to detect "bit rot" and "silent data corruption" during transit and at rest.
Log-Structured Storage Engine: Use an Append-only storage format on the storage nodes (similar to LSM-trees) to turn random writes into sequential writes, maximizing SSD/NVMe lifespan and performance.
NVMe-over-Fabrics (NVMe-oF): Mention utilizing RDMA (Remote Direct Memory Access) to bypass the kernel stack and reduce network latency for high-performance volume classes.
Admission Control: Implement token-bucket rate limiting at the Client Driver level to prevent "noisy neighbor" effects and ensure IOPS/Throughput isolation.
Design Breakdown

Functional Requirements

Core Use Cases:
Provision/Delete volumes of variable sizes.
Attach/Detach volumes to VM instances.
Read/Write fixed-size blocks (Strong Consistency).
Create snapshots and restore volumes.
Scope Control:
In-Scope: Control plane (management), Data plane (I/O), Replication, and Snapshotting.
Out-of-Scope: Filesystem-level logic (handled by Guest OS), Multi-region synchronous replication (too high latency for block storage).

Non-Functional Requirements

Scale: 100k+ active volumes; petabyte-scale storage.
Latency: Single-digit millisecond latency for I/O operations.
Availability & Reliability: 99.99% availability; 11 nines durability.
Consistency: Strict Read-after-Write (Linearizability).
Fault Tolerance: Automatic recovery from disk, node, and rack failures.
Security: Data encryption at rest and in transit (AES-256).

Estimation

Traffic: 100,000 volumes * 1,000 IOPS average = 100M IOPS cluster-wide.
Storage: 100,000 volumes * 1TB average = 100PB raw capacity (300PB with 3x replication).
Bandwidth: 100M IOPS * 4KB/block = 400 GB/s aggregate throughput.
Metadata: Each 4KB block mapping = 32 bytes. For 100PB, mapping is astronomical. Optimization: Manage "Extents" (e.g., 1GB chunks) in the central metadata, and local offsets within Storage Nodes.

Blueprint

Concise Summary: A distributed system consisting of a Control Plane for orchestration and a Data Plane for high-speed I/O. The system uses a Client Driver (on the VM Host) to route I/O to a Primary Storage Node, which replicates to Secondary Nodes via Chain Replication.
Major Components:
Control Plane: Manages volume lifecycle, placement logic, and health monitoring.
Metadata Store: A consistent KV store (Etcd) holding volume-to-node mappings.
Storage Nodes: High-performance servers managing local NVMe/SSD disks and block persistence.
Client Driver: A kernel-level or user-space driver on the VM host that exposes the block device.
Task Queue: Handles asynchronous operations like volume creation and snapshots.
Simplicity Audit: This design avoids complex distributed lock managers by using a single-writer lease model per volume, ensuring consistency without the overhead of multi-node coordination for every I/O.
Architecture Decision Rationale:
Chain Replication: Chosen over Paxos for the data plane to maximize throughput (all nodes use full bandwidth) and simplify the write path.
Extent-based Mapping: Reduces metadata size by 10,000x compared to per-block mapping.
S3 for Snapshots: Leverages cheap, durable object storage for cold data rather than keeping snapshots on expensive NVMe.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Control Plane: Stateless microservices scaled horizontally.
Data Plane: Storage nodes grouped into "Cells" (e.g., 1000 nodes) to limit blast radius.
API Schema Design:
POST /volumes: Create volume (Size, IOPS, Zone).
POST /attachments: Attach volume to InstanceID.
POST /snapshots: Trigger async snapshot.
Protocol: gRPC for internal control plane communication; custom binary protocol for data plane I/O to minimize overhead.
Resilience & Reliability:
Leases: Storage nodes hold a lease from the Metadata Store. If a heartbeat fails, the lease expires, and the Control Plane promotes a new Primary.
Circuit Breaker: The Client Driver will stop sending I/O to a suspected dead Primary and request a new mapping from the Control Plane.

Storage

Access Pattern: 70/30 Read/Write ratio (typical server workload); high burstiness.
Database Table Design (Metadata DB):
Volumes: VolumeID (PK), UserID, Size, State (Creating, Available, In-Use), CreatedAt.
Mappings: VolumeID, PartitionID, NodeList (Primary, Secondary, Tertiary), LeaseExpiry.
Technical Selection:
Metadata: Etcd (Strong consistency needed for lease management and mapping).
Data Nodes: Custom C++/Rust engine using SPDK (Storage Performance Development Kit) to interact with NVMe directly via Zero-copy.
Distribution Logic:
Volumes are split into Extents (e.g., 1GB).
Extents are distributed across nodes based on rack awareness (ensuring no two replicas share the same Top-of-Rack switch).

Cache

Purpose & Justification: Reduce latency for metadata lookups.
Key-Value Schema:
Client Side Cache: Maps VolumeID -> [NodeIP_Primary, NodeIP_Secondary...].
TTL: Short (e.g., 30s) or invalidated via push notification from Control Plane if a failure occurs.
Failure Handling: On cache miss or "Connection Refused" from a storage node, the driver bypasses the cache and hits the Control Plane.

Messaging

Purpose & Decoupling: Asynchronous execution of long-running tasks like volume formatting, zeroing, and snapshot uploading.
Event / Topic Schema:
Topic: storage-tasks.
Payload: {task_type: "SNAPSHOT", volume_id: "vol-123", destination: "s3://bucket/snap-1"}.
Technical Selection: SQS or RabbitMQ. High throughput isn't required here; reliability of task delivery is key.

Data Processing

Processing Model: Snapshot Worker (Background process).
Processing Logic: Reads blocks from the Storage Node, compresses them, and uploads them to S3.
Technical Selection: Custom Go/Rust microservice.
Correctness: Uses "Dirty Bitmaps" on the storage node to track which blocks changed since the last snapshot (Incremental Snapshots).
Wrap Up

Advanced Topics

Trade-offs (PACELC): In the face of a network partition, we choose Consistency (C) over Availability (A). A block storage system cannot allow "split-brain" (two VMs writing to the same block differently), as it would corrupt the filesystem.
Reliability & Failure Handling:
Primary Failure: Detected by heartbeat timeout. The Placement Engine marks the node "Down", elects the first Secondary as the new Primary, and updates the Metadata Store.
Data Scrubbing: Background process reads all blocks periodically, recalculates checksums, and compares them with replicas to fix "bit rot."
Bottleneck Analysis:
Hot Partitions: If one volume is extremely busy, it can saturate a node's 10Gbps/25Gbps NIC. Mitigation: Shard a single large volume across multiple Storage Node sets.
Security:
Isolation: Each VM's Client Driver only has credentials to access its specific VolumeIDs via HMAC-signed requests to Storage Nodes.