The Question

Scalable Distributed Block Storage System

Design a high-performance distributed block storage service similar to Amazon EBS. The system must support thousands of virtual machines, providing raw block device interfaces with sub-millisecond latency and strong read-after-write consistency. Key challenges include managing volume-to-node mappings at scale, ensuring high durability through replication across failure domains, and implementing efficient, non-blocking point-in-time snapshots. Discuss your strategy for handling storage node failures, minimizing the impact of 'noisy neighbors' in a multi-tenant environment, and optimizing the data path for high-IOPS workloads.

NVMe

RDMA

Etcd

gRPC

SPDK

Chain Replication

CRC32

LSM-Tree

HMAC

Questions & Insights

Clarifying Questions

What is the typical block size and I/O pattern? (Assumption: 4KB - 256KB blocks; mostly random I/O for database workloads, sequential for logs).

What are the target latency SLAs? (Assumption: < 1ms for sub-4KB writes, < 5ms for 99th percentile).

Do we support multi-attach (multiple VMs to one volume)? (Assumption: No, single-writer model to simplify consistency and locking for the MVP).

What is the scale of the system? (Assumption: 100,000+ volumes, 100PB+ total storage, 1M+ aggregate IOPS).

How is durability defined? (Assumption: 99.999999999% (11 nines) durability, achieved via 3-way replication across different racks/failure domains).

Thinking Process

Core Bottleneck: How to achieve sub-millisecond write latency while ensuring strong consistency across 3 replicas?

Step 1: The Write Path: How does a block move from the VM guest OS to the physical disks? We use a Chain Replication or Primary-Backup model with synchronous writes to ensure "Read-after-Write" consistency.

Step 2: Metadata Management: How do we map VolumeID + Offset to a physical NodeID + DiskAddress? We use a partitioned, highly available Metadata Service (likely backed by Etcd/Zookeeper) and client-side caching to avoid the metadata lookup bottleneck.

Step 3: Fault Tolerance: How does the system react when a Storage Node dies? We use a Placement Service and Health Monitor to detect failures, invalidate old leases, and trigger background re-replication (re-sharding) to restore the durability factor.

Step 4: Snapshotting: How to take a point-in-time backup without stopping I/O? We use Redirect-on-Write (RoW) or Copy-on-Write (CoW) at the block level, pushing frozen data to Object Storage (S3).

Bonus Points

Checksumming & Scrubbing: Implement end-to-end CRC32 checksums to detect "bit rot" and "silent data corruption" during transit and at rest.

Log-Structured Storage Engine: Use an Append-only storage format on the storage nodes (similar to LSM-trees) to turn random writes into sequential writes, maximizing SSD/NVMe lifespan and performance.

NVMe-over-Fabrics (NVMe-oF): Mention utilizing RDMA (Remote Direct Memory Access) to bypass the kernel stack and reduce network latency for high-performance volume classes.

Admission Control: Implement token-bucket rate limiting at the Client Driver level to prevent "noisy neighbor" effects and ensure IOPS/Throughput isolation.

Design Breakdown

Functional Requirements

Core Use Cases:

Provision/Delete volumes of variable sizes.

Attach/Detach volumes to VM instances.

Read/Write fixed-size blocks (Strong Consistency).

Create snapshots and restore volumes.

Scope Control:

In-Scope: Control plane (management), Data plane (I/O), Replication, and Snapshotting.

Out-of-Scope: Filesystem-level logic (handled by Guest OS), Multi-region synchronous replication (too high latency for block storage).

Non-Functional Requirements

Scale: 100k+ active volumes; petabyte-scale storage.

Latency: Single-digit millisecond latency for I/O operations.

Availability & Reliability: 99.99% availability; 11 nines durability.

Consistency: Strict Read-after-Write (Linearizability).

Fault Tolerance: Automatic recovery from disk, node, and rack failures.

Security: Data encryption at rest and in transit (AES-256).

Estimation

Traffic: 100,000 volumes * 1,000 IOPS average = 100M IOPS cluster-wide.

Storage: 100,000 volumes * 1TB average = 100PB raw capacity (300PB with 3x replication).

Bandwidth: 100M IOPS * 4KB/block = 400 GB/s aggregate throughput.

Metadata: Each 4KB block mapping = 32 bytes. For 100PB, mapping is astronomical. Optimization: Manage "Extents" (e.g., 1GB chunks) in the central metadata, and local offsets within Storage Nodes.

Blueprint

Concise Summary: A distributed system consisting of a Control Plane for orchestration and a Data Plane for high-speed I/O. The system uses a Client Driver (on the VM Host) to route I/O to a Primary Storage Node, which replicates to Secondary Nodes via Chain Replication.

Major Components:

Control Plane: Manages volume lifecycle, placement logic, and health monitoring.

Metadata Store: A consistent KV store (Etcd) holding volume-to-node mappings.

Storage Nodes: High-performance servers managing local NVMe/SSD disks and block persistence.

Client Driver: A kernel-level or user-space driver on the VM host that exposes the block device.

Task Queue: Handles asynchronous operations like volume creation and snapshots.

Simplicity Audit: This design avoids complex distributed lock managers by using a single-writer lease model per volume, ensuring consistency without the overhead of multi-node coordination for every I/O.

Architecture Decision Rationale:

Chain Replication: Chosen over Paxos for the data plane to maximize throughput (all nodes use full bandwidth) and simplify the write path.

Extent-based Mapping: Reduces metadata size by 10,000x compared to per-block mapping.

S3 for Snapshots: Leverages cheap, durable object storage for cold data rather than keeping snapshots on expensive NVMe.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Control Plane: Stateless microservices scaled horizontally.

Data Plane: Storage nodes grouped into "Cells" (e.g., 1000 nodes) to limit blast radius.

API Schema Design:

POST /volumes: Create volume (Size, IOPS, Zone).

POST /attachments: Attach volume to InstanceID.

POST /snapshots: Trigger async snapshot.

Protocol: gRPC for internal control plane communication; custom binary protocol for data plane I/O to minimize overhead.

Resilience & Reliability:

Leases: Storage nodes hold a lease from the Metadata Store. If a heartbeat fails, the lease expires, and the Control Plane promotes a new Primary.

Circuit Breaker: The Client Driver will stop sending I/O to a suspected dead Primary and request a new mapping from the Control Plane.

Storage

Access Pattern: 70/30 Read/Write ratio (typical server workload); high burstiness.

Database Table Design (Metadata DB):

Volumes: VolumeID (PK), UserID, Size, State (Creating, Available, In-Use), CreatedAt.

Mappings: VolumeID, PartitionID, NodeList (Primary, Secondary, Tertiary), LeaseExpiry.

Technical Selection:

Metadata: Etcd (Strong consistency needed for lease management and mapping).

Data Nodes: Custom C++/Rust engine using SPDK (Storage Performance Development Kit) to interact with NVMe directly via Zero-copy.

Distribution Logic:

Volumes are split into Extents (e.g., 1GB).

Extents are distributed across nodes based on rack awareness (ensuring no two replicas share the same Top-of-Rack switch).

Cache

Purpose & Justification: Reduce latency for metadata lookups.

Key-Value Schema:

Client Side Cache: Maps VolumeID -> [NodeIP_Primary, NodeIP_Secondary...].

TTL: Short (e.g., 30s) or invalidated via push notification from Control Plane if a failure occurs.

Failure Handling: On cache miss or "Connection Refused" from a storage node, the driver bypasses the cache and hits the Control Plane.

Messaging

Purpose & Decoupling: Asynchronous execution of long-running tasks like volume formatting, zeroing, and snapshot uploading.

Event / Topic Schema:

Topic: storage-tasks.

Payload: {task_type: "SNAPSHOT", volume_id: "vol-123", destination: "s3://bucket/snap-1"}.

Technical Selection: SQS or RabbitMQ. High throughput isn't required here; reliability of task delivery is key.

Data Processing

Processing Model: Snapshot Worker (Background process).

Processing Logic: Reads blocks from the Storage Node, compresses them, and uploads them to S3.

Technical Selection: Custom Go/Rust microservice.

Correctness: Uses "Dirty Bitmaps" on the storage node to track which blocks changed since the last snapshot (Incremental Snapshots).

Wrap Up

Advanced Topics

Trade-offs (PACELC): In the face of a network partition, we choose Consistency (C) over Availability (A). A block storage system cannot allow "split-brain" (two VMs writing to the same block differently), as it would corrupt the filesystem.

Reliability & Failure Handling:

Primary Failure: Detected by heartbeat timeout. The Placement Engine marks the node "Down", elects the first Secondary as the new Primary, and updates the Metadata Store.

Data Scrubbing: Background process reads all blocks periodically, recalculates checksums, and compares them with replicas to fix "bit rot."

Bottleneck Analysis:

Hot Partitions: If one volume is extremely busy, it can saturate a node's 10Gbps/25Gbps NIC. Mitigation: Shard a single large volume across multiple Storage Node sets.

Security:

Isolation: Each VM's Client Driver only has credentials to access its specific VolumeIDs via HMAC-signed requests to Storage Nodes.