The Question

Scalable Object Storage System (S3-compatible)

Design a high-durability, exabyte-scale blob storage system similar to Amazon S3. The system must support basic CRUD operations for unstructured data while ensuring 11 9s of durability and strong consistency for read-after-write. Explain how you would manage massive amounts of metadata, optimize for storage costs at scale, and handle hardware failures or data corruption without service interruption.

NoSQL

Erasure Coding

Cassandra

Reed-Solomon

LSM-tree

RocksDB

HMAC

REST

AES-256

Questions & Insights

Clarifying Questions

Scale of storage and traffic? (Assumption: 10PB of data, 100 billion objects, 10k Write QPS, 50k Read QPS for MVP).

What is the target durability? (Assumption: 11 9s (99.999999999%) is the industry standard for S3, but for MVP we aim for 99.9999% through replication/erasure coding).

Consistency model? (Assumption: Strong consistency for Put/Delete as per modern S3 standards).

Maximum object size? (Assumption: 5GB for a single PUT, multipart upload for anything larger).

Data distribution? (Assumption: Multi-AZ within a single region for the MVP).

Thinking Process

Metadata vs. Data: Decouple object metadata (name, size, owner, permissions) from the actual byte stream (blob) to allow independent scaling.

Durability Strategy: How do we ensure we don't lose data? Use Erasure Coding (e.g., Reed-Solomon) instead of simple replication to save 50% storage space while maintaining high durability.

Data Placement: How to find where a blob is? Use a dedicated Metadata Service backed by a partitioned NoSQL store and a Placement Service to manage disk health.

Progressive flow: Start with a stateless API layer -> Decouple metadata lookup -> Implement a chunk-based storage engine with erasure coding -> Add a background "Scrubber" for data integrity.

Bonus Points

Bit Rot Detection: Implement a background "Scrubber" service that continuously reads data and verifies checksums to detect and repair silent data corruption.

Placement Groups: Awareness of physical topology (Rack, Row, Power Unit) to ensure that a single hardware failure doesn't take out enough erasure-coded shards to cause data loss.

Zero-Copy Transfers: Use sendfile or splice syscalls in the storage nodes to move data from disk to network buffers without CPU-heavy copying.

LSM-tree based Metadata: Use an LSM-tree (like RocksDB) for local metadata on storage nodes to handle high-frequency small writes and updates efficiently.

Design Breakdown

Functional Requirements

Core Use Cases:

CreateBucket(name): Initialize a logical container.

PutObject(bucket, key, data): Store a blob.

GetObject(bucket, key): Retrieve a blob.

DeleteObject(bucket, key): Remove a blob.

ListObjects(bucket, prefix): Search keys within a bucket.

Scope Control:

In-scope: High durability storage, metadata management, basic REST API.

Out-of-scope: CDN integration, versioning, life-cycle policies, cross-region replication.

Non-Functional Requirements

Scale: Support exabyte-scale total capacity and billions of objects.

Latency: Sub-100ms for GET/PUT of small objects (<10MB).

Availability & Reliability: 99.99% availability; high fault tolerance (node/rack failure).

Consistency: Strong consistency for read-after-write.

Security: IAM-based access control and encryption at rest.

Estimation

Traffic:

Writes: 10k QPS * 1MB (avg) = 10 GB/s ingress.

Reads: 50k QPS * 1MB (avg) = 50 GB/s egress.

Storage:

10PB raw data.

Using (6+3) Erasure Coding (1.5x overhead) = 15PB total disk space.

Metadata:

100B objects * 1KB metadata each = 100TB metadata storage.

Blueprint

The architecture follows a decoupled "Metadata-Data" split. The API Gateway handles authentication and routing. The Metadata Service manages object-to-location mapping using a distributed NoSQL store. The Storage Service manages the physical disks, utilizing Erasure Coding to shard data across multiple nodes for durability. A Placement Service acts as the brain, monitoring node health and deciding where new data should live.

Major Components:

API Service: Stateless frontend for RESTful interaction and request validation.

Metadata Service: Key-Value store mapping Bucket+Key to ObjectID and shard locations.

Storage Node: Manages local disk I/O and provides byte-range access to shards.

Placement Service: Tracks cluster topology, disk utilization, and node health.

Scrubber: Background worker for integrity checks and proactive repair.

Simplicity Audit: This design avoids complex global consensus for data by centralizing placement logic in a semi-static service, while allowing metadata to scale horizontally.

Architecture Decision Rationale:

Why?: Erasure coding is preferred over 3x replication for the MVP because at PB scale, the 50% cost saving on disks is massive.

Functional: Satisfies all CRUD operations via simple REST.

Non-functional: NoSQL provides the scalability for metadata; sharding provides durability.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

API nodes are stateless, scaled via CPU/Network metrics.

Multi-AZ deployment: Storage nodes are spread across at least 3 Availability Zones.

API Schema Design:

PUT /:bucket/:key: Returns 201 Created. Includes MD5 checksum in header.

GET /:bucket/:key: Returns 200 OK + Byte stream.

Resilience:

Retries: Client-side retries with exponential backoff for 5xx errors.

Timeout: Strict timeouts for storage node I/O (e.g., 500ms) to prevent head-of-line blocking.

Storage

Access Pattern: High write-once, read-many. Large sequential I/O for big blobs.

Database Table Design (Metadata):

Table: Objects

Partition Key: Hash(BucketName)

Sort Key: ObjectKey

Fields: Size, ETag, Owner, CreationTime, List<ShardID>, List<StorageNodeID>.

Technical Selection:

Metadata: Cassandra or DynamoDB for linear scalability and high availability.

Storage Engine: Custom Bitcask-like engine on Storage Nodes for high-speed append-only writes.

Distribution Logic:

Erasure Coding: Data is split into

k

data shards and

m

parity shards (e.g., 6+3). Any

k

shards can reconstruct the object.

Placement: The Placement Service provides a "Writable Set" of nodes to the API Service. API Service streams shards to nodes in parallel.

Data Processing

Processing Model: The Scrubber Service is the only processing component.

Processing DAG: Scan Metadata -> Check Shard Health on Storage Nodes -> Recalculate Parity if Shard Missing -> Write New Shard.

Technical Selection: Custom Go/Rust service for low-level byte manipulation and high-performance Reed-Solomon calculations.

Wrap Up

Advanced Topics

Trade-offs:

Erasure Coding vs Replication: EC uses less disk but significantly more CPU and network bandwidth during reconstruction. We choose EC for cost-efficiency.

Strong vs Eventual Consistency: We achieve strong consistency by ensuring the Metadata Store is updated only after a quorum of data shards is successfully acknowledged by Storage Nodes.

Reliability:

Fault Tolerance: (6+3) EC allows for the simultaneous loss of any 3 storage nodes or 1 entire AZ (if shards are distributed 3 per AZ).

Bottleneck Analysis:

Metadata Hotspots: Large buckets with millions of keys can create hotspots. Mitigated by using a hierarchical partitioning key or consistent hashing on the object key.

Reconstruction Storm: If a rack fails, the Scrubber might saturate the network while rebuilding data. Implement "Priority-based Reconstruction" to rebuild data with only 1 parity left first.

Security:

All data is encrypted using AES-256 at the Storage Node level before hitting the disk.

Access is governed by an internal Policy Engine that checks IAM permissions on every request.