The Question
Design

Scalable Object Storage System (S3-compatible)

Design a high-durability, exabyte-scale blob storage system similar to Amazon S3. The system must support basic CRUD operations for unstructured data while ensuring 11 9s of durability and strong consistency for read-after-write. Explain how you would manage massive amounts of metadata, optimize for storage costs at scale, and handle hardware failures or data corruption without service interruption.
NoSQL
Erasure Coding
Cassandra
Reed-Solomon
LSM-tree
RocksDB
HMAC
REST
AES-256
Questions & Insights

Clarifying Questions

Scale of storage and traffic? (Assumption: 10PB of data, 100 billion objects, 10k Write QPS, 50k Read QPS for MVP).
What is the target durability? (Assumption: 11 9s (99.999999999%) is the industry standard for S3, but for MVP we aim for 99.9999% through replication/erasure coding).
Consistency model? (Assumption: Strong consistency for Put/Delete as per modern S3 standards).
Maximum object size? (Assumption: 5GB for a single PUT, multipart upload for anything larger).
Data distribution? (Assumption: Multi-AZ within a single region for the MVP).

Thinking Process

Metadata vs. Data: Decouple object metadata (name, size, owner, permissions) from the actual byte stream (blob) to allow independent scaling.
Durability Strategy: How do we ensure we don't lose data? Use Erasure Coding (e.g., Reed-Solomon) instead of simple replication to save 50% storage space while maintaining high durability.
Data Placement: How to find where a blob is? Use a dedicated Metadata Service backed by a partitioned NoSQL store and a Placement Service to manage disk health.
Progressive flow: Start with a stateless API layer -> Decouple metadata lookup -> Implement a chunk-based storage engine with erasure coding -> Add a background "Scrubber" for data integrity.

Bonus Points

Bit Rot Detection: Implement a background "Scrubber" service that continuously reads data and verifies checksums to detect and repair silent data corruption.
Placement Groups: Awareness of physical topology (Rack, Row, Power Unit) to ensure that a single hardware failure doesn't take out enough erasure-coded shards to cause data loss.
Zero-Copy Transfers: Use sendfile or splice syscalls in the storage nodes to move data from disk to network buffers without CPU-heavy copying.
LSM-tree based Metadata: Use an LSM-tree (like RocksDB) for local metadata on storage nodes to handle high-frequency small writes and updates efficiently.
Design Breakdown

Functional Requirements

Core Use Cases:
CreateBucket(name): Initialize a logical container.
PutObject(bucket, key, data): Store a blob.
GetObject(bucket, key): Retrieve a blob.
DeleteObject(bucket, key): Remove a blob.
ListObjects(bucket, prefix): Search keys within a bucket.
Scope Control:
In-scope: High durability storage, metadata management, basic REST API.
Out-of-scope: CDN integration, versioning, life-cycle policies, cross-region replication.

Non-Functional Requirements

Scale: Support exabyte-scale total capacity and billions of objects.
Latency: Sub-100ms for GET/PUT of small objects (<10MB).
Availability & Reliability: 99.99% availability; high fault tolerance (node/rack failure).
Consistency: Strong consistency for read-after-write.
Security: IAM-based access control and encryption at rest.

Estimation

Traffic:
Writes: 10k QPS * 1MB (avg) = 10 GB/s ingress.
Reads: 50k QPS * 1MB (avg) = 50 GB/s egress.
Storage:
10PB raw data.
Using (6+3) Erasure Coding (1.5x overhead) = 15PB total disk space.
Metadata:
100B objects * 1KB metadata each = 100TB metadata storage.

Blueprint

The architecture follows a decoupled "Metadata-Data" split. The API Gateway handles authentication and routing. The Metadata Service manages object-to-location mapping using a distributed NoSQL store. The Storage Service manages the physical disks, utilizing Erasure Coding to shard data across multiple nodes for durability. A Placement Service acts as the brain, monitoring node health and deciding where new data should live.
Major Components:
API Service: Stateless frontend for RESTful interaction and request validation.
Metadata Service: Key-Value store mapping Bucket+Key to ObjectID and shard locations.
Storage Node: Manages local disk I/O and provides byte-range access to shards.
Placement Service: Tracks cluster topology, disk utilization, and node health.
Scrubber: Background worker for integrity checks and proactive repair.
Simplicity Audit: This design avoids complex global consensus for data by centralizing placement logic in a semi-static service, while allowing metadata to scale horizontally.
Architecture Decision Rationale:
Why?: Erasure coding is preferred over 3x replication for the MVP because at PB scale, the 50% cost saving on disks is massive.
Functional: Satisfies all CRUD operations via simple REST.
Non-functional: NoSQL provides the scalability for metadata; sharding provides durability.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
API nodes are stateless, scaled via CPU/Network metrics.
Multi-AZ deployment: Storage nodes are spread across at least 3 Availability Zones.
API Schema Design:
PUT /:bucket/:key: Returns 201 Created. Includes MD5 checksum in header.
GET /:bucket/:key: Returns 200 OK + Byte stream.
Resilience:
Retries: Client-side retries with exponential backoff for 5xx errors.
Timeout: Strict timeouts for storage node I/O (e.g., 500ms) to prevent head-of-line blocking.

Storage

Access Pattern: High write-once, read-many. Large sequential I/O for big blobs.
Database Table Design (Metadata):
Table: Objects
Partition Key: Hash(BucketName)
Sort Key: ObjectKey
Fields: Size, ETag, Owner, CreationTime, List<ShardID>, List<StorageNodeID>.
Technical Selection:
Metadata: Cassandra or DynamoDB for linear scalability and high availability.
Storage Engine: Custom Bitcask-like engine on Storage Nodes for high-speed append-only writes.
Distribution Logic:
Erasure Coding: Data is split into k data shards and m parity shards (e.g., 6+3). Any k shards can reconstruct the object.
Placement: The Placement Service provides a "Writable Set" of nodes to the API Service. API Service streams shards to nodes in parallel.

Data Processing

Processing Model: The Scrubber Service is the only processing component.
Processing DAG: Scan Metadata -> Check Shard Health on Storage Nodes -> Recalculate Parity if Shard Missing -> Write New Shard.
Technical Selection: Custom Go/Rust service for low-level byte manipulation and high-performance Reed-Solomon calculations.
Wrap Up

Advanced Topics

Trade-offs:
Erasure Coding vs Replication: EC uses less disk but significantly more CPU and network bandwidth during reconstruction. We choose EC for cost-efficiency.
Strong vs Eventual Consistency: We achieve strong consistency by ensuring the Metadata Store is updated only after a quorum of data shards is successfully acknowledged by Storage Nodes.
Reliability:
Fault Tolerance: (6+3) EC allows for the simultaneous loss of any 3 storage nodes or 1 entire AZ (if shards are distributed 3 per AZ).
Bottleneck Analysis:
Metadata Hotspots: Large buckets with millions of keys can create hotspots. Mitigated by using a hierarchical partitioning key or consistent hashing on the object key.
Reconstruction Storm: If a rack fails, the Scrubber might saturate the network while rebuilding data. Implement "Priority-based Reconstruction" to rebuild data with only 1 parity left first.
Security:
All data is encrypted using AES-256 at the Storage Node level before hitting the disk.
Access is governed by an internal Policy Engine that checks IAM permissions on every request.