The Question
DesignObject Storage System
Design a large-scale object storage service similar to AWS S3. The system should provide durable and highly available blob storage, support versioning and access control policies, deliver objects via a global CDN, and handle exabytes of data across multiple availability zones.
Consistent Hashing
Erasure Coding
LSM-Tree
Distributed KV Store
Async Task Queue
Questions & Insights
Thinking Process
To design a massive-scale blob storage system like S3, we focus on the decoupling of Control Plane (Metadata) and Data Plane (Actual Bits).
How do we handle multi-petabyte scale? We separate metadata (filenames, permissions, locations) from data (actual bytes) to allow independent scaling.
How do we guarantee durability? We implement synchronous replication across multiple "Availability Zones" or failure domains before acknowledging a write.
How do we find a specific file among billions? We use a distributed Key-Value store for metadata indexing, keyed by the
Bucket + ObjectKey.How do we handle large file uploads efficiently? We implement "Multipart Uploads," breaking files into chunks that are uploaded in parallel and reassembled logically.
Bonus Points
Erasure Coding (Staff Level): Moving from simple 3x replication to Reed-Solomon erasure coding to reduce storage overhead from 300% to ~150% while maintaining higher durability.
Bitrot Protection: Implementing background "Scrubbers" that periodically verify checksums of cold data to detect and repair silent data corruption.
LSM-Tree Metadata Indexing: Using an LSM-tree based storage (like RocksDB or DynamoDB) for metadata to handle the high-write throughput of object creation.
Request Hedging: For "Get" requests, if a storage node is slow (p99 latency), the system sends a parallel request to a replica to ensure low latency.
Design Breakdown
Functional Requirements
PutObject: Upload a file (up to 5GB for single PUT, TBs for Multipart).
GetObject: Download a file via its Key.
DeleteObject: Mark a file for deletion.
ListObjects: List keys within a bucket (prefix-based).
Versioning: Maintain multiple versions of an object (Optional but standard).
Non-Functional Requirements
High Durability: Aiming for 99.999999999% (11 9s).
High Availability: 99.99% availability.
Scalability: Support billions of objects and exabytes of data.
Strong Consistency: Provide "Read-after-Write" consistency for new objects.
Estimation
Daily Active Users: 10M.
Write/Read Ratio: 1:10 (Read heavy).
Average Object Size: 1MB.
Ingest Rate: 100k objects/sec.
Throughput: 100k * 1MB = 100 GB/s.
Storage (1 year): 100k 86400 365 * 1MB ≈ 3.1 Exabytes.
Metadata Storage: 1KB per metadata entry * 3 Trillion objects ≈ 3PB (Requires a massive distributed KV store).
Blueprint
Concise Summary: A distributed architecture where a "Front-End" service handles requests, stores file metadata in a distributed NoSQL DB, and streams the raw bytes to "Storage Nodes" that write directly to disk.
Major Components:
API Front-End: Stateless workers handling authentication, rate limiting, and request routing.
Metadata Store: A distributed KV store (e.g., DynamoDB or Cassandra) for fast object-to-physical-location mapping.
Storage Nodes (Blob Store): Large clusters of servers with high-density disks that store raw data chunks.
Metadata Cache: Fast lookups for hot object metadata to reduce DB load.
Garbage Collection (Messaging): Async processing for physical deletion of data.
Simplicity Audit: For the MVP, we use 3x replication instead of Erasure Coding. This avoids complex parity calculations and allows simple "success if 2 of 3 nodes write" logic.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: API Front-Ends are stateless and horizontally scaled behind an Anycast LB.
API Spec:
PUT /bucket/key: Streams body to storage nodes. Calculates MD5 checksum on the fly.GET /bucket/key: Retrieves metadata, identifies storage node location, and streams bytes to client.Multipart Upload: Client initiates a session -> receives
UploadID -> Uploads parts with PartNumber -> Finalize call merges the parts logically in the Metadata DB.Storage
Data Model (Metadata):
Partition Key:
BucketNameSort Key:
ObjectKeyAttributes:
Size, ETag (MD5), CreationTime, StorageNodeIDs (List of Chunks).Database Logic: Metadata DB must support strong consistency. We use a consensus-based KV store (like Etcd for coordination or DynamoDB with consistent reads).
Physical Storage: Objects are split into chunks (e.g., 64MB). Each chunk is stored as a file on the local filesystem of a Storage Node.
Cache
Included in Diagram: Yes.
Detail: Redis is used to cache "Hot Metadata" (Object Location and Permissions). Since objects in S3 are immutable (updates create new versions), the cache invalidation strategy is simple: Write-through or Time-to-Live (TTL). This significantly reduces the latency for
GetObject and HeadObject requests.Messaging
Included in Diagram: Yes.
Detail: An AWS SQS-style message queue handles "Garbage Collection" tasks. When a user calls
DeleteObject, we only mark it "deleted" in the Metadata DB (logical delete). A message is sent to the Queue, which the Garbage Collector consumes to physically unlink files from disks on the Storage Nodes.Wrap Up
Advanced Topics
Trade-offs:
Replication vs Erasure Coding: We chose 3x replication for the MVP. It uses 3x the disk space but simplifies the write path and recovery logic (no CPU-intensive decoding).
Consistency: We choose Strong Consistency for PUTs. This slightly increases write latency as we must wait for a quorum (2/3) of nodes, but it prevents "stale read" bugs for developers.
Bottlenecks:
Metadata Hotkeys: If one bucket has millions of requests, the metadata partition might become a hotspot. We solve this by adding a "Sharding Prefix" to the internal metadata keys.
Failure Handling:
Node Failure: If a storage node dies, a "Replication Manager" detects the under-replicated chunks and initiates background copying from remaining replicas to a new node.
Alternatives & Optimization:
Optimization: Use "Direct I/O" on storage nodes to bypass the OS page cache, as the API service handles its own caching, avoiding redundant memory usage.
Alternative: For ultra-cold data, we could use Tape Drives or SMR (Shingled Magnetic Recording) disks to save costs.