The Question
Design a Scalable GIF Collection and Sharing System
Design a backend system that allows users to organize GIFs into personal collections. Users must be able to create collections, add or remove GIFs (stored as references), and share these collections with other users either privately or via a public link. The system should handle 10 million daily active users and support high-read volume for shared content. Detail the data model for permissions, the strategy for handling 'viral' shared collections, and how to ensure low-latency performance for global users.
PostgreSQL
Redis
CDN
JWT
UUID
API Gateway
Questions & Insights
Clarifying Questions
Scale & Traffic: What is the expected Daily Active User (DAU) count and how many collections does an average user maintain?
Data Content: Are we hosting the GIF files ourselves, or are we storing references/URLs to external providers (e.g., Giphy, Tenor)?
Sharing Model: Does sharing imply collaborative editing or "view-only" access? Does it require a public link or specific user-to-user permissions?
Search & Discovery: Do users need to search within their collections or discover public collections?
Assumptions for MVP:
DAU: 10 million.
Storage: Reference-based (storing URLs/IDs of GIFs), not the binary blobs, to optimize for MVP costs.
Sharing: Primarily view-only sharing via internal User ID or a unique obfuscated link.
Consistency: Read-your-writes for the owner; eventual consistency for shared viewers is acceptable.
Thinking Process
The core challenge is managing the many-to-many relationship between users, collections, and GIFs at scale while ensuring low-latency access for sharing.
Relational vs. NoSQL: How do we model the ownership and sharing permissions? (Relational is better for the MVP to handle ACID for ownership and complex "Who can see what" queries).
Handling "Hot" Collections: How do we prevent database hotspots when a celebrity shares a collection link that goes viral? (Edge caching and Read Replicas).
Data Modeling for Retrieval: How do we structure the Collection_GIFs table to allow fast pagination and ordering? (Clustered indexing on collection_id).
Security of Sharing: How do we ensure shared links are secure and revocable? (Signed URLs or UUID-based lookup tables).
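The pagination question above can be illustrated with keyset (cursor) pagination over the (collection_id, display_order) index. This is a minimal sketch, not the definitive query: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the function name fetch_page, the cursor convention, and the page size are assumptions made for illustration.

```python
import sqlite3

# Minimal stand-in table; only the columns needed for pagination.
PAGINATION_SCHEMA = """
CREATE TABLE collection_gifs (
    collection_id   TEXT NOT NULL,
    gif_external_id TEXT NOT NULL,
    display_order   INTEGER NOT NULL
);
CREATE INDEX idx_gifs_order ON collection_gifs (collection_id, display_order);
"""


def fetch_page(conn, collection_id, after_order=-1, page_size=2):
    """Return (gif_ids, next_cursor) for rows after the given display_order.

    Keyset pagination seeks directly into the index instead of scanning
    and discarding rows the way OFFSET does.
    """
    rows = conn.execute(
        "SELECT gif_external_id, display_order FROM collection_gifs "
        "WHERE collection_id = ? AND display_order > ? "
        "ORDER BY display_order LIMIT ?",
        (collection_id, after_order, page_size),
    ).fetchall()
    # A full page suggests more rows may follow; a short page is the last one.
    next_cursor = rows[-1][1] if len(rows) == page_size else None
    return [gif_id for gif_id, _ in rows], next_cursor
```

The client passes the returned cursor back as after_order to get the next page. When the last page happens to be exactly full, the cursor leads to one extra empty page, which is a common and usually acceptable trade-off.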
Bonus Points
Denormalized Metadata: Storing the first 3 GIF thumbnails directly in the Collection metadata to render "Collection Covers" in a single query without joining the mapping table.
Bloom Filters: Using Bloom Filters at the application level to quickly check if a GIF already exists in a collection before hitting the database.
Write-Through Caching: Implementing a write-through strategy for the owner's "My Collections" view to ensure zero-latency perception after adding a GIF.
Geographic Locality: Utilizing a globally distributed database (like CockroachDB or Spanner) for collection metadata to ensure low-latency sharing across continents.
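The Bloom filter idea from the list above can be sketched as follows. This is an illustrative, in-process version only: the class name BloomFilter, the helper gif_key, and the sizing (8192 bits, 4 hashes) are assumptions, not values from the design.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "check the database".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def gif_key(collection_id: str, gif_external_id: str) -> str:
    """Hypothetical key format combining collection and GIF identifiers."""
    return f"{collection_id}:{gif_external_id}"
```

A False answer lets the service skip the duplicate-check query; a True answer still has to be confirmed against the database. Because a plain Bloom filter cannot delete entries, removing a GIF leaves its bits set, which only adds false positives and never hides a real duplicate.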
Design Breakdown
Functional Requirements
Core Use Cases:
Users can create/rename/delete collections.
Users can add/remove GIFs (by URL/External ID) to/from collections.
Users can list all their collections.
Users can share a collection with another specific user or via a unique link.
Scope Control:
In-Scope: Collection management, sharing permissions, and metadata storage.
Out-of-Scope: GIF hosting/transcoding, GIF search engine (external API usage assumed), collaborative real-time editing.
Non-Functional Requirements
Scale: Support 100M+ total collections and 5B+ GIF-to-Collection mappings.
Latency: Collection loading should be < 100ms (P95).
Availability & Reliability: 99.99% availability; collection data must not be lost (Durability is priority).
Consistency: Strong consistency for ownership changes; Eventual consistency for shared views.
Security: Private collections must not be accessible without explicit permission or valid share-token.
Estimation
Traffic:
10M DAU.
Read/Write ratio: 20:1.
Write QPS (Create/Add): ~115 (10M * 1 op / 86400).
Read QPS (View): ~2,300 (10M * 20 / 86400). Peak QPS: ~10,000.
Storage:
100M collections * 500 bytes (Metadata) = 50 GB.
5B GIF mappings * 100 bytes (ID + Order) = 500 GB.
Total DB storage: < 1 TB (Very manageable for modern RDBMS).
Bandwidth:
Negligible, as we only transfer IDs and URLs, not binary GIF data.
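The estimates above can be reproduced with a few lines of arithmetic. The inputs are the figures stated in this section; the variable names are only for illustration.

```python
SECONDS_PER_DAY = 86_400

dau = 10_000_000
writes_per_user_per_day = 1   # one create/add operation per user per day
reads_per_write = 20          # 20:1 read/write ratio

write_qps = dau * writes_per_user_per_day / SECONDS_PER_DAY
read_qps = write_qps * reads_per_write

collections = 100_000_000
mappings = 5_000_000_000
metadata_gb = collections * 500 / 1e9   # 500 bytes of metadata per collection
mapping_gb = mappings * 100 / 1e9       # 100 bytes per GIF mapping row
total_gb = metadata_gb + mapping_gb

# The stated peak of ~10,000 QPS implies this multiple of the average read rate.
peak_factor = 10_000 / read_qps

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}")
print(f"storage: {metadata_gb:.0f} GB + {mapping_gb:.0f} GB = {total_gb:.0f} GB")
print(f"peak is about {peak_factor:.1f}x the average read rate")
```

The total of 550 GB is where the "< 1 TB" figure comes from, before indexes and replication overhead are counted.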
Blueprint
The design utilizes a classic Three-Tier Architecture optimized for read-heavy sharing workflows.
Major Components:
API Gateway: Handles authentication and global rate limiting.
Collection Service: A stateless microservice managing the business logic of collection CRUD and sharing permissions.
Relational Database: PostgreSQL is used to maintain strict consistency for permissions and ownership.
Distributed Cache: Redis stores hot collection data and user session/permission checks.
Simplicity Audit: This design avoids complex event-sourcing or NoSQL sharding initially, as the data volume (under 1 TB) fits comfortably within a high-memory RDBMS instance or a simple primary-replica setup.
Architecture Decision Rationale:
Why RDBMS?: Sharing involves "Access Control Lists" (ACLs). Relational databases excel at joins and transactional integrity required for "Who has access to this folder."
Functional Satisfaction: Covers all CRUD and sharing flows via simple relational tables.
Non-functional Satisfaction: Scalability is handled via Read Replicas and Redis; Availability is handled via Multi-AZ deployment.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
CDN: Used to cache the GetCollection responses (JSON) for public/shared collections at edge locations.
Security & Perimeter:
API Gateway: Handles JWT verification.
Rate Limiting: Limits "Create Collection" to 10/minute per user to prevent spam.
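The 10-per-minute limit on collection creation could be enforced with a sliding-window counter per user. The sketch below keeps state in process memory purely for illustration; with several gateway instances the counters would need to live in a shared store such as Redis. The class name SlidingWindowLimiter and its parameters are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(self, limit=10, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable so tests can control time
        self._hits = defaultdict(deque)     # user_id -> timestamps of accepted requests

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A rejected request would typically be answered with HTTP 429 so that clients know to back off.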
Service
Topology & Scaling:
Stateless Services: Collection Service scales horizontally based on CPU/Request count.
Multi-AZ: Instances deployed across 3 Availability Zones for fault tolerance.
API Schema Design:
POST /v1/collections: Create new.
POST /v1/collections/{id}/gifs: Add GIF.
GET /v1/collections/{id}: Fetch GIFs in collection.
POST /v1/collections/{id}/share: Generate share token or assign user permission.
Idempotency:
X-Request-ID header for Add/Remove GIF operations to prevent duplicate entries on retry.
Resilience & Reliability:
Circuit Breakers: Applied to the External GIF Provider API to prevent latency spikes from cascading if Giphy/Tenor is slow.
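A circuit breaker around the external provider call might look like the sketch below. It is a simplified illustration under assumed thresholds (3 consecutive failures, 30-second cool-down); the names CircuitBreaker and CircuitOpenError are hypothetical and not taken from any specific library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a slow provider.
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the Collection Service can still return stored GIF references and simply skip any enrichment that needs the provider.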
Storage
Access Pattern: 95% reads (viewing collections). Primary key lookups by collection_id.
Database Table Design:
Collections: id (UUID), owner_id (FK), title, is_public (Bool), share_token (String, indexed), created_at.
Collection_GIFs: id, collection_id (FK), gif_external_id (String), display_order (Int), added_at. Index on (collection_id, display_order).
Permissions: collection_id, user_id, access_level (Read/Write). Primary Key: (collection_id, user_id).
Technical Selection: PostgreSQL.
Rationale: Support for JSONB (for any future flexible metadata) and strong performance for indexed joins.
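The three tables and the "who can see this collection" check can be sketched as below. The example uses Python's built-in sqlite3 as a stand-in so it runs anywhere; PostgreSQL would use native UUID, BOOLEAN, and TIMESTAMPTZ types instead. The function can_view and the order of its checks are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE collections (
    id          TEXT PRIMARY KEY,            -- UUID stored as text here
    owner_id    TEXT NOT NULL,
    title       TEXT NOT NULL,
    is_public   INTEGER NOT NULL DEFAULT 0,  -- boolean flag
    share_token TEXT UNIQUE,                 -- NULL until a link is generated
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE collection_gifs (
    id              INTEGER PRIMARY KEY,
    collection_id   TEXT NOT NULL REFERENCES collections(id),
    gif_external_id TEXT NOT NULL,
    display_order   INTEGER NOT NULL,
    added_at        TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_gifs_order ON collection_gifs (collection_id, display_order);
CREATE TABLE permissions (
    collection_id TEXT NOT NULL REFERENCES collections(id),
    user_id       TEXT NOT NULL,
    access_level  TEXT NOT NULL CHECK (access_level IN ('read', 'write')),
    PRIMARY KEY (collection_id, user_id)
);
"""


def can_view(conn, collection_id, user_id=None, share_token=None) -> bool:
    """Return True if the caller may read the collection."""
    row = conn.execute(
        "SELECT owner_id, is_public, share_token FROM collections WHERE id = ?",
        (collection_id,),
    ).fetchone()
    if row is None:
        return False
    owner_id, is_public, token = row
    # Public collections and the owner always pass.
    if is_public or (user_id is not None and user_id == owner_id):
        return True
    # A matching share token grants view access without a user grant.
    if share_token is not None and token is not None and share_token == token:
        return True
    if user_id is None:
        return False
    # Otherwise fall back to an explicit per-user grant.
    grant = conn.execute(
        "SELECT 1 FROM permissions WHERE collection_id = ? AND user_id = ?",
        (collection_id, user_id),
    ).fetchone()
    return grant is not None
```

The composite primary key on permissions makes the per-user grant lookup a single index probe, which is the check the perm:{user_id}:{coll_id} cache entry would short-circuit.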
Reliability & Recovery:
Daily snapshots + WAL (Write Ahead Log) archiving to S3 for Point-in-Time Recovery.
Cache
Purpose & Justification: Reduces DB load for frequently viewed collections (e.g., "Trending GIFs" or viral shared links).
Key-Value Schema:
coll:{id} -> JSON blob of GIF IDs. TTL: 30 minutes.
perm:{user_id}:{coll_id} -> Access Level. TTL: 5 minutes.
Failure Handling: If Redis is down, the service falls back to the Read Replica DB.
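The read path for coll:{id}, including the fallback when the cache is unreachable, can be sketched as a cache-aside lookup. TTLCache below is an in-memory stand-in for Redis written for this example, and load_from_db is a hypothetical callable representing the read-replica query; neither is part of the original design.

```python
import json
import time


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._data = {}
        self.available = True   # flip to False to simulate an outage

    def get(self, key):
        if not self.available:
            raise ConnectionError("cache unavailable")
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        if not self.available:
            raise ConnectionError("cache unavailable")
        self._data[key] = (value, self.clock() + ttl_seconds)


COLLECTION_TTL_SECONDS = 30 * 60   # matches the 30-minute TTL above


def get_collection(cache, load_from_db, collection_id):
    """Cache-aside read: try the cache, fall back to the database."""
    key = f"coll:{collection_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except ConnectionError:
        # Cache is down: serve straight from the read replica.
        return load_from_db(collection_id)
    gif_ids = load_from_db(collection_id)
    try:
        cache.set(key, json.dumps(gif_ids), COLLECTION_TTL_SECONDS)
    except ConnectionError:
        pass   # a failed cache write must not fail the request
    return gif_ids
```

For a viral collection, many concurrent misses on the same key can still stampede the database at expiry; request coalescing or a short edge-cache TTL in front of this path would be the usual mitigation.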
Wrap Up
Advanced Topics
Trade-offs: We chose PostgreSQL over DynamoDB. DynamoDB scales horizontally with little operational effort, but it makes "User-to-User Sharing" harder to model without duplicating data or leaning heavily on Global Secondary Indexes. For under 1 TB of data, RDBMS simplicity wins for an MVP.
Reliability: Using Read Replicas ensures that even if the Primary DB is under write-heavy load (e.g., massive batch imports), users can still browse their collections.
Bottleneck Analysis: The primary bottleneck will be the Collection_GIFs table as it grows to billions of rows. Optimization: We can shard this specific table by collection_id if growth exceeds a single node's capacity.
Security: Shareable links use high-entropy UUIDs (Share Tokens) instead of auto-incrementing IDs to prevent ID-enumeration attacks.
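Issuing and revoking share tokens can be sketched as follows. The example uses random version-4 UUIDs from Python's uuid module and keeps the mapping in memory; in the design above this mapping would be the indexed share_token column. The class ShareLinks and its method names are hypothetical.

```python
import uuid


class ShareLinks:
    """Tracks one active share token per collection."""

    def __init__(self):
        self._by_token = {}        # token -> collection_id
        self._by_collection = {}   # collection_id -> token

    def issue(self, collection_id: str) -> str:
        # Issuing a new link invalidates any previous one for this collection.
        self.revoke(collection_id)
        token = uuid.uuid4().hex   # 32 hex characters, 122 random bits
        self._by_token[token] = collection_id
        self._by_collection[collection_id] = token
        return token

    def revoke(self, collection_id: str) -> None:
        token = self._by_collection.pop(collection_id, None)
        if token is not None:
            del self._by_token[token]

    def resolve(self, token: str):
        """Return the collection ID for a token, or None if unknown or revoked."""
        return self._by_token.get(token)
```

Because the token is looked up rather than decoded, revocation takes effect as soon as the row changes, subject to the cache and CDN TTLs on any responses already served.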