The Question
Design

Design a Real-time Collaborative Document Editor

Design a scalable, real-time collaborative editing system similar to Google Docs. The system must support millions of users, provide sub-100ms latency for keystroke synchronization, and handle concurrent edits gracefully using a robust conflict resolution strategy. Focus on the core mechanics of synchronization, document versioning, and the challenges of maintaining a consistent state across distributed clients while ensuring high availability and durability.
WebSockets
Operational Transformation
Redis
Cassandra
PostgreSQL
PubSub
Anycast
Questions & Insights

Clarifying Questions

Scale: What is the expected Daily Active Users (DAU) and peak concurrent users on a single document?
Assumption: 100M DAU. Up to 100 concurrent editors per document.
Conflict Resolution: Should we support offline editing and reconciliation?
Assumption: Yes, but for the MVP, focus on real-time online collaboration with a centralized server as the single source of truth for versioning.
Rich Media: Does this include images, videos, and complex formatting?
Assumption: Focus on text editing and basic formatting (bold, italic) for the MVP.
Latency: What is the target "keystroke-to-screen" latency for other users?
Assumption: Under 100ms for a seamless collaborative experience.
Consistency: Is strong consistency required for the document state?
Assumption: Eventual consistency across users is acceptable, but the central server must maintain a linearizable history of operations.

Thinking Process

Concurrency Control: How do we handle two people typing in the same spot simultaneously? We will use Operational Transformation (OT) to resolve conflicts server-side.
Real-time Sync: How do we push updates to users? WebSockets provide the bi-directional, low-latency pipe needed for keystroke synchronization.
Document Versioning: How do we ensure everyone stays in sync? Every operation is tagged with a Revision Number. The server acts as the sequencer.
Presence: How do we show who else is in the document? A lightweight Presence Service tracks active users with heartbeats and a distributed cache.
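To make the concurrency-control step concrete, here is a minimal sketch of a pairwise transform for single-character operations. The Op type, the site tie-breaker field, and the function names are illustrative assumptions rather than part of any particular OT library; a production engine would also handle ranges and formatting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str       # "insert" or "delete"
    pos: int        # index in the document the op refers to
    char: str = ""  # character to insert (unused for delete)
    site: int = 0   # hypothetical tie-breaker, e.g. a numeric user id


def apply(text: str, op: Optional[Op]) -> str:
    """Apply one op to the text; None means the op became a no-op."""
    if op is None:
        return text
    if op.kind == "insert":
        return text[:op.pos] + op.char + text[op.pos:]
    return text[:op.pos] + text[op.pos + 1:]


def transform(a: Op, b: Op) -> Optional[Op]:
    """Rewrite `a` so it can be applied after concurrent op `b`."""
    if a.kind == "insert" and b.kind == "insert":
        # An earlier insert (or a tie won by the lower site id) shifts us right.
        if b.pos < a.pos or (b.pos == a.pos and b.site < a.site):
            return Op("insert", a.pos + 1, a.char, a.site)
        return a
    if a.kind == "insert" and b.kind == "delete":
        if b.pos < a.pos:
            return Op("insert", a.pos - 1, a.char, a.site)
        return a
    if a.kind == "delete" and b.kind == "insert":
        if b.pos <= a.pos:
            return Op("delete", a.pos + 1, site=a.site)
        return a
    # delete vs. delete
    if b.pos < a.pos:
        return Op("delete", a.pos - 1, site=a.site)
    if b.pos == a.pos:
        return None  # both sides removed the same character
    return a
```

The property to check is convergence: applying a then transform(b, a) must give the same text as applying b then transform(a, b).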

Bonus Points

OT vs. CRDT: Deep understanding of why Google Docs uses OT (centralized coordination, simpler client logic) vs. why Figma uses a CRDT-inspired approach (CRDT: Conflict-free Replicated Data Type) for its more complex object-tree documents.
Differential Synchronization: Discussing the trade-offs of sending full diffs vs. granular operations to save bandwidth and improve performance on high-latency networks.
Snapshotting Strategy: To avoid replaying millions of operations to open an old doc, we store periodic snapshots (checkpoints) and only apply operations since the last snapshot.
Intelligent Edge Routing: Using Anycast and global WebSocket termination to connect each user to the closest data center, reducing round-trip time for the TCP/TLS handshake and for every frame.
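The snapshotting strategy can be sketched as follows. This is an in-memory illustration only: DocumentStore, SNAPSHOT_EVERY, and the op dictionary layout are assumed names, and a real system would keep snapshots and operations in durable storage.

```python
SNAPSHOT_EVERY = 3  # hypothetical checkpoint interval, tiny for illustration


def apply_op(text, op):
    """Apply a single-character insert or delete to the text."""
    if op["op"] == "insert":
        return text[:op["pos"]] + op["char"] + text[op["pos"]:]
    return text[:op["pos"]] + text[op["pos"] + 1:]


class DocumentStore:
    def __init__(self):
        self.ops = []             # ops[i] produces revision i + 1
        self.snapshots = {0: ""}  # revision -> full text at that revision

    def append(self, op):
        """Append an op and checkpoint every SNAPSHOT_EVERY revisions."""
        self.ops.append(op)
        rev = len(self.ops)
        if rev % SNAPSHOT_EVERY == 0:
            text, _ = self.load(rev)
            self.snapshots[rev] = text
        return rev

    def load(self, rev=None):
        """Return (text, replayed_op_count) for the requested revision."""
        rev = len(self.ops) if rev is None else rev
        base = max(r for r in self.snapshots if r <= rev)
        text = self.snapshots[base]
        replay = self.ops[base:rev]  # only ops after the nearest snapshot
        for op in replay:
            text = apply_op(text, op)
        return text, len(replay)
```

Loading then costs at most SNAPSHOT_EVERY - 1 replayed operations regardless of how long the document's history is.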
Design Breakdown

Functional Requirements

Core Use Cases:
Create and edit text documents.
Real-time collaborative editing (multiple users).
Track user presence (who is currently viewing/editing).
Save document state and version history.
Scope Control:
In-Scope: Text editing, concurrency control, basic formatting, real-time presence.
Out-of-Scope: Offline-first conflict resolution (P2P), complex media embedding, comment threads, fine-grained ACL/Permissions (keep it simple for MVP).

Non-Functional Requirements

Scale: Support millions of concurrent documents.
Latency: Keystrokes must be broadcast to other editors in under 100ms.
Availability: 99.99% (highly available for editing).
Consistency: Eventually consistent across clients; strongly consistent sequence of operations on the server.
Durability: Edits must be persisted to disk; no loss of data after the server acknowledges the operation.

Estimation

Traffic:
100M DAU.
Average user makes 100 edits/day.
Total Writes: 10B operations/day ≈ 115k QPS.
Peak QPS: ~300k QPS.
Storage:
1 edit ≈ 100 bytes (metadata + character).
10B operations/day * 100 bytes = 1 TB/day.
1 year = 365 TB (excluding snapshots).
Bandwidth:
Outgoing (to clients): 300k QPS * 10 users/doc * 100 bytes ≈ 300 MB/s.
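The arithmetic above can be reproduced in a few lines. The constant names are illustrative; the peak QPS and the fan-out of 10 users per document are taken as given from the estimate rather than derived.

```python
DAU = 100_000_000
EDITS_PER_USER_PER_DAY = 100
BYTES_PER_EDIT = 100
SECONDS_PER_DAY = 86_400
PEAK_QPS = 300_000  # assumed peak, roughly 2.6x the average
FANOUT = 10         # assumed average recipients per operation

ops_per_day = DAU * EDITS_PER_USER_PER_DAY                # 10 billion operations
avg_qps = ops_per_day / SECONDS_PER_DAY                   # about 115.7k
storage_tb_per_day = ops_per_day * BYTES_PER_EDIT / 1e12  # decimal terabytes
storage_tb_per_year = storage_tb_per_day * 365
egress_mb_per_s = PEAK_QPS * FANOUT * BYTES_PER_EDIT / 1e6

print(f"avg QPS      : {avg_qps:,.0f}")
print(f"storage/day  : {storage_tb_per_day:.1f} TB")
print(f"storage/year : {storage_tb_per_year:.0f} TB")
print(f"peak egress  : {egress_mb_per_s:.0f} MB/s")
```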

Blueprint

Concise Summary: The system utilizes a WebSocket-based architecture for real-time communication, centered around an Operational Transformation (OT) engine that sequences and transforms operations to maintain document consistency.
Major Components:
WebSocket Gateway: Maintains persistent connections to clients for low-latency operation broadcasting.
OT Service: The brain of the system; it receives operations, transforms them against concurrent edits, and assigns each one the next revision number for its document.
Document Service: Handles metadata (titles, permissions) and initial document loading.
Presence Service: Tracks active users in a document using ephemeral storage.
Simplicity Audit: This architecture avoids the complexity of decentralized CRDTs by using a central sequencer (OT Service), which is easier to debug and ensures a single source of truth for document history.
Architecture Decision Rationale:
OT Service: Chosen for text-heavy collaboration because it handles overlapping edits efficiently and results in smaller payload sizes than state-based sync.
WebSockets: A persistent, full-duplex connection avoids per-request HTTP overhead, which the sub-100ms latency target requires.
Relational Metadata + NoSQL Operations: PostgreSQL handles structured metadata, while a high-throughput NoSQL store (like Cassandra/DynamoDB) stores the append-only log of operations.
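A rough sketch of the OT Service's sequencing role, limited to inserts for brevity. DocumentSequencer and its fields are hypothetical names, and the sketch assumes each client keeps at most one unacknowledged operation in flight, so an operation's base revision already reflects that client's earlier edits.

```python
class DocumentSequencer:
    """Single-writer sequencer for one document (insert-only sketch)."""

    def __init__(self, text=""):
        self.text = text
        self.log = []  # log[i] is the committed op that produced revision i + 1

    @property
    def revision(self):
        return len(self.log)

    def submit(self, op):
        """Transform op against ops committed since op["rev"], then commit it."""
        if not 0 <= op["rev"] <= self.revision:
            raise ValueError("unknown base revision")
        pos = op["pos"]
        for committed in self.log[op["rev"]:]:
            # Ties go to the already-committed insert, which keeps its place.
            if committed["pos"] <= pos:
                pos += 1
        self.text = self.text[:pos] + op["char"] + self.text[pos:]
        committed_op = {**op, "pos": pos, "rev": self.revision + 1}
        self.log.append(committed_op)
        return committed_op  # this is what gets broadcast to other clients
```

Two clients that both edit "helo" at revision 0 illustrate the flow: the first insert commits as revision 1, and the second is shifted one position to the right before committing as revision 2.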

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a Global Load Balancer with Anycast to route users to the nearest WebSocket Gateway.
Security: SSL/TLS termination at the Gateway. Rate limiting per UserID to stop a buggy or malicious client from flooding a document with edits (e.g., an "infinite loop" of operations).

Service

Topology & Scaling:
WebSocket Gateway: Horizontally scaled; holds only connection state, never document state, so any node can serve any client. Uses Pub/Sub to route messages between gateways if users on the same doc are connected to different servers.
OT Service: Partitioned by DocID. All edits for a specific document are processed by a single leader or a consistent hashing group to ensure strict ordering.
API Schema Design:
POST /v1/doc: Create document.
GET /v1/doc/{id}: Load the initial state (latest snapshot plus any operations committed after it).
WebSocket Message: { op: "insert", char: "a", pos: 10, rev: 42 }.
Resilience: Use a "Back-pressure" mechanism. If a client is too slow to receive updates, the server drops the connection and forces a full re-sync.
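The back-pressure rule can be sketched with a bounded per-client send queue. ClientChannel and max_pending are assumed names for illustration; a real gateway would tie this to the socket's write buffer.

```python
from collections import deque


class ClientChannel:
    """Outbound queue for one WebSocket client with a hard size limit."""

    def __init__(self, max_pending=3):
        self.pending = deque()
        self.max_pending = max_pending
        self.connected = True

    def enqueue(self, message):
        """Queue a message; drop the client if it has fallen too far behind."""
        if not self.connected:
            return False
        if len(self.pending) >= self.max_pending:
            # Too slow: discard the backlog and force a reconnect + full re-sync.
            self.pending.clear()
            self.connected = False
            return False
        self.pending.append(message)
        return True

    def drain(self, n):
        """Simulate the socket flushing up to n queued messages."""
        sent = []
        while self.pending and len(sent) < n:
            sent.append(self.pending.popleft())
        return sent
```

Dropping the connection bounds server memory per client, at the cost of a heavier reload for the slow client when it reconnects.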

Storage

Access Pattern: Heavy writes (every keystroke) and heavy reads (loading doc).
Database Table Design:
Documents (SQL): doc_id (PK), owner_id, title, created_at, last_snapshot_id.
Operations (NoSQL): doc_id (Partition), revision (Sort), user_id, op_data.
Technical Selection:
Cassandra: Ideal for the Operations table due to its high write throughput and linear scalability.
PostgreSQL: For metadata where relational integrity and complex queries (e.g., "my documents") are needed.
Distribution Logic: Shard Operations by doc_id. This ensures all ops for one doc are co-located, allowing for fast range scans when catching up from a revision.
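The access pattern on the Operations table can be modeled in memory as below. This is a stand-in for illustration, not Cassandra's API: OperationLog is a hypothetical class that mimics a doc_id partition with rows sorted by revision, plus an append that refuses to overwrite an existing revision.

```python
import bisect
from collections import defaultdict


class OperationLog:
    def __init__(self):
        self._revisions = defaultdict(list)  # doc_id -> sorted revision numbers
        self._ops = {}                       # (doc_id, revision) -> (user_id, op_data)

    def append(self, doc_id, revision, user_id, op_data):
        """Insert only if the revision slot is free, so retries are harmless."""
        if (doc_id, revision) in self._ops:
            return False
        bisect.insort(self._revisions[doc_id], revision)
        self._ops[(doc_id, revision)] = (user_id, op_data)
        return True

    def ops_since(self, doc_id, revision):
        """Range scan: every op in this doc with a revision above the given one."""
        revs = self._revisions.get(doc_id, [])
        start = bisect.bisect_right(revs, revision)
        return [(r, *self._ops[(doc_id, r)]) for r in revs[start:]]
```

Because all rows for one document live in one partition and are ordered by revision, catching a client up from revision N is a single contiguous read.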

Cache

Purpose & Justification: Redis is used for Presence Service (storing {doc_id: [user_1, user_2]}) and Hot Snapshots (storing the current text state of active documents to avoid replaying logs from disk).
Key-Value Schema:
presence:doc_123 -> Set of user_ids (TTL 30s).
snapshot:doc_123 -> {text: "...", rev: 500}.
Failure Handling: If Redis fails, presence is temporarily lost (graceful degradation). Snapshots can be recomputed from the Ops Store.
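A minimal in-memory sketch of heartbeat-based presence with a 30-second TTL. PresenceTracker is a hypothetical class; it expires each user individually, which differs slightly from a single key-level TTL on the whole set, and in production the state would live in Redis rather than process memory.

```python
import time


class PresenceTracker:
    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock    # injectable so tests can control time
        self._last_seen = {}  # doc_id -> {user_id: last heartbeat timestamp}

    def heartbeat(self, doc_id, user_id):
        """Record that the user is still connected to the document."""
        self._last_seen.setdefault(doc_id, {})[user_id] = self.clock()

    def active_users(self, doc_id):
        """Return users whose last heartbeat is within the TTL, pruning the rest."""
        now = self.clock()
        users = self._last_seen.get(doc_id, {})
        expired = [u for u, ts in users.items() if now - ts > self.ttl]
        for user_id in expired:
            del users[user_id]
        return set(users)
```

A client that stops sending heartbeats simply ages out, so no explicit "leave" message is needed when a connection dies.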

Messaging

Purpose & Decoupling: A Pub/Sub layer (Redis Pub/Sub or NATS) broadcasts operations between WebSocket Gateway nodes.
Throughput: High frequency, low latency.
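Technical Selection: Redis Pub/Sub for MVP due to ultra-low latency and simplicity. Its delivery is at-most-once, so a gateway that sees a gap in revision numbers should fetch the missing operations from the Ops Store.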
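The fan-out between gateways can be sketched with an in-process broker standing in for the real Pub/Sub system. Broker, Gateway, and the doc:{id} channel naming are assumptions made for illustration.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a Pub/Sub system."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self._subscribers[channel]):
            callback(message)


class Gateway:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.clients = defaultdict(dict)  # doc_id -> {user_id: inbox list}

    def connect(self, doc_id, user_id):
        """Register a client; subscribe to the doc channel on first local client."""
        first_local_client = doc_id not in self.clients
        inbox = []
        self.clients[doc_id][user_id] = inbox
        if first_local_client:
            self.broker.subscribe(
                f"doc:{doc_id}", lambda msg, d=doc_id: self._deliver(d, msg)
            )
        return inbox

    def _deliver(self, doc_id, message):
        """Forward a committed op to every local client except its author."""
        for user_id, inbox in self.clients[doc_id].items():
            if user_id != message["user"]:
                inbox.append(message)
```

Each gateway subscribes once per document it serves, so one publish reaches every gateway with interested clients regardless of where the author is connected.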
Wrap Up

Advanced Topics

OT vs CRDT Trade-off: OT requires a central server to coordinate the "Canonical Version". This makes the server a bottleneck but simplifies the client (the client doesn't need to store the entire history). CRDT is better for local-first apps but has higher metadata overhead per character.
Reliability: If the OT Service node for a specific DocID fails, we use a consistent hashing ring to reassign the doc to a new node, which then loads the latest ops from Cassandra.
Bottleneck Analysis: A single document with 10,000+ editors (e.g., a viral doc) will saturate a single OT processor.
Optimization: Implement "Read-only" modes or sampling for presence when the user count exceeds a threshold.
Security: Use JWTs for WebSocket authentication. Ensure that the OT Service validates that a user has WRITE access to a DocID before transforming their operation.
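The reassignment described under Reliability can be sketched with a small consistent hashing ring. HashRing, the virtual-node count, and the node names are illustrative assumptions; the point is that removing a failed node moves only the documents it owned.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place several virtual points per node to even out the load."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        """Drop a failed node; its documents fall to the next points clockwise."""
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def owner(self, doc_id):
        """Return the node responsible for sequencing this document."""
        if not self._ring:
            raise LookupError("no OT nodes available")
        idx = bisect.bisect_right(self._ring, (_hash(doc_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

After a removal, the new owner would rebuild the document state from the latest snapshot and the operations stored after it before accepting new edits.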