The Question
Design: Real-time Collaborative Document Editor
Design a system similar to Google Docs that enables multiple users to edit the same document simultaneously in real-time. The system must handle conflict resolution, maintain document history, and show user presence (who is currently online). Focus on low-latency synchronization (under 100ms) and high availability for millions of concurrent documents. Discuss the trade-offs between different consistency models and how you would scale the synchronization engine.
WebSockets
Operational Transformation
Redis
Cassandra
S3
NoSQL
JWT
Questions & Insights
Clarifying Questions
Scale and Concurrency: What is the expected scale in terms of Daily Active Users (DAU) and the maximum number of concurrent editors on a single document?
Conflict Resolution Model: Should we prioritize a centralized Operational Transformation (OT) approach (like Google Docs) or a decentralized Conflict-free Replicated Data Type (CRDT) approach?
Content Types: Is the MVP limited to plain text/rich text, or must it support embedded media, tables, and complex formatting immediately?
Offline Support: Is offline editing with asynchronous reconciliation a requirement for the MVP?
Versioning: Do we need a full version history/audit log, or just the current state of the document?
Assumptions:
Scale: 10M DAU; peak concurrent editors per document is 50.
Latency: Sub-100ms end-to-end latency for a seamless "real-time" feel.
Conflict Resolution: Centralized Operational Transformation (OT) will be used for simpler server-side ordering and lower client-side complexity.
Scope: Focus on rich-text collaboration, document persistence, and user presence (who is online).
Thinking Process
Core Bottleneck: The primary challenge is the "C" in CAP—maintaining a consistent document state across multiple users with high-frequency updates while minimizing perceived latency.
Step 1: How do we synchronize changes? Use WebSockets for bi-directional, low-latency communication of "Operations" (Insert/Delete).
Step 2: How do we resolve conflicts? Implement a central OT Engine that sequences operations and transforms them relative to the concurrent changes already applied to the server's master version.
Step 3: How do we handle performance? Use Document Snapshotting to prevent clients from having to replay thousands of operations to load the current state.
Step 4: How do we scale? Implement Session Stickiness or a distributed pub-sub (Redis) to ensure all editors of the same document are connected to the same processing context or coordinated efficiently.
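Step 2 can be made concrete with a minimal sketch of a transform function. It assumes single-character deletes and breaks ties between same-position inserts by user ID; the Op class and the function names are illustrative, not part of the design above. A production OT engine would also cover ranges and formatting operations.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Op:
    type: str          # 'insert' or 'delete'
    pos: int
    val: str = ""      # inserted text; a delete removes one character at pos
    user_id: str = ""  # used only to break ties between same-position inserts


def apply_op(doc: str, op: Optional[Op]) -> str:
    if op is None:                      # op was cancelled by transformation
        return doc
    if op.type == "insert":
        return doc[:op.pos] + op.val + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, applied: Op) -> Optional[Op]:
    """Rewrite `op` so it keeps its intent after `applied` has already run."""
    if applied.type == "insert":
        shift = len(applied.val)
        if op.type == "insert":
            stays_left = op.pos < applied.pos or (
                op.pos == applied.pos and op.user_id < applied.user_id)
            return op if stays_left else replace(op, pos=op.pos + shift)
        # delete: the target moved right if the insert landed at or before it
        return replace(op, pos=op.pos + shift) if op.pos >= applied.pos else op
    # `applied` is a single-character delete
    if op.type == "insert":
        return replace(op, pos=op.pos - 1) if op.pos > applied.pos else op
    if op.pos == applied.pos:
        return None                     # both users deleted the same character
    return replace(op, pos=op.pos - 1) if op.pos > applied.pos else op


doc = "abcd"
a = Op("insert", 1, "X", "alice")
b = Op("delete", 2, user_id="bob")      # removes "c"

# Either arrival order converges to the same text once ops are transformed.
via_a = apply_op(apply_op(doc, a), transform(b, a))
via_b = apply_op(apply_op(doc, b), transform(a, b))
print(via_a, via_b)
```

With a central server, only the server's arrival order is authoritative. Each client applies the same transform to remote operations against its own unacknowledged ones.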
Bonus Points
Differential Synchronization: Discussing the trade-offs between sending full diffs vs. atomic operations (OT).
Fractional Indexing: Using fractional position keys (for example decimal strings, optionally with random jitter to reduce key collisions) to handle concurrent insertions between two characters without re-indexing the whole document.
Checksum Validation: Periodically sending a document hash to clients to detect state divergence and trigger a "forced resync" if the OT logic drifts.
Operational Compaction: Designing a background job to merge a long tail of atomic operations into a single snapshot to optimize storage and document load times.
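As an illustration of the Fractional Indexing point, here is a minimal sketch that generates a key between two neighbours. For brevity it uses exact rationals from Python's fractions module instead of string keys, and key_between is a hypothetical helper name.

```python
from fractions import Fraction
from typing import Optional


def key_between(left: Optional[Fraction], right: Optional[Fraction]) -> Fraction:
    """Return a sort key strictly between `left` and `right`.

    None means "no neighbour on that side" (start or end of the document).
    """
    lo = Fraction(0) if left is None else left
    if right is None:
        return lo + 1                   # append after the last element
    if not lo < right:
        raise ValueError("left key must sort before right key")
    return (lo + right) / 2             # midpoint; neighbours keep their keys


first = key_between(None, None)         # first character in an empty document
last = key_between(first, None)         # appended after it
middle = key_between(first, last)       # inserted between the two
earlier = key_between(first, middle)    # inserted again in the same gap

print(sorted([last, earlier, first, middle]))
```

Two users inserting into the same gap at the same time would compute the same midpoint. A real scheme therefore needs a tie-breaker such as (key, userId) ordering or random jitter.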
Design Breakdown
Functional Requirements
Core Use Cases:
Real-time collaborative editing (multiple users see each other's changes instantly).
Document creation, retrieval, and persistence.
User Presence (highlighting who is currently viewing/editing).
Scope Control:
In-Scope: Text editing, OT-based conflict resolution, basic permissions, snapshots.
Out-of-Scope: Offline mode, complex table formatting, image/video processing, comments/suggestions (post-MVP).
Non-Functional Requirements
Scale: Support 10M DAU and millions of stored documents.
Latency: <100ms for local echoing and <500ms for remote synchronization.
Availability & Reliability: 99.9% availability; document data must not be lost (high durability).
Consistency: Eventual consistency across all clients, but strong ordering of operations per document via the central OT server.
Fault Tolerance: Handle sudden WebSocket disconnections gracefully without corrupting document state.
Estimation
Traffic:
10M DAU, 10% active at any time = 1M concurrent users.
Avg 1 operation (keystroke/format) per second per active user = 1M QPS (Writes/Operations).
Storage:
100M total documents × 100 KB average size = 10 TB for current state.
Operation log (history) can be 10x larger = 100 TB.
Bandwidth:
1M Ops/sec × 500 bytes (metadata + char) = 500 MB/s (Incoming).
Outgoing is amplified by the number of collaborators (e.g., 5 users per doc = 2.5 GB/s Outgoing).
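The estimates above can be checked with a few lines of arithmetic. All inputs come from the stated assumptions; units are decimal (1 TB = 10^12 bytes), and the fan-out of 5 mirrors the "5 users per doc" example.

```python
DAU = 10_000_000
ACTIVE_PERCENT = 10            # share of DAU online at any moment
OPS_PER_USER_PER_SEC = 1
TOTAL_DOCS = 100_000_000
AVG_DOC_BYTES = 100_000        # 100 KB
HISTORY_MULTIPLIER = 10        # operation log relative to current state
BYTES_PER_OP = 500
FAN_OUT = 5                    # collaborators per document

concurrent_users = DAU * ACTIVE_PERCENT // 100
ops_per_sec = concurrent_users * OPS_PER_USER_PER_SEC
current_state_tb = TOTAL_DOCS * AVG_DOC_BYTES / 1e12
history_tb = current_state_tb * HISTORY_MULTIPLIER
inbound_mb_s = ops_per_sec * BYTES_PER_OP / 1e6
outbound_gb_s = inbound_mb_s * FAN_OUT / 1e3

print(f"concurrent users: {concurrent_users:,}")
print(f"operations/sec:   {ops_per_sec:,}")
print(f"current state:    {current_state_tb:.0f} TB")
print(f"operation log:    {history_tb:.0f} TB")
print(f"inbound:          {inbound_mb_s:.0f} MB/s")
print(f"outbound:         {outbound_gb_s:.1f} GB/s")
```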
Blueprint
Concise Summary: A WebSocket-based architecture using a centralized Operational Transformation (OT) engine to serialize and transform document edits, backed by a NoSQL store for operations and an Object Store for snapshots.
Major Components:
WebSocket Gateway: Manages persistent connections and routes operations to the appropriate session handler.
OT Collaboration Service: The "brain" that receives operations, transforms them against the version history, and broadcasts them to other collaborators.
Document Repository: Stores the "Source of Truth" operations log.
Presence Service: Tracks active users in a document using a TTL-based cache.
Simplicity Audit: This design avoids the complexity of decentralized CRDTs, which are harder to implement for rich text and have higher metadata overhead, opting for the industry-standard OT approach.
Architecture Decision Rationale:
OT vs CRDT: OT allows for a single "Server Version" which simplifies conflict resolution and allows for a smaller client-side footprint.
NoSQL for Ops: MongoDB or Cassandra is ideal for storing an ordered list of operations per document ID.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a Global Load Balancer with WebSocket support.
Security & Perimeter: SSL termination at the Load Balancer. JWT-based authentication passed in the initial WebSocket upgrade request.
Rate Limiting: Per-user limits on operation frequency to prevent "keyboard smashing" DoS attacks.
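One common way to implement the per-user limit is a token bucket. The sketch below is one possible shape; the rate of 10 operations per second and the burst of 20 are illustrative assumptions, not values from the design. The caller passes the clock in so the behaviour is deterministic.

```python
class TokenBucket:
    """Per-user limiter: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full so short bursts are allowed
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1            # spend one token per accepted operation
            return True
        return False


bucket = TokenBucket(rate=10, capacity=20, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(21)]
print(burst.count(True))                # operations accepted in the initial burst
print(bucket.allow(now=1.0))            # accepted again after one second of refill
```

In a gateway there would be one bucket per authenticated user, keyed by the user ID taken from the JWT.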
Service
Topology & Scaling:
Collaboration Service: Stateful-ish. While the service can scale horizontally, a specific document session should ideally be handled by one node, or coordinated via Redis Pub/Sub, to maintain operation ordering.
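One way to pin a document session to a single node is rendezvous (highest-random-weight) hashing at the gateway. The sketch below names this technique as an assumption; the design above only requires that a document's operations reach one ordering context. The node names and owner_node are hypothetical.

```python
import hashlib
from typing import Sequence


def owner_node(doc_id: str, nodes: Sequence[str]) -> str:
    """Pick the collaboration node for a document via rendezvous hashing."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{doc_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every gateway computes the same scores, so they agree without coordination.
    return max(nodes, key=score)


nodes = ["collab-1", "collab-2", "collab-3"]
owner = owner_node("doc-42", nodes)

# Removing a node that does not own the document leaves the owner unchanged.
bystander = next(n for n in nodes if n != owner)
remaining = [n for n in nodes if n != bystander]
print(owner, owner_node("doc-42", remaining))
```

If the owning node itself disappears, only its documents are remapped.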
API Schema Design:
submitOperation (WebSocket):
Payload:
{ docId: string, userId: string, baseVersion: int, op: { type: 'insert'|'delete', pos: int, val: string } }
fetchDocument (REST):
Returns:
{ latestSnapshot: string, version: int, pendingOps: [] }
Resilience:
Acknowledgment Pattern: Clients wait for a server ACK before sending the next batch of operations. This prevents the "out of order" transformation nightmare.
Storage
Access Pattern:
High write (append-only) for operations.
High read for initial document load.
Database Table Design (Operations DB):
doc_id (Partition Key)
version_seq (Clustering Key/Sort Key)
user_id, op_data (JSON), timestamp
Technical Selection:
Operations DB: Cassandra or DynamoDB. Excellent for append-only time-series-like data (operations sequence).
Snapshot Store: Amazon S3 or MongoDB. Stores the full state of the document at version X every 100 operations.
Distribution Logic: Shard by doc_id. This ensures all operations for a single document are co-located, allowing for efficient range scans of operations since the last snapshot.
Cache
Purpose: Presence (who is editing) and Session Routing.
Key-Value Schema:
Key: presence:{docId}, Value: Set<userId>.
TTL: 30 seconds (heartbeat-based).
Technical Selection: Redis. Used for low-latency presence updates and as a message bus for the Collaboration Service to broadcast ops to other server nodes.
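The heartbeat-based presence logic can be sketched with an in-memory stand-in for the cache. PresenceTracker and its methods are hypothetical names. The sketch expires each user individually, which a single key-level TTL on a set would not do; in Redis a similar effect is often obtained with a sorted set scored by last-heartbeat time.

```python
from typing import Dict, Set


class PresenceTracker:
    """In-memory stand-in for the presence cache with per-user expiry."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self._last_seen: Dict[str, Dict[str, float]] = {}  # docId -> userId -> time

    def heartbeat(self, doc_id: str, user_id: str, now: float) -> None:
        # Each heartbeat refreshes the user's expiry for this document.
        self._last_seen.setdefault(doc_id, {})[user_id] = now

    def active_users(self, doc_id: str, now: float) -> Set[str]:
        seen = self._last_seen.get(doc_id, {})
        for user_id in [u for u, t in seen.items() if now - t >= self.ttl]:
            del seen[user_id]           # silent for a full TTL: treat as offline
        return set(seen)


presence = PresenceTracker(ttl_seconds=30)
presence.heartbeat("doc-42", "alice", now=0)
presence.heartbeat("doc-42", "bob", now=20)

print(presence.active_users("doc-42", now=25))
print(presence.active_users("doc-42", now=35))
```

Clients would send a heartbeat over the existing WebSocket well inside the TTL, for example every 10 seconds, so that one lost message does not drop them from the list.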
Data Processing
Processing Model: Snapshot Worker (Background).
Logic: Periodically reads the last 100 operations for a document, applies them to the last snapshot, and writes a new snapshot to the Object Store. It then marks old operations for archival/deletion (Log Compaction).
Technical Selection: Temporal or a simple K8s CronJob to trigger the consolidation.
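The snapshot worker's core step can be sketched as a pure function that folds logged operations into the previous snapshot. The sketch assumes a delete's val holds the removed text, so its length gives the span; Snapshot, LoggedOp and compact are illustrative names. Reading from the Operations DB and writing to the Object Store are left out.

```python
from typing import List, NamedTuple, Tuple


class LoggedOp(NamedTuple):
    version_seq: int
    type: str       # 'insert' or 'delete'
    pos: int
    val: str        # inserted text, or the text removed by a delete


class Snapshot(NamedTuple):
    version: int
    text: str


def apply_op(text: str, op: LoggedOp) -> str:
    if op.type == "insert":
        return text[:op.pos] + op.val + text[op.pos:]
    if op.type == "delete":
        return text[:op.pos] + text[op.pos + len(op.val):]
    raise ValueError(f"unknown op type: {op.type}")


def compact(snapshot: Snapshot, op_log: List[LoggedOp]) -> Tuple[Snapshot, List[LoggedOp]]:
    """Fold newer ops into the snapshot; return it plus the ops safe to archive."""
    text, version = snapshot.text, snapshot.version
    for op in sorted(op_log, key=lambda o: o.version_seq):
        if op.version_seq <= version:
            continue                    # already part of the previous snapshot
        if op.version_seq != version + 1:
            raise ValueError("gap in operation log; refusing to snapshot")
        text, version = apply_op(text, op), op.version_seq
    archivable = [op for op in op_log if op.version_seq <= version]
    return Snapshot(version, text), archivable


old = Snapshot(version=2, text="helo")
log = [
    LoggedOp(2, "insert", 3, "o"),      # already reflected in the old snapshot
    LoggedOp(3, "insert", 3, "l"),
    LoggedOp(4, "insert", 5, " world"),
    LoggedOp(5, "delete", 0, "h"),
]
new_snapshot, archivable = compact(old, log)
print(new_snapshot)
```

Refusing to snapshot across a gap keeps a partially replicated log from being baked into the stored state.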
Infrastructure (Optional)
Observability: Track transformation_latency (how long the OT engine takes) and websocket_count.
Wrap Up
Advanced Topics
Trade-offs (OT vs CRDT):
OT requires a central server to sequence operations. This makes it easier to enforce permissions and "single truth" but creates a scaling challenge for the central server.
CRDT allows for decentralized merges (P2P), but it's significantly more complex to implement for rich text formatting and results in much larger data overhead (tombstones, unique IDs per character).
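Reliability: If the Collaboration Service node fails, the client re-establishes a WebSocket with another node. The new node rebuilds the latest state from the most recent snapshot plus the operations logged after it in the Operations DB, and the client can then resubmit any operations that were never acknowledged.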
Optimization: Operational Grouping. Instead of sending a WebSocket message for every single character, the client buffers keystrokes for 50-100ms and sends them as a single "chunked" operation.
Security: Use TLS 1.3 for all WebSocket traffic. Implement Document-level ACLs (Access Control Lists) checked at the start of every session.
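The Operational Grouping optimization above can be sketched as a small client-side buffer that merges contiguous keystrokes typed inside one window into a single insert. The 75 ms window is an assumed value within the stated 50-100ms range, and KeystrokeBuffer is a hypothetical name. A real client would also flush on a timer; here a flush happens on the next keystroke or on an explicit call.

```python
from typing import Dict, List, Optional, Union

OpDict = Dict[str, Union[str, int]]


class KeystrokeBuffer:
    """Merges contiguous single-character inserts into one chunked operation."""

    def __init__(self, window_ms: int = 75):
        self.window_ms = window_ms
        self._start: Optional[int] = None   # position of the first buffered char
        self._text = ""
        self._opened_at = 0

    def type_char(self, pos: int, ch: str, now_ms: int) -> List[OpDict]:
        """Buffer one keystroke; return any operation that is ready to send."""
        ready: List[OpDict] = []
        contiguous = self._start is not None and pos == self._start + len(self._text)
        in_window = now_ms - self._opened_at < self.window_ms
        if self._start is not None and not (contiguous and in_window):
            ready.append(self._take())      # window closed or cursor jumped
        if self._start is None:
            self._start, self._opened_at = pos, now_ms
        self._text += ch
        return ready

    def flush(self) -> List[OpDict]:
        return [self._take()] if self._start is not None else []

    def _take(self) -> OpDict:
        op: OpDict = {"type": "insert", "pos": self._start, "val": self._text}
        self._start, self._text = None, ""
        return op


buf = KeystrokeBuffer(window_ms=75)
sent: List[OpDict] = []
for pos, ch, t in [(0, "h", 0), (1, "i", 30), (2, "!", 60), (3, "x", 200)]:
    sent += buf.type_char(pos, ch, t)
sent += buf.flush()
print(sent)
```

Four keystrokes become two WebSocket messages here. Combined with the acknowledgment pattern, the client would hold the second message until the server ACKs the first.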