The Question
Design a Scalable Calendar System
Design a global calendar system similar to Google Calendar. The system must support creating, updating, and deleting events, including complex recurring schedules (e.g., 'every second Tuesday'). It must handle invitations and RSVPs for millions of concurrent users. Key challenges include efficient time-range queries across different timezones, handling recurring event exceptions, and ensuring high availability for viewing schedules while maintaining transactional consistency for event edits. Provide a detailed storage schema and explain how you would scale the system for 100 million daily active users.
PostgreSQL
Redis
Kafka
iCal RRULE
Spanner
Citus
JWT
GSLB
Questions & Insights
Clarifying Questions
Scale: What is the scale of the system in terms of Daily Active Users (DAU) and the number of events per user? (Assumption: 100M DAU, average 100 events per user per year).
Recurrence: Do we need to support complex recurring events (e.g., every first Monday of the month)? (Assumption: Yes, supporting iCal RRULE standard is critical).
Concurrency: How should we handle multiple users editing the same shared event simultaneously? (Assumption: Last-write-wins is acceptable for MVP, but ACID transactions are required for individual event consistency).
Notifications: Are real-time notifications (Push/Email) required? (Assumption: Yes, for event reminders and invitations).
Search: Is full-text search across all historical events required for the MVP? (Assumption: No, we will focus on time-range queries for the calendar view).
Thinking Process
Core Bottleneck: The primary challenge is efficiently querying events within a specific time range (Day/Week/Month view) while handling "expansion" of recurring rules without blowing up storage.
Key Questions for Architecture:
How do we model recurring events to balance storage efficiency vs. query latency?
How do we handle high read-to-write ratios for the "current month" view?
How do we ensure reliable delivery of notifications across millions of users?
How do we shard the database to prevent hot partitions when a popular "public" calendar is viewed by millions?
Bonus Points
Timezone Resilience: Handling "Floating Time" (events that stay at 9 AM regardless of timezone) vs. "Absolute Time" (conference calls) using UTC + Timezone ID.
Conflict Resolution: Using logical clocks or versioning to handle offline edits from mobile devices during synchronization.
Availability Sharding: Sharding by User_ID to ensure that a single user's calendar experience is highly localized and consistent, while using a secondary index for "Shared/Public" calendars.
RRULE Expansion: Implementing a "Virtual Expansion" layer in the application logic to calculate recurring instances on-the-fly for the requested time-window, combined with a "Sync-to-Disk" for exceptions to the rule.
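The versioning idea under Conflict Resolution can be sketched as an optimistic version check. This is a minimal illustration under stated assumptions, not a prescribed sync protocol: StoredEvent, apply_edit, and StaleEditError are hypothetical names, and an in-memory object stands in for the database row.

```python
from dataclasses import dataclass


class StaleEditError(Exception):
    """Raised when an edit was based on an outdated version of the event."""


@dataclass
class StoredEvent:
    # Hypothetical in-memory stand-in for a row in the Events table.
    event_id: str
    title: str
    version: int = 1


def apply_edit(event: StoredEvent, new_title: str, base_version: int) -> StoredEvent:
    """Apply an edit only if the client based it on the latest version."""
    if base_version != event.version:
        # The client edited an old copy (for example while offline) and must re-sync.
        raise StaleEditError(
            f"expected version {event.version}, got {base_version}"
        )
    event.title = new_title
    event.version += 1
    return event
```

A mobile client would send the version it last synced along with each edit; a rejected edit signals that it should fetch the current state and merge or retry.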
Design Breakdown
Functional Requirements
Core Use Cases:
Create, Update, and Delete events (Single and Recurring).
View calendar by time range (Day, Week, Month).
Invite guests to events and track RSVP status (Accepted, Declined, Tentative).
Receive notifications/reminders before an event starts.
Scope Control:
In-scope: Core event management and time-range queries.
Out-of-scope: Room/Resource booking, full-text search, and third-party calendar integrations (Outlook/Apple sync).
Non-Functional Requirements
Scale: Support 100M DAU with 10k+ QPS for reads.
Latency: Calendar view (read) should load in < 200ms.
Availability & Reliability: 99.99% availability; losing a meeting invite is a high-severity failure.
Consistency: Strong consistency for event edits; eventual consistency for guest RSVP updates is acceptable.
Security & Privacy: Private events must only be visible to invited participants.
Estimation
Traffic Estimation:
Read QPS (View Calendar): 100M users * 5 views/day = 500M reads/day ≈ 6,000 QPS.
Write QPS (Create/Edit): 100M users * 0.5 events/day = 50M writes/day ≈ 600 QPS.
Peak QPS: 5x average ≈ 30,000 Read QPS.
Storage Estimation:
100 events/user/year * 100M users = 10B events/year.
~500 bytes per event (metadata, RRULE, attendee IDs).
10B events * 500 bytes = 5 TB/year.
Bandwidth Estimation:
Incoming: 600 writes/sec * 1KB = 600 KB/s.
Outgoing: 6000 reads/sec * 5KB (batch of events) = 30 MB/s.
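The estimates above can be re-derived with a few lines of arithmetic. This sketch uses only the figures stated in this section; the variable names are illustrative, and terabytes are taken as decimal (10^12 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 100_000_000
reads_per_day = dau * 5      # 5 calendar views per user per day
writes_per_day = dau * 0.5   # 0.5 event writes per user per day

read_qps = reads_per_day / SECONDS_PER_DAY    # about 5,787, rounded to 6,000
write_qps = writes_per_day / SECONDS_PER_DAY  # about 579, rounded to 600
peak_read_qps = 5 * 6_000                     # 5x the rounded average

events_per_year = 100 * dau                          # 10 billion events
storage_tb_per_year = events_per_year * 500 / 1e12   # 500 bytes per event

ingress_kb_per_s = 600 * 1             # 600 writes/s * 1 KB
egress_mb_per_s = 6_000 * 5 / 1_000    # 6,000 reads/s * 5 KB, expressed in MB/s

print(round(read_qps), round(write_qps), storage_tb_per_year, egress_mb_per_s)
```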
Blueprint
Concise Summary: A microservices-based architecture centered around a "Service-per-Entity" model, using a Relational Database for ACID compliance and a Fan-out pattern for event invitations.
Major Components:
Event Service: Handles the lifecycle of events (CRUD) and stores the recurrence rules.
Query Service: Specialized service that fetches events and "expands" RRULEs into individual instances for a specific time window.
Notification Service: Asynchronous engine for processing and delivering reminders via Kafka.
Relational DB (Spanner/Postgres): Chosen for transactional integrity when updating shared events.
Simplicity Audit: We avoid pre-expanding recurring events into billions of rows in the DB, instead calculating them at read-time for the requested window, which saves massive storage and avoids synchronization nightmares.
Architecture Decision Rationale:
Why this architecture?: Separating the "Write Path" (Event Service) from the "Read Path" (Query Service) allows us to optimize the complex logic of recurrence expansion independently.
Functional Satisfaction: Covers CRUD, range queries, and RSVPs via the Fan-out/Messaging layer.
Non-functional Satisfaction: High availability is achieved through stateless services and a distributed database; latency is managed via caching the current month's view.
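Read-time expansion can be illustrated with a deliberately small sketch. It handles only a fixed weekly interval (the equivalent of FREQ=WEEKLY;INTERVAL=n) rather than the full iCal RRULE grammar, and it assumes naive datetimes already expressed in the event's own timezone. The function name expand_weekly and the shape of the exceptions mapping are assumptions made for illustration; a production Query Service would more likely rely on a complete RFC 5545 implementation.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def expand_weekly(
    first_start: datetime,
    interval_weeks: int,
    window_start: datetime,
    window_end: datetime,
    exceptions: Optional[Dict[datetime, Optional[datetime]]] = None,
) -> List[datetime]:
    """Return instance start times that fall in [window_start, window_end).

    `exceptions` maps an instance's original start time to a replacement
    start time, or to None when that single instance was deleted.
    """
    exceptions = exceptions or {}
    step = timedelta(weeks=interval_weeks)

    # Jump close to the window instead of iterating from the first instance.
    current = first_start
    if window_start > first_start:
        skipped = (window_start - first_start) // step
        current = first_start + skipped * step

    instances: List[datetime] = []
    while current < window_end:
        start = exceptions.get(current, current)  # apply per-instance override
        if start is not None and window_start <= start < window_end:
            instances.append(start)
        current += step
    return instances
```

One limitation of this sketch: an instance whose original time lies outside the window but was moved into it is missed. A fuller version would also query stored exceptions by their new start time.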
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Global Load Balancer (GSLB) routes users to the nearest regional data center based on IP.
Security & Perimeter:
API Gateway: Handles JWT-based authentication and rate-limiting (e.g., 100 requests per minute per user).
SSL Termination: Performed at the gateway to reduce latency for internal services.
Service
Topology & Scaling: Stateless microservices deployed in Multi-AZ clusters. Autoscale based on CPU and request latency.
API Schema Design:
POST /v1/events: Creates an event. Body includes title, start_time, end_time, timezone, rrule (iCal format).
GET /v1/events?start=...&end=...: Returns the expanded list of events for the window.
PUT /v1/events/{id}/rsvp: Updates the RSVP status for a specific user.
Resilience & Reliability:
Retry Policy: Exponential backoff for the Fan-out workers when updating attendee calendars.
Circuit Breaker: Implemented on the Query Service to prevent DB saturation if the Redis cache is down.
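A circuit breaker of the kind described for the Query Service might look like the following. This is a generic sketch of the pattern, not a specific library's API: the class name, thresholds, and the injectable clock are all assumptions chosen to keep the example testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows a
    trial call once `reset_after` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the database may proceed."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the breaker once the limit is reached."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

The Query Service would call allow() before each database fallback and shed load (for example by returning an error or a stale view) while the breaker is open.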
Storage
Access Pattern: Heavy read-by-time-range. Frequent updates to the status of attendees.
Database Table Design:
Events: event_id (PK), creator_id, start_time, end_time, timezone, rrule_pattern, metadata.
Attendees: event_id (Composite PK), user_id (Composite PK), status (Accepted/Declined/Tentative), reminders.
EventExceptions: parent_event_id, original_time, is_deleted, new_start_time. (To handle "Edit only this instance" logic.)
Technical Selection: PostgreSQL with Citus or Google Spanner.
Rationale: We need strong ACID transactions to prevent double-booking or inconsistent states when multiple users update a shared event simultaneously.
Distribution Logic: Shard by user_id. This ensures that all data for a single user's primary calendar resides on one shard, making "Month View" queries extremely fast.
Cache
Purpose & Justification: Reduce DB load for the most common operation: viewing the current month.
Key-Value Schema:
Key: user_calendar:{user_id}:{year}:{month}
Value: Serialized list of expanded event instances.
TTL: 1 hour, or invalidated on any write to the Events or Attendees table for that user.
Technical Selection: Redis.
Failure Handling: If Redis is cold or fails, the Query Service falls back to the DB and performs RRULE expansion in-memory.
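The cache-aside flow described in this section can be sketched as follows. A plain dict stands in for Redis so the example runs without a server. MonthViewCache, month_key, and the load_from_db callback are hypothetical names, and the 1-hour TTL is noted in a comment only, since the dict has no expiry.

```python
import json
from typing import Callable, Dict, List


def month_key(user_id: str, year: int, month: int) -> str:
    """Build the cache key in the user_calendar:{user_id}:{year}:{month} form."""
    return f"user_calendar:{user_id}:{year}:{month}"


class MonthViewCache:
    """Cache-aside wrapper; a dict stands in for Redis in this sketch."""

    def __init__(self, load_from_db: Callable[[str, int, int], List[dict]]) -> None:
        self.store: Dict[str, str] = {}
        self.load_from_db = load_from_db

    def get_month(self, user_id: str, year: int, month: int) -> List[dict]:
        key = month_key(user_id, year, month)
        cached = self.store.get(key)
        if cached is not None:
            return json.loads(cached)
        # Cache miss: fall back to the DB read plus in-memory RRULE expansion.
        events = self.load_from_db(user_id, year, month)
        self.store[key] = json.dumps(events)  # with Redis, set a 1-hour TTL here
        return events

    def invalidate_user(self, user_id: str) -> None:
        """Drop every cached month for a user after a write to their data."""
        prefix = f"user_calendar:{user_id}:"
        for key in [k for k in self.store if k.startswith(prefix)]:
            del self.store[key]
```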
Messaging
Purpose & Decoupling:
Fan-out Queue: When an event is created with 50 guests, we don't want the user to wait for 50 DB writes.
Notification Queue: Decouples the timing logic (reminders) from the event creation logic.
Technical Selection: Kafka.
Failure Handling: Dead-letter queues (DLQ) for failed notification attempts.
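The fan-out and dead-letter behavior can be illustrated with an in-process queue standing in for a Kafka topic. This is a simplified sketch: fan_out_invites and the write_attendee callback are hypothetical, retries are immediate rather than backed off, and a real worker would consume from Kafka and publish failures to a separate DLQ topic.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def fan_out_invites(
    event_id: str,
    guest_ids: List[str],
    write_attendee: Callable[[str, str], None],
    max_attempts: int = 3,
) -> List[Tuple[str, str]]:
    """Process one invite message per guest; return messages sent to the DLQ."""
    # Each message carries (event_id, guest_id, attempt number).
    queue: Deque[Tuple[str, str, int]] = deque(
        (event_id, guest, 1) for guest in guest_ids
    )
    dead_letters: List[Tuple[str, str]] = []
    while queue:
        ev, guest, attempt = queue.popleft()
        try:
            write_attendee(ev, guest)
        except Exception:
            if attempt < max_attempts:
                queue.append((ev, guest, attempt + 1))  # re-enqueue for retry
            else:
                dead_letters.append((ev, guest))  # give up; park for inspection
    return dead_letters
```

The event creator's request returns as soon as the messages are enqueued; the per-guest writes happen in the worker.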
Infrastructure (Optional)
Observability: Prometheus metrics for P99 latency tracking of the RRULE expansion logic.
Distributed Coordination: Not needed for MVP (sharding is handled by the DB layer).
Wrap Up
Advanced Topics
Trade-offs (RRULE Expansion): Expanding recurring events on-the-fly (Read-time) vs. Pre-calculating (Write-time). We chose Read-time expansion to keep storage lean and simplify "Update All Future Events" logic, which is a nightmare in pre-calculated systems.
Reliability: We use a ReminderScheduler (part of the Notification Worker) which periodically polls an "Upcoming Reminders" index in the DB (indexed by reminder_time) to push messages to Kafka.
Bottleneck Analysis: A "Public" calendar (e.g., Holidays) with millions of subscribers could create a hot shard.
Optimization: Public calendars are cached at the Edge (CDN) or in a global Redis cluster since they are read-only for most users.
Security: Row Level Security (RLS) or application-level checks ensure user_A cannot query user_B's events unless an entry exists in the Attendees table.
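An application-level version of this check might look like the sketch below, using an in-memory SQLite database as a stand-in for the sharded store. The can_view function, the title column, and the sample rows are illustrative, and treating the event creator as always authorized is an assumption added here. A PostgreSQL deployment could express the same predicate as a Row Level Security policy instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT PRIMARY KEY, creator_id TEXT, title TEXT);
    CREATE TABLE attendees (
        event_id TEXT, user_id TEXT, status TEXT,
        PRIMARY KEY (event_id, user_id)
    );
    INSERT INTO events VALUES ('e1', 'user_a', 'Design review');
    INSERT INTO attendees VALUES ('e1', 'user_b', 'Accepted');
""")


def can_view(db: sqlite3.Connection, user_id: str, event_id: str) -> bool:
    """True if the user created the event or has an Attendees row for it."""
    row = db.execute(
        """
        SELECT 1 FROM events e
        LEFT JOIN attendees a
          ON a.event_id = e.event_id AND a.user_id = ?
        WHERE e.event_id = ?
          AND (e.creator_id = ? OR a.user_id IS NOT NULL)
        """,
        (user_id, event_id, user_id),
    ).fetchone()
    return row is not None
```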