The Question
Design a Scalable Multi-Channel Notification System
Design a highly available and scalable notification system capable of delivering millions of messages across Push (iOS/Android), SMS, and Email. The system must handle varying priorities (e.g., OTP vs Marketing), ensure at-least-once delivery, manage user preferences, and gracefully handle third-party provider failures or rate limits. Detail the architectural choices for decoupling, message persistence, and handling high-concurrency peak loads.
Kafka
Redis
PostgreSQL
Cassandra
APNs
FCM
Twilio
SendGrid
JWT
OAuth2
gRPC
Flink
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected volume of notifications per day? Are there specific peak periods (e.g., marketing blasts or breaking news)?
Latency SLAs: Are there different priority levels for notifications? For example, should an OTP (One-Time Password) have a stricter latency requirement than a weekly newsletter?
Delivery Guarantees: Is "exactly-once" delivery required, or is "at-least-once" acceptable? (Usually, at-least-once is standard for notifications to avoid the overhead of distributed transactions).
Third-Party Providers: Do we have specific vendors (e.g., Twilio for SMS, SendGrid for Email) or should the system be provider-agnostic to allow for failover?
User Preferences: Does the system need to manage user opt-in/opt-out and preferred channels (e.g., "only send SMS if push fails")?
Assumptions:
Scale: 100 million notifications per day.
Latency: High-priority messages (OTP) < 5 seconds; Low-priority < 30 minutes.
Reliability: At-least-once delivery is the primary goal.
Providers: We will use APNs (iOS), FCM (Android), Twilio (SMS), and SendGrid (Email).
Thinking Process
Decoupling is King: How do we prevent a slow third-party provider (like a lagging SMS gateway) from backing up the entire system?
Priority Queueing: How do we ensure a massive marketing blast doesn't delay critical account security alerts?
Idempotency & Deduplication: How do we prevent sending the same notification twice if a worker retries a failed task?
Rate Limiting & Throttling: How do we protect both our internal services and our users from notification fatigue?
Bonus Points
Smart Retries with Exponential Backoff: Implementing jittered backoff logic specifically tuned to different error types (e.g., 429 Too Many Requests vs. 500 Internal Server Error).
Provider Failover Strategy: Real-time switching between SMS/Email providers based on health checks and delivery success rates.
Fan-out Optimization: Handling notifications for "Million-follower" accounts using a hybrid push/pull model or specialized shard routing.
Message Deduplication via Bloom Filters: Using memory-efficient data structures to check for duplicate requests at the entry point for high-velocity streams.
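The smart-retry idea above can be sketched as a small delay function. This is a minimal illustration, not a tuned policy: the function name `backoff_delay`, the base delay, the cap, and the choice to quadruple the base for 429 responses are all assumptions made for the example.

```python
import random

def backoff_delay(attempt, status_code, retry_after=None,
                  base=0.5, cap=60.0, rng=random.random):
    """Return seconds to wait before retry `attempt` (0-based), or None to give up."""
    if status_code == 429:
        # Rate limited: honor the provider's Retry-After hint when present,
        # otherwise back off from a larger base (assumed factor of 4).
        if retry_after is not None:
            return float(retry_after)
        base = base * 4
    elif 500 <= status_code < 600:
        pass  # transient server error: standard exponential backoff
    else:
        return None  # other client errors: retrying will not help
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling) so retries do not synchronize.
    return rng() * ceiling
```

A worker would call this after each failed provider call and stop retrying when it returns None.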
Design Breakdown
Functional Requirements
Core Use Cases:
Send push notifications to iOS and Android devices.
Send SMS messages to mobile numbers.
Send emails to user addresses.
Support template-based message generation.
Provide an endpoint for users to manage notification preferences (opt-in/out).
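As a rough illustration of the template and preference use cases above, the sketch below renders a message from a template and filters the requested channels against a user's opt-outs. The helper names and the default of treating a channel as enabled unless the user opted out are assumptions for the example. A real system may require explicit opt-in for marketing traffic.

```python
from string import Template

def render(template_text, params):
    """Fill $placeholders in the template; raises KeyError if a parameter is missing."""
    return Template(template_text).substitute(params)

def allowed_channels(requested, preferences):
    """Keep only channels the user has not disabled.

    Assumption: a channel with no stored preference counts as enabled.
    """
    return [channel for channel in requested if preferences.get(channel, True)]

message = render("Your code is $code", {"code": "123456"})
channels = allowed_channels(["push", "sms"], {"sms": False})
```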
Scope Control:
In-scope: API for ingestion, prioritization, channel-specific workers, and basic delivery tracking.
Out-of-scope: Complex analytics/click-through tracking, advanced user segmentation/targeting (marketing engine), and in-app message inbox UI.
Non-Functional Requirements
Scale: Support 100M+ notifications/day with peaks above 10k QPS.
Latency: P99 < 5 seconds for high-priority messages.
Availability & Reliability: 99.99% availability; messages should not be lost once accepted by the API.
Consistency: Eventual consistency for user preference updates.
Fault Tolerance: Handle third-party provider outages gracefully via retries and queues.
Security: Secure storage of device tokens and PII (Phone numbers/Emails).
Estimation
Traffic Estimation:
100M notifications / 86,400 seconds ≈ 1,150 average QPS.
Peak QPS (10x) ≈ 11,500 QPS.
Storage Estimation:
Metadata (User ID, Token, Status) ≈ 500 bytes per notification.
100M * 500 bytes = 50 GB/day.
Retain logs for 30 days = 1.5 TB.
Bandwidth Estimation:
Inbound: 11,500 QPS * 1 KB/request ≈ 11.5 MB/s.
Outbound: Similar, plus overhead for provider-specific payloads.
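The arithmetic above can be reproduced in a few lines. The constant names are made up for this sketch, and 1 KB is taken as 1,000 bytes. The unrounded results come out slightly higher than the rounded figures quoted in the estimate.

```python
NOTIFICATIONS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 10
METADATA_BYTES = 500      # per notification
REQUEST_BYTES = 1_000     # 1 KB per request
RETENTION_DAYS = 30

avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY                 # about 1,157
peak_qps = avg_qps * PEAK_MULTIPLIER                              # about 11,574
daily_storage_gb = NOTIFICATIONS_PER_DAY * METADATA_BYTES / 1e9   # 50 GB/day
retained_tb = daily_storage_gb * RETENTION_DAYS / 1e3             # 1.5 TB
peak_inbound_mb_s = peak_qps * REQUEST_BYTES / 1e6                # about 11.6 MB/s
```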
Blueprint
Concise Summary: A microservices-based architecture utilizing a distributed message queue (Kafka) to decouple ingestion from delivery. Workers consume from prioritized topics and dispatch to third-party providers.
Major Components:
Notification API: The entry point for all notification requests, performing validation and authentication.
Task Queue (Kafka): Acts as a buffer and provides prioritization by separating high and low priority traffic into different topics.
Workers: Scalable consumer groups that handle template rendering and provider-specific API calls (APNs, FCM, etc.).
Metadata Store: Persists user preferences, device tokens, and notification status.
Simplicity Audit: This design avoids complex real-time stream processing or custom delivery protocols, relying on industry-standard providers and proven queueing patterns.
Architecture Decision Rationale:
Why this architecture?: Queues are essential because third-party providers have unpredictable latencies and rate limits. Kafka allows for high throughput and replayability if a worker group fails.
Functional Satisfaction: Covers all four delivery targets (iOS push, Android push, SMS, Email) through specialized worker logic.
Non-functional Satisfaction: Scalable horizontally; provides fault tolerance through persistent queueing.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Not heavily utilized for notification sending, but Geo-DNS routes requests to the API Gateway in the nearest regional data center.
Security & Perimeter:
API Gateway: Handles SSL termination and OAuth2/JWT validation.
Rate Limiting: Implemented at the API Gateway to prevent any single internal service from overwhelming the notification system (e.g., a buggy script sending millions of emails).
Service
Topology & Scaling: Stateless Notification Services deployed in multiple Availability Zones (AZs). Scaling is based on Request-per-Second (RPS) and CPU.
API Schema Design:
POST /v1/notifications
Protocol: REST/JSON
Request:
{ "user_id": "string", "priority": "high/low", "content": { "title": "...", "body": "..." }, "channels": ["push", "sms"] }
Idempotency: Supports the X-Idempotency-Key header to prevent duplicate sends on client retries.
Resilience & Reliability:
Circuit Breaker: If Twilio returns 5xx errors consistently, the SMS Worker trips the circuit and potentially routes to a backup provider.
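A circuit breaker of the kind described above might look like the following sketch. The class, the thresholds, and the `send_sms` wrapper are hypothetical, and `primary` and `backup` stand in for provider clients. Catching every exception is a simplification: a real worker would trip only on 5xx responses and timeouts.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial

def send_sms(message, primary, backup, breaker):
    """Try the primary provider unless the circuit is open; otherwise use the backup."""
    if breaker.allow_request():
        try:
            result = primary(message)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return backup(message)
```

While the circuit is open the primary provider is not called at all, which gives it time to recover and keeps worker threads from blocking on timeouts.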
Storage
Access Pattern:
Heavily Read-optimized for user preferences and device tokens (looked up for every notification).
Heavily Write-optimized for delivery logs.
Database Table Design:
UserDevices: user_id, device_token (composite PK, since a user can have several devices), platform (ios/android), last_active.
UserPreferences: user_id, channel_type (composite PK, one row per user and channel), enabled (bool).
Technical Selection:
PostgreSQL: Stores user preferences and device tokens. Relational integrity ensures users aren't messaged on deleted accounts.
Cassandra/ClickHouse: Used for delivery logs, to handle high-volume write throughput for auditing.
Distribution Logic: Sharded by user_id to ensure all data for a single user resides on the same partition.
Cache
Purpose & Justification: Reduce DB load for frequent lookups of device tokens and rate-limit counters.
Key-Value Schema:
Key: token:{user_id}, Value: List<DeviceTokens>, TTL: 24h (updated on app launch).
Key: rl:{user_id}:{channel}, Value: Count, TTL: 1h.
Technical Selection: Redis. High performance and supports complex data types like Sets for tokens.
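The rate-limit key above implies a fixed-window counter. The sketch below mimics that behavior in memory so it runs without a Redis server. The class name and limit are assumptions. Against Redis, the same logic is typically an INCR on the key followed by an EXPIRE when the counter is first created.

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for a per-user, per-channel counter with a TTL."""

    def __init__(self, limit, window_seconds=3600, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # key -> (count, expires_at)

    def allow(self, user_id, channel):
        key = f"rl:{user_id}:{channel}"
        now = self.clock()
        count, expires_at = self.counters.get(key, (0, now + self.window))
        if now >= expires_at:
            # TTL elapsed: start a new window.
            count, expires_at = 0, now + self.window
        count += 1
        self.counters[key] = (count, expires_at)
        return count <= self.limit
```

A fixed window allows short bursts at window boundaries. If that matters, a sliding window or token bucket is the usual refinement.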
Messaging
Purpose & Decoupling: Decouples the notification request from the slow process of network I/O with third-party vendors.
Event / Topic Schema:
Topics: notification.high_priority, notification.low_priority, notification.sms, notification.email.
Throughput & Partitioning: Partitioned by user_id to maintain message ordering per user (e.g., "Order Placed" must arrive before "Order Delivered").
Technical Selection: Kafka. Chosen for its high-throughput capabilities and durability.
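To make the routing concrete, the sketch below picks a topic from the priority and derives a partition from user_id. It is illustrative only: the `route` function and the CRC32 hash are assumptions. In practice a Kafka producer hashes the message key itself, so setting user_id as the key is normally enough.

```python
import zlib

PRIORITY_TOPICS = {
    "high": "notification.high_priority",
    "low": "notification.low_priority",
}

def route(user_id, priority, num_partitions):
    """Return (topic, partition); the same user always lands on the same partition."""
    topic = PRIORITY_TOPICS[priority]
    # Stable hash of the key, so ordering per user is preserved within a topic.
    partition = zlib.crc32(user_id.encode("utf-8")) % num_partitions
    return topic, partition
```

Ordering holds only within a single topic and partition, so two events for one user sent at different priorities can still overtake each other.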
Data Processing
Processing Model: Stream processing for real-time delivery tracking and aggregation.
Processing DAG:
Provider Webhook -> Kafka (Status Topic) -> Flink/Spark -> Log DB.Technical Selection: Flink (Optional for MVP, can be simple Python/Go workers). For MVP, we use Worker Services consuming from Kafka.
Infrastructure (Optional)
Observability:
Metrics: Monitor "Queue Depth" (crucial for detecting bottlenecks) and "Provider Error Rates".
Tracing: Jaeger/OpenTelemetry to trace a notification from API call to provider submission.
Wrap Up
Advanced Topics
Trade-offs: We chose At-least-once over Exactly-once. Exactly-once requires expensive two-phase commits or heavy distributed locking, which would kill throughput. We handle duplicates at the consumer level using idempotency keys.
Reliability: If a provider is down, messages stay in Kafka. We can add a "Dead Letter Queue" (DLQ) for messages that fail after 5 retries for manual inspection.
Bottleneck Analysis: The biggest bottleneck is usually the third-party provider's rate limits. We implement Client-side Throttling in our workers to stay within Twilio/FCM quotas.
Security: Device tokens and PII are encrypted at rest (AES-256) and in transit (TLS 1.3).
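The duplicate handling and dead-letter behavior described under Trade-offs and Reliability can be sketched together. This is a simplified, single-process illustration: `seen_keys` stands in for a shared store such as Redis, `dead_letters` for a DLQ topic, and the `idempotency_key` field name is an assumption.

```python
MAX_RETRIES = 5  # retries after the first attempt, as in the text above

def process(message, send, seen_keys, dead_letters):
    """Deliver once per idempotency key; park the message once retries are exhausted."""
    key = message["idempotency_key"]
    if key in seen_keys:
        return "duplicate"  # already delivered by an earlier attempt
    for _attempt in range(MAX_RETRIES + 1):
        try:
            send(message)
        except Exception:
            continue  # a real worker would wait with backoff before retrying
        seen_keys.add(key)
        return "sent"
    dead_letters.append(message)
    return "dead_lettered"
```

A crash between the send and recording the key can still produce a duplicate, which is why the guarantee is at-least-once rather than exactly-once.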
Distinguishing Insights:
Smart Batching: For low-priority emails, workers can batch messages to SendGrid to reduce the number of API calls and improve throughput.
Adaptive Throttling: The system can dynamically lower the ingestion rate if it detects a global spike in provider latency.
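Smart batching reduces to grouping pending messages into chunks before each provider call. The helper below is a minimal sketch. The batch size is a placeholder, and the actual per-request limit should be taken from the provider's documentation.

```python
def batches(messages, max_batch_size):
    """Yield consecutive chunks of at most max_batch_size messages."""
    for start in range(0, len(messages), max_batch_size):
        yield messages[start:start + max_batch_size]

pending = ["m1", "m2", "m3", "m4", "m5"]
grouped = list(batches(pending, 2))
```

Each chunk then becomes one API call instead of one call per message, trading a little latency for fewer requests.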