The Question
Scalable Notification System Design
Design a high-volume notification system capable of delivering 100M+ messages daily across Push, Email, and SMS. The system must handle high-priority transactional alerts (like OTPs) alongside lower-priority marketing campaigns, respect user opt-out preferences, and maintain high availability even during third-party provider outages. Address trade-offs between delivery latency, cost, and reliability.
Kafka
Redis
PostgreSQL
DynamoDB
Twilio
SendGrid
FCM
APNS
gRPC
Kubernetes
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected volume of notifications per day, and what is the peak QPS?
Delivery Guarantees: Is "at-least-once" delivery sufficient, or do we require "exactly-once" (which is significantly harder with 3rd party providers)?
Latency SLAs: Are there strict requirements for delivery time (e.g., OTPs must arrive within 10 seconds)?
User Preferences: Should the system handle opt-in/opt-out settings and quiet hours?
Prioritization: Do we need to support different priority levels (e.g., transactional OTPs vs. marketing promos)?
Assumptions for MVP:
Scale: 100 million notifications per day.
Latency: Near real-time (< 5 seconds for 99% of notifications).
Channels: Push (FCM/APNS), Email (SendGrid/SES), SMS (Twilio).
Reliability: At-least-once delivery guarantee.
Prioritization: High-priority (OTP) vs. Low-priority (Marketing).
Thinking Process
How do we handle massive spikes without crashing downstream providers? Use a message queue to buffer requests and implement rate limiting per channel.
How do we ensure notifications aren't lost if a worker fails? Implement persistent queuing (Kafka/SQS) and a retry mechanism with exponential backoff.
How do we manage user channel preferences and templates? Centralize user settings in a high-read-optimized database and use a template engine to separate content from logic.
How do we track the status of a notification across various vendors? Implement a state machine and a feedback loop (webhooks) to track "Sent -> Delivered -> Opened" status.
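The status tracking described above can be sketched as a small state machine. This is a minimal illustration, not a prescribed design: the `ALLOWED` transition table and the `apply_event` helper are assumptions, and the `OPENED` state comes from the "Sent -> Delivered -> Opened" flow rather than from the status column defined later.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    SENT = "SENT"
    DELIVERED = "DELIVERED"
    OPENED = "OPENED"
    FAILED = "FAILED"


# Assumed forward-only transitions; any other reported change is ignored.
ALLOWED = {
    Status.PENDING: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: {Status.OPENED},
    Status.OPENED: set(),
    Status.FAILED: set(),
}


def apply_event(current: Status, reported: Status) -> Status:
    """Return the new status, ignoring stale or out-of-order webhook events."""
    if reported in ALLOWED[current]:
        return reported
    return current
```

Because provider webhooks can arrive late or twice, rejecting backward transitions keeps a record from regressing, for example from DELIVERED back to SENT.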
Bonus Points
Vendor Agnostic Routing: Implement a "Provider Registry" to dynamically switch between providers (e.g., Twilio to Plivo) if one suffers an outage or becomes more expensive.
Smart Throttling & Batching: Automatically batch non-urgent marketing emails to reduce connection overhead and respect provider rate limits.
Idempotency Keys: Use a unique hash of (UserID + NotificationType + Content) to prevent accidental double-sending within a short window.
Deliverability Analytics: Real-time monitoring of bounce rates and delivery failures to proactively detect if IPs are blacklisted by ISPs.
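The idempotency-key idea above can be sketched as follows. The hash input matches the (UserID + NotificationType + Content) tuple; the `DedupWindow` class is a hypothetical in-memory stand-in for a shared store (for example a Redis key with a TTL), and the separator byte is an assumption added so that field boundaries stay unambiguous.

```python
import hashlib
import time
from typing import Dict, Optional


def idempotency_key(user_id: str, notification_type: str, content: str) -> str:
    """SHA-256 over the three fields, joined with a separator byte."""
    raw = "\x1f".join([user_id, notification_type, content])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupWindow:
    """Remembers keys for a fixed window; in-memory stand-in for a shared store."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._seen: Dict[str, float] = {}  # key -> expiry timestamp

    def first_time(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if the key has not been seen within the window."""
        now = time.monotonic() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the window: skip sending
        self._seen[key] = now + self.window_seconds
        return True
```

A worker would call `first_time(idempotency_key(...))` before dispatching and drop the message when it returns False.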
Design Breakdown
Functional Requirements
Core Use Cases:
Trigger notifications via internal API.
Support Push, Email, and SMS channels.
Manage and inject variables into message templates.
Respect user opt-out preferences and channel settings.
Scope Control:
In-scope: API for ingestion, queuing, worker delivery, and basic status tracking.
Out-of-scope: In-app notification center UI, complex AI-based scheduling, rich media editing tools.
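Two of the core use cases, template variable injection and opt-out checks, can be sketched with the standard library. The `TEMPLATES` dictionary, the template text, and the `build_message` helper are illustrative assumptions; a real system would load templates and preferences from the stores described later.

```python
from string import Template
from typing import Dict, Optional, Set

# Hypothetical template registry keyed by notification type.
TEMPLATES: Dict[str, Template] = {
    "ORDER_SHIPPED": Template("Hi $name, your order $order_id has shipped."),
}


def render(template_id: str, variables: Dict[str, str]) -> str:
    """Inject variables; raises KeyError if a placeholder has no value."""
    return TEMPLATES[template_id].substitute(variables)


def build_message(
    template_id: str,
    variables: Dict[str, str],
    channel: str,
    enabled_channels: Set[str],
) -> Optional[str]:
    """Return the rendered body, or None if the user opted out of the channel."""
    if channel not in enabled_channels:
        return None
    return render(template_id, variables)
```

Failing loudly on a missing variable is a deliberate choice here: a half-rendered message reaching a user is usually worse than a rejected request.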
Non-Functional Requirements
Scale: Support roughly 1.2k average QPS with peaks up to about 12k QPS (10x the average).
Latency: Ingestion latency < 100ms; end-to-end delivery < 5s for high priority.
Availability & Reliability: 99.99% availability for the ingestion API; at-least-once delivery.
Consistency: Eventual consistency for user preference updates is acceptable.
Security & Privacy: Support PII encryption (e.g., phone numbers/emails) at rest and TLS in transit.
Estimation
Traffic: 100M notifications / 86,400 s ≈ 1,200 avg QPS. Peak (10x) ≈ 12,000 QPS.
Storage:
Metadata: 1KB per notification. 100M * 1KB = 100GB/day.
Retention: 30 days of logs = 3TB.
Bandwidth: 1,200 QPS * 1KB ≈ 1.2MB/s (negligible for modern infra).
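The estimates above can be reproduced with a few lines of arithmetic. The variable names are illustrative, and 1KB is treated as 1,000 bytes for simplicity. The exact average is about 1,157 QPS, which the figures above round up to 1,200.

```python
SECONDS_PER_DAY = 86_400
NOTIFICATIONS_PER_DAY = 100_000_000
RECORD_BYTES = 1_000      # 1KB of metadata per notification (decimal KB)
PEAK_FACTOR = 10          # assumed peak-to-average ratio
RETENTION_DAYS = 30

avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY          # about 1,157
peak_qps = avg_qps * PEAK_FACTOR                           # about 11,574
daily_storage_gb = NOTIFICATIONS_PER_DAY * RECORD_BYTES / 1e9
retained_tb = daily_storage_gb * RETENTION_DAYS / 1000
bandwidth_mb_s = avg_qps * RECORD_BYTES / 1e6              # about 1.16

print(f"avg QPS ~ {avg_qps:.0f}, peak QPS ~ {peak_qps:.0f}")
print(f"storage ~ {daily_storage_gb:.0f} GB/day, {retained_tb:.0f} TB retained")
print(f"bandwidth ~ {bandwidth_mb_s:.2f} MB/s")
```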
Blueprint
Concise Summary: A microservices-based architecture where an Ingestion API validates requests and pushes them to a distributed message queue (Kafka). Specialized workers consume from the queue, fetch user preferences/templates, and dispatch to 3rd party providers.
Major Components:
Notification API: Validates incoming requests and generates unique Notification IDs.
Kafka (Messaging Layer): Decouples ingestion from delivery and provides persistence for retries.
Notification Workers: Multi-threaded consumers that handle the heavy lifting of calling external APIs.
User & Template Store: Provides user preferences and pre-defined message templates.
Simplicity Audit: This design avoids complex distributed transactions by using an event-driven approach with a durable queue, satisfying the MVP's reliability needs without over-engineering.
Architecture Decision Rationale:
Why this architecture? Message queues are a standard choice for high-volume notification systems because they absorb bursty traffic and provide fault tolerance.
Functional Satisfaction: Covers all channels and preferences via a modular worker/service approach.
Non-functional Satisfaction: Scalability is achieved by partitioning Kafka and horizontal scaling of workers.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Internal services call the API via regional Load Balancers.
Security & Perimeter:
API Gateway: Handles AuthN/AuthZ for internal services.
Rate Limiting: Applied at the service level to prevent a single internal "rogue" service from exhausting the notification quota.
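One common way to implement the per-service rate limit described above is a token bucket per caller. The sketch below is an assumption about how that might look, not the design's stated mechanism; `TokenBucket` and `PerServiceLimiter` are hypothetical names, and time is passed in explicitly so the logic is easy to test.

```python
from typing import Dict


class TokenBucket:
    """Refills at `rate_per_sec` tokens per second, up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerServiceLimiter:
    """Keeps one bucket per calling service so one caller cannot starve others."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, service: str, now: float) -> bool:
        bucket = self._buckets.get(service)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self._buckets[service] = bucket
        return bucket.allow(now)
```

In a multi-instance deployment the bucket state would need to live in a shared store rather than in process memory.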
Service
Topology & Scaling: Stateless microservices deployed in Kubernetes (K8s). Scale based on CPU and Kafka Consumer Lag.
API Schema Design:
POST /v1/notifications
Protocol: REST/gRPC
Request: { "user_id": "u1", "type": "ORDER_SHIPPED", "priority": "high", "payload": {...} }
Idempotency: idempotency_key header required.
Resilience & Reliability:
Retries: Workers use exponential backoff for 5xx errors from providers.
Circuit Breaker: If Twilio is down, open the circuit to stop making calls and route to Dead Letter Queue (DLQ).
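The retry and circuit-breaker behaviour above can be sketched as below. The thresholds, the jitter strategy ("full jitter"), and the class and function names are assumptions chosen for illustration; the source only specifies exponential backoff on 5xx errors and opening the circuit when a provider is down.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Opens after consecutive failures; allows trial calls after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # circuit closed
        # Once the cool-down has passed, let calls through again (half-open).
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial call
```

When `allow_request` returns False, the worker would skip the provider call and either fail over to another provider or send the message to the DLQ.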
Storage
Access Pattern:
User Preferences: High-read, low-write.
Notification Logs: High-write, sequential.
Database Table Design:
User Preferences: user_id (PK), channel, enabled, quiet_hour_start, quiet_hour_end.
Notification Status: notification_id (PK), user_id, channel, status (PENDING, SENT, DELIVERED, FAILED), retry_count.
Technical Selection:
PostgreSQL: For user preferences and templates (Relational consistency needed).
Cassandra/DynamoDB: For notification status/logs to handle high-write throughput and TTL-based expiry.
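The quiet_hour_start and quiet_hour_end columns in the User Preferences table imply a time-window check at delivery time. A minimal sketch follows; the function name is hypothetical, times are assumed to be in the user's local time zone, and treating equal bounds as "no quiet window" is an assumption.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the quiet window [start, end)."""
    if start == end:
        return False  # assumption: equal bounds mean no quiet window
    if start < end:
        return start <= now < end  # same-day window, e.g. 13:00 to 14:00
    return now >= start or now < end  # window wraps past midnight, e.g. 22:00 to 07:00
```

The wrap-around branch is the part that is easy to get wrong, since a typical window such as 22:00 to 07:00 crosses midnight.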
Distribution Logic: Shard Notification Status DB by user_id to allow efficient retrieval of a user's notification history.
Cache
Purpose & Justification: Reduce DB load for frequently accessed user preferences and message templates.
Key-Value Schema:
Key: pref:{user_id}, Value: JSON (preferences), TTL: 24h.
Key: tmpl:{template_id}, Value: String (body), TTL: 1h.
Technical Selection: Redis. Provides sub-millisecond latency and simple eviction policies (LRU).
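The preference lookup can follow a cache-aside pattern using the pref:{user_id} key and 24h TTL from the schema above. In this sketch, `TTLCache` is a hypothetical in-memory stand-in for Redis, and `load_from_db` is an assumed callback representing the PostgreSQL read.

```python
import json
from typing import Callable, Dict, Optional, Tuple

PREF_TTL_SECONDS = 24 * 3600  # matches the 24h TTL in the key-value schema


class TTLCache:
    """In-memory stand-in for Redis GET / SET with expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl: float, now: float) -> None:
        self._data[key] = (value, now + ttl)


def get_preferences(
    user_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], dict],
    now: float,
) -> dict:
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"pref:{user_id}"
    cached = cache.get(key, now)
    if cached is not None:
        return json.loads(cached)
    prefs = load_from_db(user_id)
    cache.set(key, json.dumps(prefs), PREF_TTL_SECONDS, now)
    return prefs
```

With a 24h TTL, an opt-out could take up to a day to reach workers unless the write path also deletes the cached key, so explicit invalidation on preference updates is worth considering.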
Messaging
Purpose & Decoupling: Kafka acts as a buffer. Ingestion Service doesn't wait for 3rd party providers.
Throughput & Partitioning:
Use user_id as the partition key to preserve per-user ordering within a topic. Kafka orders messages only within a partition, so ordering across the two priority topics is not guaranteed.
Separate topics for high_priority and low_priority to prevent marketing blasts from delaying OTPs.
Failure Handling: Messages that still fail after N retries are moved to a Dead Letter Queue (DLQ) for manual inspection or secondary retry logic.
Technical Selection: Kafka. High throughput and high availability.
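Topic selection by priority and key-based partitioning can be illustrated as follows. This only demonstrates the idea: Kafka clients hash the message key themselves when a key is supplied, so `partition_for` is a hypothetical stand-in using SHA-256, and the partition count is an assumption.

```python
import hashlib
from typing import Tuple

# Topic names taken from the design; the mapping itself is illustrative.
TOPIC_BY_PRIORITY = {"high": "high_priority", "low": "low_priority"}


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a stable partition so their messages stay ordered."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(user_id: str, priority: str, num_partitions: int) -> Tuple[str, int]:
    """Return the (topic, partition) a notification would be written to."""
    return TOPIC_BY_PRIORITY[priority], partition_for(user_id, num_partitions)
```

Because the mapping depends on the partition count, adding partitions later moves some users to a different partition, which breaks ordering for messages already in flight.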
Infrastructure (Optional)
Observability:
Metrics: Track delivery_latency, error_rate_by_provider, and queue_depth.
Distributed Tracing: Trace a notification from the internal trigger through the queue to the final provider call.
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for delivery status. A user might see a "Sent" status a few seconds before the provider actually confirms delivery.
Reliability: We use at-least-once delivery. A user may occasionally receive a duplicate notification if a worker crashes after sending but before committing its offset in Kafka.
Bottleneck Analysis: The main bottleneck is often the 3rd Party Provider rate limits. We solve this by implementing per-provider rate limiters in our workers.
Distinguishing Insights:
Smart Selection: If a user is active on the app (checked via a Presence Service), we prioritize Push over SMS to save costs.
Feedback Loop: Use webhooks from SendGrid/Twilio to update the Notification Status DB. This allows us to prove delivery to the business side.
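The Smart Selection rule can be sketched as a small routing function. The function name and signature are hypothetical, and the presence flag is assumed to come from the Presence Service mentioned above.

```python
from typing import Optional, Set


def choose_channel(
    requested: str,
    is_active_in_app: bool,
    enabled_channels: Set[str],
) -> Optional[str]:
    """Prefer Push over SMS for active users; return None if nothing is allowed."""
    if requested == "sms" and is_active_in_app and "push" in enabled_channels:
        return "push"  # cheaper channel, and the user is likely to see it
    if requested in enabled_channels:
        return requested
    return None  # user opted out of the requested channel
```

For OTPs it may be safer to keep SMS as a fallback if the push is not confirmed delivered within a short timeout; that policy is left out of this sketch.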