The Question
Design: Scalable Multi-Channel Notification System
Design a high-throughput notification system capable of delivering 100M+ messages daily across Push (iOS/Android), Email, and SMS. The system must support message prioritization (e.g., transactional OTPs vs. marketing blasts), handle third-party provider failures gracefully, and ensure at-least-once delivery semantics. Discuss how you would handle global scalability, rate limiting, and delivery tracking.
Kafka
Redis
NoSQL
DynamoDB
gRPC
APNs
FCM
SendGrid
Twilio
Load Balancer
Microservices
Questions & Insights
Clarifying Questions
What is the expected scale of the system? (Assumption: 100 million notifications per day, with peak traffic of roughly 5,000-6,000 QPS).
What types of notifications are supported? (Assumption: Push via APNs/FCM, Email via SendGrid/SES, and SMS via Twilio/Nexmo).
Is delivery guaranteed? (Assumption: At-least-once delivery is required; exactly-once is preferred but not strictly required for an MVP).
Are there different priority levels? (Assumption: Yes, high-priority for OTPs/Transactional and low-priority for Marketing/Promotions).
Do we need to track delivery status? (Assumption: Yes, basic tracking of "Sent" and "Failed" status is required for the MVP).
Thinking Process
Decoupling and Asynchronicity: Use a message queue to isolate the submission of a notification from the actual delivery process, ensuring the API remains responsive.
Third-Party Resilience: Third-party providers (Twilio, SendGrid) are the most likely points of failure; the system must implement retries, circuit breakers, and provider-switching logic.
Priority-Based Processing: Use separate queues for different priority levels to prevent a large marketing blast from delaying critical OTPs.
How to ensure scalability? Horizontal scaling of stateless workers and sharding the notification log database.
How to handle rate limiting? Implement a global rate-limiter to prevent spamming users and to respect third-party provider limits.
Bonus Points
Idempotency Keys: Implement a unique hash (user_id + content_hash + timestamp_bucket) to prevent duplicate notifications during retries or network glitches.
Smart Provider Routing: Dynamically route traffic to different providers based on real-time delivery success rates and cost-optimization algorithms.
Global Delivery Optimization: Use Geo-DNS and regional worker deployments to minimize latency for local push/SMS gateways.
Feedback Loop Integration: Consume webhooks from providers to update delivery status and automatically blacklist bounced email addresses or invalid phone numbers.
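The idempotency key from the first bullet can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name, the SHA-256 choice, and the 5-minute bucket width are all assumptions.

```python
import hashlib

# Assumed dedup window: sends of the same content to the same user inside
# one 5-minute bucket map to the same key.
BUCKET_SECONDS = 300


def idempotency_key(user_id: str, content: str, timestamp: float,
                    bucket_seconds: int = BUCKET_SECONDS) -> str:
    """Hash user_id + content_hash + timestamp_bucket into one key."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    timestamp_bucket = int(timestamp // bucket_seconds)
    raw = f"{user_id}:{content_hash}:{timestamp_bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

One caveat of bucketing: a retry that crosses a bucket boundary produces a different key, so the bucket should be wider than the longest expected retry window.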
Design Breakdown
Functional Requirements
Core Use Cases:
Users/Services can send Push, Email, and SMS notifications.
Support for notification templates (dynamic placeholders).
Ability to check notification status.
Scope Control:
In-Scope: Delivery, Template management, Basic rate limiting, Multi-provider support.
Out-of-Scope: User preference center (opt-in/out UI), Advanced analytics/dashboards, Complex scheduling (e.g., "send in 3 days").
Non-Functional Requirements
Scale: Support up to 100M notifications/day.
Latency: High-priority notifications (OTPs) delivered in < 10 seconds (P99).
Availability & Reliability: 99.99% availability; "At-least-once" delivery guarantee.
Consistency: Eventual consistency for delivery status logs.
Fault Tolerance: Automatic retries with exponential backoff and circuit breaking for failing providers.
Security: Authentication for internal services, PII encryption for phone numbers and emails.
Estimation
Traffic Estimation:
100M notifications / 86,400 seconds ≈ 1,150 Average QPS.
Peak QPS (5x) ≈ 5,750 QPS.
Storage Estimation:
Metadata per notification ≈ 1KB.
100M * 1KB = 100GB per day.
For 30 days of searchable history: 3TB.
Bandwidth Estimation:
Incoming: 1,150 QPS * 1KB ≈ 1.15 MB/s.
Outgoing (to third parties): Similar, depending on template rendering size.
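The arithmetic above can be reproduced with a short script. It is a sketch using decimal units (1KB = 1,000 bytes); the function and parameter names are illustrative. The exact average comes out near 1,157 QPS, which the figures above round to 1,150.

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_notifications: int = 100_000_000,
             record_bytes: int = 1_000,
             peak_factor: int = 5,
             retention_days: int = 30) -> dict:
    """Back-of-envelope traffic, storage, and bandwidth figures."""
    avg_qps = daily_notifications / SECONDS_PER_DAY
    daily_bytes = daily_notifications * record_bytes
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb_per_day": daily_bytes / 1e9,
        "storage_tb_retained": daily_bytes * retention_days / 1e12,
        "ingress_mb_per_s": avg_qps * record_bytes / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```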
Blueprint
Concise Summary: A microservices-based architecture where notifications are validated, prioritized, and queued for asynchronous delivery by specialized worker nodes.
Major Components:
API Gateway: Entry point for internal services to request notifications.
Notification Service: Handles validation, template rendering, and initial persistence.
Priority Queue: Decouples the request from the delivery and manages priority.
Workers: Stateless services that interface with third-party providers.
Metadata Store: Persists notification status and logs.
Simplicity Audit: This design avoids complex "stream processing" or "distributed coordination" tools. It relies on standard message queues and stateless workers, which is arguably the simplest way to reach 100M/day scale.
Architecture Decision Rationale:
Why this architecture?: Messaging queues provide natural load-leveling and fault tolerance.
Functional Satisfaction: Supports all three channels and tracking.
Non-functional Satisfaction: High throughput is handled by Kafka; reliability is handled by persistent queues and retries.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Notification Service: Stateless, scaled based on CPU and Request Count.
Workers: Scaled based on "Queue Depth." If the SMS queue grows, SMS worker instances increase.
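Queue-depth scaling can be expressed as a simple sizing rule. The sketch below is one possible policy, not something the design specifies: the per-worker throughput, the target drain time, and the min/max bounds are all assumed inputs.

```python
import math


def desired_workers(queue_depth: int,
                    per_worker_rate: float,
                    target_drain_seconds: float,
                    min_workers: int = 1,
                    max_workers: int = 100) -> int:
    """Workers needed to drain the backlog within the target time.

    per_worker_rate is messages per second one worker can deliver (assumed
    to be measured elsewhere); the result is clamped to [min, max].
    """
    needed = math.ceil(queue_depth / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 60,000 SMS messages with workers that each send 50 messages per second and a 60-second drain target yields 20 workers.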
API Schema Design:
Endpoint: POST /v1/notifications
Protocol: gRPC for low-latency internal communication.
Request: { user_id, channel, template_id, params, priority }
Response: { notification_id, status: "Accepted" }
Idempotency: Requests must include a request_id to prevent double-sending.
Resilience & Reliability:
Retries: Workers use exponential backoff for 5xx errors from providers.
Circuit Breaker: If Twilio returns 503s consistently, the SMS worker trips the breaker and routes traffic to Nexmo.
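A circuit breaker of the kind described can be sketched in a few lines. This is a minimal single-process illustration; the class name, the failure threshold, and the reset timeout are assumptions, and a production worker would likely track failures per provider over a time window.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 5,
                 reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the timeout has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the timeout.
            self._opened_at = self._clock()
```

An SMS worker would call allow_request() before contacting the primary provider and route to the secondary provider whenever it returns False.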
Storage
Access Pattern:
High write volume (100M/day).
Lookups by notification_id or user_id.
TTL-based data (logs usually only needed for 30-90 days).
Database Table Design:
Table: notifications
notification_id (UUID, Primary Key)
user_id (Indexed)
channel (Push/Email/SMS)
status (Pending/Sent/Failed/Delivered)
content_payload (JSON)
created_at, updated_at
Technical Selection: NoSQL (e.g., DynamoDB or Cassandra).
Rationale: Linear scalability for high write throughput and native TTL support for log rotation.
Distribution Logic: Sharded by user_id to ensure all notifications for a single user are colocated, allowing efficient "recent notifications" queries.
Cache
Purpose & Justification: Caching frequently used notification templates and user opt-out preferences to reduce DB load.
Key-Value Schema:
template:{template_id} -> Template string with placeholders.
user_pref:{user_id} -> Bitmask of allowed channels.
Technical Selection: Redis.
Rationale: Sub-millisecond latency for template lookups during high-volume rendering.
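The two cache entries can be illustrated with a small sketch. A plain dict stands in for Redis here, and the bit assignments, the `$name` placeholder syntax (Python's string.Template), and the sample keys are all assumptions made for the example.

```python
from enum import IntFlag
from string import Template


class Channel(IntFlag):
    # Assumed bit layout for the user_pref bitmask.
    PUSH = 1
    EMAIL = 2
    SMS = 4


def is_allowed(pref_mask: int, channel: Channel) -> bool:
    """True if the channel's bit is set in the user's preference mask."""
    return bool(pref_mask & channel)


def render(template_text: str, params: dict) -> str:
    """Fill $placeholders; raises KeyError if a parameter is missing."""
    return Template(template_text).substitute(params)


# Stand-in for Redis: same key shapes as the schema above.
cache = {
    "template:otp_v1": "Your code is $code",
    "user_pref:42": int(Channel.PUSH | Channel.SMS),
}
```

With these entries, user 42 may receive SMS and Push but not Email, and rendering otp_v1 with code 123456 gives "Your code is 123456".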
Messaging
Purpose & Decoupling: Provides the buffer between the API and the slow third-party providers.
Event / Topic Schema:
Topics: notify.push.high, notify.push.low, notify.email.high, etc.
This allows dedicated workers to consume critical OTPs without being blocked by bulk emails.
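The topic naming scheme maps directly onto a small routing helper. This is a sketch: the helper names are hypothetical, and the set of two priority levels follows the high/low split assumed earlier.

```python
CHANNELS = ("push", "email", "sms")
PRIORITIES = ("high", "low")


def topic_for(channel: str, priority: str) -> str:
    """Return the topic name notify.<channel>.<priority> for a notification."""
    channel, priority = channel.lower(), priority.lower()
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"notify.{channel}.{priority}"


def all_topics() -> list:
    """Every channel/priority combination, e.g. for provisioning topics."""
    return [f"notify.{c}.{p}" for c in CHANNELS for p in PRIORITIES]
```

Validating the channel and priority at publish time keeps malformed requests from creating stray topics.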
Failure Handling:
Dead Letter Queue (DLQ): Messages that fail after N retries are moved to a DLQ for manual inspection or secondary processing.
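The retry-then-DLQ flow can be sketched as below. The function names and the ProviderError type are hypothetical, a plain list stands in for the DLQ topic, and the random jitter on each delay is an addition not mentioned above (it is a common way to avoid synchronized retries).

```python
import random
import time


class ProviderError(Exception):
    """Hypothetical error for a retryable (5xx-style) provider failure."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff ceiling: base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)


def deliver_with_retries(message, send, dead_letters: list,
                         max_attempts: int = 5, sleep=time.sleep) -> bool:
    """Try send(message) up to max_attempts times, then dead-letter it."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except ProviderError:
            if attempt < max_attempts - 1:
                # Full jitter: wait a random time up to the backoff ceiling.
                sleep(random.uniform(0, backoff_delay(attempt)))
    dead_letters.append(message)
    return False
```

Here N corresponds to max_attempts; only errors the worker classifies as retryable should reach this loop, since retrying an invalid phone number wastes provider quota.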
Technical Selection: Kafka.
Rationale: High throughput, partitions allow parallel processing, and data retention allows replaying events if the consumer service fails.
Infrastructure (Optional)
Observability:
Metrics: Track "Delivery Latency" (time from API call to Provider Success) and "Failure Rate per Provider."
Tracing: Use Jaeger/Zipkin to trace a notification from the initial API request through the queue to the worker.
Wrap Up
Advanced Topics
Trade-offs:
At-least-once vs. Exactly-once: We chose at-least-once delivery. In distributed systems, exactly-once is extremely hard to achieve (requires two-phase commits across queues and DBs). We mitigate duplicates using idempotency keys at the worker level.
Reliability:
System handles provider outages by maintaining a pool of providers for each channel. If SendGrid is down, the system fails over to Amazon SES.
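Failover across a provider pool can be sketched as an ordered walk through the pool. The provider callables here are hypothetical stubs rather than real SendGrid or SES SDK calls, and ProviderUnavailable is an assumed exception type.

```python
from typing import Callable, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error raised when a provider cannot accept the message."""


def send_with_failover(
    message: str,
    providers: Sequence[Tuple[str, Callable[[str], None]]],
) -> str:
    """Try each (name, send) pair in order; return the name that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    # Every provider in the pool failed; surface all the reasons.
    raise ProviderUnavailable("all providers failed: " + "; ".join(errors))
```

The order of the pool encodes the preference (for example, primary first); smarter routing based on success rates or cost would replace the fixed ordering.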
Bottleneck Analysis:
Third-party Throttling: The system must respect the "Send Quota" of providers. The workers use a distributed rate limiter (Redis-based) to ensure we don't get banned by Twilio or APNs.
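The throttling logic can be illustrated with a token bucket. Note that this sketch is an in-memory, single-process version, not the Redis-based limiter itself: a distributed variant would keep the same state in Redis and update it atomically. The class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Allows `rate` sends per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A worker would hold one bucket per provider (sized to that provider's send quota) and requeue or delay a message when try_acquire() returns False.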
Distinguishing Insights:
Template Rendering: Rendering templates in the Notification Service (before the queue) means the queue carries only the final rendered string and workers need no template access, while rendering in the worker allows for "last-minute" personalization. For the MVP, rendering before the queue is safer for data consistency.