The Question
Design a Scalable Multi-Channel Notification Engine
Design a highly reliable notification system capable of sending push notifications, emails, and SMS messages to millions of users globally. The system must handle high-volume bursts while ensuring low latency and fault tolerance against third-party provider outages.
Kafka
PostgreSQL
Redis
Delivery Workers
Third-Party APIs
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected daily active users (DAU) and the peak notifications per second (QPS)?
Delivery Guarantees: Is "At-Least-Once" delivery acceptable, or do we strictly require "Exactly-Once"?
Notification Types: Do we need to support priority levels (e.g., OTP vs. Marketing) and scheduling (send at a specific time)?
Integration: Will we use third-party providers (Twilio, SendGrid, FCM/APNS) or internal gateways?
User Management: Should the system handle user preferences (opt-out, quiet hours) and device token management?
Assumptions for MVP:
Scale: 10M DAU, ~100M notifications per day (average ~1,150/sec, peak ~5,750/sec).
Delivery: At-least-once delivery is sufficient.
Latency: High-priority notifications are delivered in under 10 seconds.
Providers: Use Twilio (SMS), SendGrid (Email), and FCM/APNS (Push).
Thinking Process
Decoupling and Buffering: How do we prevent a surge in requests from crashing the system? Answer: Use distributed message queues to decouple the ingestion API from delivery workers.
Provider Resilience: How do we handle third-party provider downtime or rate limits? Answer: Implement dedicated worker pools per channel with exponential backoff retries and circuit breakers.
Data Integrity: Where do we store device tokens and user preferences to ensure we don't spam users? Answer: A high-performance metadata store (PostgreSQL) with a caching layer (Redis) for frequent lookups.
Abstracting the Delivery Logic: How do we easily add new channels or providers? Answer: Use the Strategy Pattern in workers to wrap provider-specific SDKs.
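The Strategy Pattern mentioned above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `Notification`, `ChannelSender`, `EmailSender`, `SmsSender`, `SENDERS`, and `deliver` are hypothetical, and the `send` bodies only print instead of calling a real provider SDK.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Notification:
    recipient: str  # email address, phone number, or device token
    body: str


class ChannelSender(ABC):
    """Strategy interface: one implementation per channel or provider."""

    @abstractmethod
    def send(self, notification: Notification) -> bool:
        """Return True if the provider accepted the message."""


class EmailSender(ChannelSender):
    def send(self, notification: Notification) -> bool:
        # Placeholder: a real worker would call the email provider's SDK here.
        print(f"email -> {notification.recipient}: {notification.body}")
        return True


class SmsSender(ChannelSender):
    def send(self, notification: Notification) -> bool:
        # Placeholder: a real worker would call the SMS provider's SDK here.
        print(f"sms -> {notification.recipient}: {notification.body}")
        return True


# Registry of strategies; adding a channel means adding one entry here.
SENDERS: dict[str, ChannelSender] = {
    "email": EmailSender(),
    "sms": SmsSender(),
}


def deliver(channel: str, notification: Notification) -> bool:
    """Pick the strategy for the channel and delegate to it."""
    sender = SENDERS.get(channel)
    if sender is None:
        raise ValueError(f"unsupported channel: {channel}")
    return sender.send(notification)
```

With this shape, the worker loop depends only on `ChannelSender`, so swapping a provider should not touch the queue-consumption code.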
Bonus Points
Idempotency Keys: Use a unique notification_id or deduplication_hash (content + recipient + timestamp) to prevent duplicate delivery during worker retries.
Smart Rate Limiting: Implement per-user and per-channel rate limiting to avoid getting banned by providers or annoying users.
Sharded Metadata Store: Use horizontal partitioning for the DeviceTokens table by user_id to support scaling to billions of rows.
Dead Letter Queues (DLQ): Divert messages that fail after maximum retries to a DLQ for manual inspection or secondary fallback logic.
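As a rough sketch of the idempotency idea, the hash below combines content, recipient, and timestamp, and a small in-memory set stands in for the shared store. The function and class names are hypothetical, SHA-256 is an assumed choice of hash, and the timestamp is assumed to be the original trigger time so that retries of the same notification produce the same key.

```python
import hashlib


def deduplication_hash(content: str, recipient: str, timestamp: str) -> str:
    """Derive a stable key from the fields named in the text.

    A separator that cannot appear in normal text avoids collisions such as
    ("ab", "c") versus ("a", "bc").
    """
    payload = "\x1f".join([content, recipient, timestamp]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Deduplicator:
    """In-memory stand-in for a shared store of already-delivered keys."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_time(self, key: str) -> bool:
        """Return True only the first time a key is observed."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a multi-worker deployment the set would need to live in a shared store; Redis `SET` with the `NX` and `EX` options is one common way to get an atomic "first writer wins" check with expiry.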
Design Breakdown
Functional Requirements
Send notifications via SMS, Email, and Push.
Support notification templates (placeholders for dynamic data).
Manage user device tokens and contact info.
Honor user preferences (opt-in/opt-out for specific channels).
Provide a unified API for internal services to trigger notifications.
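To make the template requirement concrete, here is a small sketch using Python's standard `string.Template`. The `$name` placeholder syntax and the `render_template` helper are assumptions for illustration; the design does not specify a template format.

```python
from string import Template


def render_template(content_body: str, metadata: dict[str, str]) -> str:
    """Fill placeholders in a template body with per-request metadata.

    substitute() raises KeyError when a placeholder has no value, which lets
    the API reject an incomplete request before it is queued.
    """
    return Template(content_body).substitute(metadata)


message = render_template(
    "Hi $name, your verification code is $code.",
    {"name": "Ada", "code": "1234"},
)
```

Failing on a missing placeholder, rather than sending a message with a raw `$code` in it, is a deliberate choice in this sketch.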
Non-Functional Requirements
High Availability: 99.9% uptime for the ingestion API.
Scalability: Handle sudden spikes (e.g., breaking news or marketing campaigns).
Low Latency: Minimal delay between trigger and provider handoff.
Reliability: No notification loss after acceptance by the API.
Estimation
Requests: 100M notifications/day ≈ 1,150 requests/sec average.
Peak Load: 5x average ≈ 5,750 requests/sec.
Storage (Metadata): 100M logs/day * 200 bytes per log ≈ 20GB/day. 30-day retention ≈ 600GB.
Cache: 10M active users * 1KB preference data ≈ 10GB RAM (easily fits in a small Redis cluster).
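The arithmetic behind these estimates can be checked with a few lines. The sketch assumes decimal units (1KB = 1,000 bytes, 1GB = 10^9 bytes); the request rates in the text are rounded down slightly from the exact values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

notifications_per_day = 100_000_000
avg_rps = notifications_per_day / SECONDS_PER_DAY  # about 1,157 per second
peak_rps = 5 * avg_rps                             # about 5,787 per second

bytes_per_log = 200
log_gb_per_day = notifications_per_day * bytes_per_log / 1e9  # 20 GB per day
log_gb_30_days = log_gb_per_day * 30                           # 600 GB

active_users = 10_000_000
prefs_bytes_per_user = 1_000
cache_gb = active_users * prefs_bytes_per_user / 1e9           # 10 GB
```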
Blueprint
Concise Summary: A microservices-based architecture using a persistent message queue to buffer requests and specialized workers to interact with third-party delivery providers.
Major Components:
Notification API: Validates requests, fetches templates, and persists the initial record.
Metadata DB (PostgreSQL): Stores user contact info, device tokens, and templates.
Message Queue (Kafka): Acts as the backbone for asynchronous processing and load leveling.
Delivery Workers: Channel-specific consumers that handle provider-specific logic and retries.
Simplicity Audit: This design avoids complex "batching" services or advanced ML delivery-time optimization, focusing purely on reliable delivery and separation of concerns.
Architecture Decision Rationale:
Asynchronous Processing: Prevents slow third-party APIs from blocking the internal calling service.
Functional Satisfaction: Covers all 3 required channels via the "Strategy" pattern in workers.
Non-functional Satisfaction: Kafka provides durability and high throughput; scaling workers horizontally keeps delivery latency low under load.
High Level Architecture
Sub-system Deep Dive
Service
Topology: The Notification API is a stateless Golang/Java service deployed in an Auto Scaling Group (ASG) behind an ALB.
API Spec:
POST /v1/send: Takes user_id, template_id, priority, and metadata (JSON).
Protocol: Internal gRPC or REST.
Validation: Checks if user_id exists and if the template_id is valid before queuing.
Storage
Data Model:
Users: user_id, email, phone_number, settings_json.
DeviceTokens: id, user_id, token, platform (iOS/Android), last_seen.
Templates: id, name, content_body (with placeholders).
Database Logic: Use a B-Tree index on user_id for quick preference lookups. For logs, use a time-series approach or partition by month for easy archival.
Cache
Data Structure: Redis Hashes.
Key: user_prefs:{user_id}.
TTL: 24 hours, refreshed on update.
Logic: Store device tokens and opt-out flags here to minimize DB hits during the hot path of ingestion.
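The lookup path described here is essentially cache-aside with a TTL. The sketch below uses an in-memory dict as a stand-in for Redis so that it runs on its own; `PrefsCache`, `load_from_db`, and the injectable `clock` are hypothetical names, and only the key format and the 24-hour TTL come from the text.

```python
import time
from typing import Callable

PREFS_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated in the text


class PrefsCache:
    """Cache-aside lookup for user preferences (dict stands in for Redis)."""

    def __init__(
        self,
        load_from_db: Callable[[int], dict],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_from_db = load_from_db
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}  # key -> (expires_at, prefs)

    @staticmethod
    def key(user_id: int) -> str:
        return f"user_prefs:{user_id}"

    def get(self, user_id: int) -> dict:
        """Return cached prefs, falling back to the database on miss or expiry."""
        key = self.key(user_id)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        prefs = self._load_from_db(user_id)
        self._entries[key] = (now + PREFS_TTL_SECONDS, prefs)
        return prefs

    def update(self, user_id: int, prefs: dict) -> None:
        """Write-through on preference change, which also refreshes the TTL."""
        self._entries[self.key(user_id)] = (self._clock() + PREFS_TTL_SECONDS, prefs)
```

Refreshing the entry on update matters for opt-outs: a user who unsubscribes should not keep receiving messages for up to 24 hours because of a stale cache entry.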
Messaging
Topic Structure:
notification.email: High-throughput topic for emails.
notification.sms: High-priority topic for SMS (shorter retention).
notification.push: High-throughput topic for push notifications.
Delivery Guarantees: Producers use acks=all. Consumers commit offsets only after the third-party provider returns a 2xx or a terminal error.
Wrap Up
Advanced Topics
Monitoring:
Prometheus/Grafana: Monitor queue depth (backlog), provider error rates (429s, 5xx), and end-to-end latency (trigger to delivery).
Trade-offs:
Consistency vs Availability: We choose "Eventual Consistency" for delivery status logs to ensure the ingestion API remains highly available.
Bottlenecks:
Third-party rate limits are the primary bottleneck. We use distributed rate limiters (Redis-based) to ensure we don't exceed Twilio/SendGrid quotas.
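One common way to enforce such quotas is a token bucket. The sketch below is a single-process version meant only to show the algorithm; the design calls for a Redis-based distributed limiter, which would keep the token count and timestamp in shared state instead. `TokenBucket` and its parameters are hypothetical names, and the rate and capacity values in the usage lines are made up.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the call."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# One bucket per provider (or per user and channel); limits are illustrative.
sms_limiter = TokenBucket(rate_per_sec=100, capacity=200)
can_send = sms_limiter.allow()
```

When `allow()` returns False, a worker would typically delay the message and try again later, so a provider quota slows delivery without losing notifications.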
Failure Handling:
Retries: Workers use a 3-tier retry strategy (Immediate, 1-min, 5-min).
Circuit Breaker: If Twilio returns 503s repeatedly, stop sending SMS and fail fast to preserve resources.
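The two mechanisms above can be sketched together. This is an illustration under stated assumptions: the three tiers are read as three retries after the first attempt, the breaker's threshold and cool-down values are invented, and all names (`CircuitBreaker`, `CircuitOpenError`, `send_with_retries`) are hypothetical. A production worker would more likely re-enqueue with a delay than sleep in place.

```python
import time
from typing import Callable

RETRY_DELAYS_SECONDS = (0, 60, 300)  # immediate, 1 minute, 5 minutes


class CircuitOpenError(Exception):
    """Raised without calling the provider while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def send_with_retries(send: Callable[[], object],
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Try once, then retry per tier; re-raise the last error when exhausted."""
    last_error: Exception | None = None
    for delay in (None, *RETRY_DELAYS_SECONDS):  # None marks the first attempt
        if delay is not None:
            sleep(delay)
        try:
            return send()
        except CircuitOpenError:
            raise  # fail fast: retrying against an open circuit wastes resources
        except Exception as exc:
            last_error = exc
    assert last_error is not None
    raise last_error
```

The caller that catches the final exception is the natural place to route the message to a dead letter queue.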
Alternatives:
Alternative Messaging: AWS SQS could replace Kafka for lower operational overhead if the team is small, though Kafka offers better replayability.