The Question
Design

Scalable Multi-Channel Notification Engine

Design a highly available notification system capable of delivering messages across Push, SMS, and Email channels. The system should handle 10M+ daily notifications, provide abstraction over multiple third-party delivery providers, manage user channel preferences, and ensure reliable delivery through retry mechanisms and asynchronous processing.
Tags: Kafka, Redis, PostgreSQL, REST, Kubernetes, DLQ
Questions & Insights

Clarifying Questions

What is the expected scale (DAU and total notifications per day)? Assumption: 10 million notifications per day with peaks of 2,000 requests per second (RPS).
What delivery channels must be supported? Assumption: Push notifications (iOS/Android), SMS, and Email.
What are the reliability requirements? Assumption: At-least-once delivery is required. Some latency is acceptable for email, but push/SMS should be near real-time.
Do we need to handle user preferences and opt-outs? Assumption: Yes, the system must respect user-defined settings for each channel.
Is there a requirement for message templating? Assumption: Yes, internal services will provide parameters, and the system will populate templates.

Thinking Process

Core Strategy: Decouple the notification trigger from the actual delivery using an asynchronous message queue. This protects the system from third-party provider latency and enables independent scaling.
Key Questions for Design Flow:
How do we ensure the API remains responsive while downstream providers are slow? (Asynchronous Processing)
How do we prevent overloading third-party providers like Twilio or SendGrid? (Rate Limiting & Throttling)
How do we handle transient failures in delivery? (Retry logic with exponential backoff)
How do we manage cross-channel user preferences efficiently? (Metadata Storage & Cache)

Bonus Points

Smart Provider Failover: Implement a circuit breaker pattern that automatically switches between providers (e.g., Twilio to Nexmo) if success rates drop below a threshold.
Deduplication Logic: Use a "Message-ID" hash in Redis with a short TTL to prevent "double-tap" sends caused by network retries.
Priority-Based Queuing: Implement separate queues for "Critical" (2FA, Password Reset) vs. "Marketing" notifications to ensure transactional messages are never blocked by bulk campaigns.
Idempotent Consumers: Design workers to be idempotent to handle the "at-least-once" delivery semantics of the message queue.
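
The provider failover idea above can be sketched as a small circuit breaker that tracks recent outcomes per provider and skips a provider while its breaker is open. This is a minimal illustration, not a production design: the class and function names, the window size, the 50% threshold, and the 30-second cooldown are all assumptions, and the `send` callables stand in for real provider SDK calls.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


@dataclass
class ProviderBreaker:
    """Tracks recent send outcomes for one provider (hypothetical helper)."""
    name: str
    send: Callable[[str], bool]       # stand-in for a provider call; True on success
    window: int = 20                  # assumed: number of recent attempts considered
    min_success_rate: float = 0.5     # assumed threshold for opening the breaker
    cooldown_s: float = 30.0          # assumed time the breaker stays open
    outcomes: Deque[bool] = field(default_factory=deque)
    opened_at: Optional[float] = None

    def available(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset history and let traffic probe the provider again.
            self.opened_at = None
            self.outcomes.clear()
            return True
        return False

    def record(self, ok: bool, now: float) -> None:
        self.outcomes.append(ok)
        if len(self.outcomes) > self.window:
            self.outcomes.popleft()
        window_full = len(self.outcomes) == self.window
        success_rate = sum(self.outcomes) / len(self.outcomes)
        if window_full and success_rate < self.min_success_rate:
            self.opened_at = now  # open the breaker


def send_with_failover(providers: List[ProviderBreaker], message: str, now: float) -> Optional[str]:
    """Try providers in priority order, skipping any whose breaker is open."""
    for provider in providers:
        if not provider.available(now):
            continue
        ok = provider.send(message)
        provider.record(ok, now)
        if ok:
            return provider.name
    return None  # every provider failed or was unavailable
```

Passing `now` explicitly keeps the sketch deterministic; a real worker would likely read a monotonic clock instead.
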
Design Breakdown

Functional Requirements

Support multi-channel delivery (SMS, Email, Push).
Message templating (HTML for email, plain text for SMS).
User preference management (Opt-in/Opt-out per category).
Notification status tracking (Sent, Delivered, Failed).
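
As a rough illustration of the templating requirement, the sketch below fills a stored template with caller-supplied parameters and HTML-escapes values for the email channel. The `TEMPLATES` dictionary, the template id `order_shipped`, and the `render` function are hypothetical; a real system would load templates from its template store.

```python
import html
from string import Template

# Hypothetical template store keyed by (template_id, channel).
TEMPLATES = {
    ("order_shipped", "sms"): "Hi $name, your order $order_id has shipped.",
    ("order_shipped", "email"): "<p>Hi $name, your order <b>$order_id</b> has shipped.</p>",
}


def render(template_id: str, channel: str, params: dict) -> str:
    """Populate a template; raises KeyError if the template or a parameter is missing."""
    template = Template(TEMPLATES[(template_id, channel)])
    if channel == "email":
        # Escape values so user-supplied text cannot inject markup into the HTML body.
        params = {key: html.escape(str(value)) for key, value in params.items()}
    return template.substitute(params)
```

Using `substitute` rather than `safe_substitute` makes a missing parameter fail loudly before anything is sent.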

Non-Functional Requirements

High Availability: System must be operational 99.99% of the time.
Scalability: Handle sudden spikes (e.g., "Flash Sale" alerts).
Reliability: Zero data loss for notifications once accepted by the API.
Latency: End-to-end delivery under 10 seconds for 95% of messages.

Estimation

Traffic: 10M notifications/day ≈ 116 notifications/sec on average (10,000,000 / 86,400 s). Peak ≈ 1,000 to 2,000 RPS.
Storage: 10M logs/day at 500 bytes per log is 5 GB/day, or about 1.8 TB/year.
Cache: User preferences for 10M active users. 10M * 100 bytes ≈ 1 GB of RAM (fits in a single small Redis instance).
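
The back-of-envelope numbers above can be reproduced with a few lines of arithmetic. The inputs are the assumptions stated in this section (500-byte log rows, 100 bytes of preferences per user); the variable names are only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

notifications_per_day = 10_000_000
log_bytes = 500            # assumed size of one delivery log row
active_users = 10_000_000
pref_bytes = 100           # assumed size of one user's cached preferences

avg_rps = notifications_per_day / SECONDS_PER_DAY
storage_gb_per_day = notifications_per_day * log_bytes / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1000
cache_gb = active_users * pref_bytes / 1e9

print(f"average rate: {avg_rps:.1f} notifications/sec")
print(f"log storage:  {storage_gb_per_day:.1f} GB/day, {storage_tb_per_year:.2f} TB/year")
print(f"cache size:   {cache_gb:.1f} GB")
```

The peak of 2,000 RPS is roughly 17 times the average rate, which is why the queue and autoscaling matter more than the average throughput.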

Blueprint

Concise Summary: A microservice architecture using a REST API for ingestion, a distributed message queue for buffering, and worker nodes for executing multi-channel delivery via 3rd-party SDKs.
Major Components:
Notification API: Validates requests and persists initial notification state.
Metadata DB: Stores user preferences, templates, and delivery logs.
Cache: High-speed lookup for user settings and rate-limit counters.
Message Queue: Decouples the API from slow external providers.
Notification Workers: Formats messages using templates and calls external APIs.
Simplicity Audit: This design uses a classic producer-consumer pattern which is the minimum requirement to handle third-party latency and retries without blocking the main application flow.
Architecture Decision Rationale:
Why this architecture fits the problem: It provides high durability and isolation. If the Email provider goes down, SMS and Push are unaffected.
Functional Requirement Satisfaction: User preferences and the referenced template are validated before queuing; workers render the template and handle channel-specific logic.
Non-functional Requirement Satisfaction: Scaling is horizontal (add more workers); Kafka provides high availability and durability.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless API and Worker nodes deployed in Kubernetes (EKS/GKE) across 3 Availability Zones. Auto-scaling triggered by CPU for API and "Queue Depth" for Workers.
API Schema Design:
POST /v1/notifications
Protocol: REST/JSON.
Request: { user_id: string, type: string, template_id: string, params: map }.
Idempotency: Client provides request_id header to prevent duplicate sends.
Rate Limit: 100 requests/sec per calling service.
Resilience & Reliability: Workers use exponential backoff (1s, 2s, 4s...) for 5xx errors from providers.
Observability: Prometheus metrics for "Delivery Latency" and "Provider Success Rate". Structured logs in ELK stack for auditing.
Security: API secured via JWT; internal service-to-service communication via mTLS.
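
A minimal sketch of how the endpoint above might validate a request and honor the `request_id` idempotency header. The in-memory dictionary and list stand in for Redis/PostgreSQL and the message queue, and the status codes (202 for accepted, 200 for a replayed request, 400 for a bad body) are assumptions rather than part of the original schema.

```python
import uuid
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("user_id", "type", "template_id", "params")


class NotificationAPI:
    """Illustrative handler for POST /v1/notifications (names are hypothetical)."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # request_id -> notification_id (stand-in for a durable store)
        self.queue: List[dict] = []      # stand-in for the message queue

    def post_notification(self, request_id: str, body: dict) -> Tuple[int, dict]:
        missing = [name for name in REQUIRED_FIELDS if name not in body]
        if missing:
            return 400, {"error": "missing fields: " + ", ".join(missing)}
        if request_id in self._seen:
            # Same request_id seen before: return the original result, enqueue nothing.
            return 200, {"notification_id": self._seen[request_id], "duplicate": True}
        notification_id = str(uuid.uuid4())
        self._seen[request_id] = notification_id
        self.queue.append({"id": notification_id, **body})
        return 202, {"notification_id": notification_id, "duplicate": False}
```

In a multi-instance deployment the check-and-set on `request_id` would need to be atomic in the shared store; the dictionary here is safe only because the sketch is single-process.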

Storage

Access Pattern: Heavy write (logs), medium read (preferences/templates). High consistency for preferences.
Database Table Design:
users_preferences: user_id, channel, is_enabled. Primary key is (user_id, channel), so each user has one row per channel.
notification_templates: id (PK), content, type.
notification_logs: id (PK), user_id, status, created_at. Indexed on user_id.
Technical Selection: PostgreSQL.
Rationale: Strong consistency for preferences and robust indexing for log queries. Fits MVP scale perfectly.
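Distribution Logic: Partition notification_logs by created_at (monthly) to keep indexes performant. PostgreSQL requires the partition key to be part of any primary key on a partitioned table, so the key becomes (id, created_at).
Reliability & Recovery: Daily snapshots to S3, plus continuous WAL (write-ahead log) archiving for point-in-time recovery (PITR).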

Cache

Purpose & Justification: Reduces DB load for frequently accessed user preferences and message templates.
Key-Value Schema:
pref:{user_id} -> JSON string of settings.
tmpl:{template_id} -> Template string.
TTL: 24 hours (with write-through on update).
Technical Selection: Redis.
Rationale: Sub-millisecond latency and built-in eviction policies (LRU).
Failure Handling: If Redis is down, fall back to PostgreSQL (Graceful Degradation).
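
The read path and the degradation behavior described above might look like the following sketch. `DictCache` and `DownCache` are hypothetical stand-ins with a simplified `get`/`set` interface (not the redis-py API), and a plain dictionary plays the role of PostgreSQL.

```python
import json

PREF_TTL_S = 24 * 60 * 60  # 24-hour TTL from the schema above


class DictCache:
    """In-memory stand-in for Redis; the TTL is accepted but not enforced here."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ttl_s):
        self.data[key] = value


class DownCache:
    """Simulates an unreachable cache."""
    def get(self, key):
        raise ConnectionError("cache unavailable")

    def set(self, key, value, ttl_s):
        raise ConnectionError("cache unavailable")


def get_preferences(user_id, cache, db):
    """Read preferences from the cache, falling back to the database."""
    key = f"pref:{user_id}"
    cache_ok = True
    try:
        cached = cache.get(key)
    except ConnectionError:
        cached, cache_ok = None, False  # degrade gracefully: treat as a miss
    if cached is not None:
        return json.loads(cached)
    prefs = db[user_id]
    if cache_ok:
        try:
            cache.set(key, json.dumps(prefs), ttl_s=PREF_TTL_S)
        except ConnectionError:
            pass  # failing to repopulate the cache must not fail the request
    return prefs
```

One consequence worth planning for: while the cache is down, every preference lookup hits the database, so the database needs headroom for the full read load.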

Messaging

Purpose & Decoupling: Acts as a buffer. Prevents the API from timing out if SendGrid or Twilio are slow.
Event / Topic Schema: Topics per channel: notification.email, notification.sms, notification.push.
Throughput & Partitioning: Kafka partitions based on user_id to ensure message ordering for a single user (though usually not critical for notifications).
Failure Handling: Failed messages are moved to a Dead Letter Queue (DLQ) after 5 retry attempts for manual inspection.
Technical Selection: Kafka.
Rationale: High throughput and data retention allows replaying messages if a worker bug is discovered.
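
The retry-then-DLQ behavior can be sketched as a small worker routine. This version reads "5 retry attempts" as one initial attempt followed by up to five retries, with the 1s, 2s, 4s... backoff from the Service section; that reading, the `TransientError` type, and the function names are assumptions. The `sleep` function is injected so the logic can be exercised without waiting.

```python
import time
from typing import Callable, List

MAX_RETRIES = 5      # retries after the first attempt (assumed interpretation)
BASE_DELAY_S = 1.0   # first backoff delay; doubles on each retry


class TransientError(Exception):
    """Hypothetical marker for retryable failures, e.g. a provider 5xx response."""


def process(message: dict,
            deliver: Callable[[dict], None],
            dlq: List[dict],
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Deliver a message with exponential backoff; park it in the DLQ if retries run out."""
    for retry in range(MAX_RETRIES + 1):
        try:
            deliver(message)
            return True
        except TransientError:
            if retry < MAX_RETRIES:
                sleep(BASE_DELAY_S * 2 ** retry)  # 1s, 2s, 4s, 8s, 16s
    dlq.append(message)  # left for manual inspection
    return False
```

Adding random jitter to each delay is a common refinement so that many workers do not retry against a recovering provider at the same instant. Sleeping inside the consumer also holds up the partition, so a separate retry topic is another option at higher volumes.
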
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): The system favors Availability. A notification might be slightly delayed (Eventual Consistency in logs), but the system remains available to accept new requests.
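Reliability: The API writes to both the DB and the Queue. This is a dual write, not a true Transactional Outbox: the two writes are not atomic, so a crash between them can leave a notification persisted but never queued. To be truly robust, we could write the notification to the DB first (together with an outbox row, in one transaction) and use a CDC (Change Data Capture) tool to push it to the Queue. For an MVP, a dual write with proper error handling is sufficient.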
Bottleneck Analysis: Third-party rate limits are the primary bottleneck. Workers must implement "token bucket" rate limiting to stay within provider quotas.
Security: PII (Phone numbers/Emails) should be encrypted at rest in the database.
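Distinguishing Insights: For high-scale systems, prioritize "Transactional" over "Marketing" traffic. If a marketing blast fills the queue with 1M messages, a 2FA code should skip the line by going through a separate high-priority queue with its own workers, since a single shared queue cannot reorder messages that are already waiting.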
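
The token bucket mentioned in the bottleneck analysis can be sketched in a few lines. The class name and parameters are illustrative, and the bucket here is local to one worker; enforcing a provider quota across many workers would need shared state (for example, counters in Redis), which this sketch does not cover.

```python
class TokenBucket:
    """Per-provider rate limiter: refills at rate_per_s, holds at most capacity tokens."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity   # start full so a short burst is allowed
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the call may proceed at time `now`."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call `try_acquire` before each provider request and, on `False`, delay or requeue the message rather than sending it and triggering the provider's own throttling.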