The Question
Design

Scalable Cross-Channel Notification System

Design a high-throughput notification system supporting Push, SMS, and Email for 100 million daily active users. The system must handle 1 billion notifications daily, respect user-specific preference settings, and ensure reliable delivery despite unreliable 3rd-party provider APIs. Address how you would handle traffic spikes (e.g., breaking news) and ensure at-least-once delivery semantics while minimizing latency.

Technologies: Kafka, Redis, PostgreSQL, API Gateway, Circuit Breaker, gRPC, Docker, Kubernetes, AES-256

Questions & Insights

Clarifying Questions

Scale & Volume: What is the expected daily active user count and the peak notification volume (e.g., a breaking news event)?
Assumption: 100M DAU, 1 billion notifications per day, and a peak on the order of 100k QPS.
Channels: Which delivery channels are required for the MVP?
Assumption: Mobile Push (iOS/Android), SMS, and Email.
Latency & Reliability: What are the latency requirements and the acceptable delivery guarantee?
Assumption: Near real-time (p99 under 1 s for push) with an at-least-once delivery guarantee.
Personalization: Do we need to handle user preferences (opt-outs, quiet hours)?
Assumption: Yes, the system must respect user-defined channel preferences and global opt-outs.
Priority: Should the system distinguish between transactional (OTP) and marketing (Newsletter) notifications?
Assumption: Yes, high-priority notifications should bypass marketing queues.

Thinking Process

Core Bottleneck: The primary challenge is the dependency on 3rd-party providers (APNs, FCM, Twilio, SendGrid), which are prone to throttling and intermittent failures.
Strategy:
How do we decouple request ingestion from delivery to handle bursts? Use a persistent Message Queue.
How do we prevent 3rd-party failures from taking down our system? Implement Circuit Breakers and Exponential Backoff.
How do we ensure low latency for user preferences? Use a Cache-aside pattern for user settings.
How do we handle scale? Partition the workers by notification type or channel.

Bonus Points

Idempotency Keys: Use unique notification IDs generated at the edge to prevent duplicate deliveries during retries (Distributed deduplication).
Provider Agility: Implement an abstraction layer for 3rd-party providers to allow dynamic failover (e.g., if Twilio is down, switch to Vonage).
Smart Throttling: Implement token-bucket rate limiting per user and per provider to comply with downstream SLAs and prevent "spamming" users.
Dead Letter Queues (DLQ): Route "poison pills" (messages that can never be processed) and messages with exhausted retries to a DLQ for auditability and manual intervention.
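
The token-bucket throttling mentioned under Smart Throttling could look roughly like the sketch below. It is a single-process illustration only: the class names, the injectable clock, and the example rates are assumptions made for this sketch, and a real deployment would need shared state (for example in Redis) so that all workers see the same buckets.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock        # injectable so the behaviour can be tested
        self.tokens = capacity    # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """One bucket per key, e.g. ("user", "123") or ("provider", "twilio")."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.buckets = defaultdict(lambda: TokenBucket(rate, capacity, clock))

    def allow(self, key):
        return self.buckets[key].allow()
```

A worker would consult one limiter keyed by user and another keyed by provider, and send only when both allow the message.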

Design Breakdown

Functional Requirements

Core Use Cases:
Send a notification to a specific user via Push, SMS, or Email.
Manage user notification preferences (enable/disable specific channels).
Track delivery status (Sent, Delivered, Failed).
Scope Control:
In-scope: Backend API, message queuing, preference validation, and 3rd-party integration.
Out-of-scope: Analytics dashboard for marketing, message template creation UI, in-app notification inbox (notification center).

Non-Functional Requirements

Scale: Support 1B messages/day with peak bursts of 10x average load.
Latency: End-to-end delivery under 1 second for Push notifications.
Availability: 99.99% availability; the system must not lose messages if a downstream provider is down.
Consistency: Eventual consistency for delivery status logs; preference updates must be reflected quickly.
Fault Tolerance: Automatic retries with jitter and circuit breaking for external APIs.
Security: Secure storage of device tokens and PII (Phone/Email) using encryption at rest.

Estimation

Traffic Estimation:
1B notifications / 86,400 s ≈ 11,600 average QPS.
Peak QPS (10x) ≈ 116,000 QPS.
Storage Estimation:
Notification Log: 100 bytes per record. 1B records/day = 100 GB/day.
30-day retention = 3 TB.
Bandwidth Estimation:
Average payload 1 KB. 11.6k QPS * 1 KB ≈ 11.6 MB/s (Inbound).
Outbound to providers is roughly equivalent.
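
The arithmetic behind these estimates can be checked with a few lines. The inputs are the figures stated in the estimate; the only added assumption is that 1 KB is treated as 1,000 bytes.

```python
SECONDS_PER_DAY = 86_400
NOTIFICATIONS_PER_DAY = 1_000_000_000
PEAK_MULTIPLIER = 10
LOG_RECORD_BYTES = 100
PAYLOAD_BYTES = 1_000          # assumption: 1 KB taken as 1,000 bytes
RETENTION_DAYS = 30

avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY             # about 11,574
peak_qps = avg_qps * PEAK_MULTIPLIER                          # about 115,740

log_gb_per_day = NOTIFICATIONS_PER_DAY * LOG_RECORD_BYTES / 1e9   # 100 GB/day
log_tb_retained = log_gb_per_day * RETENTION_DAYS / 1_000         # 3 TB

inbound_mb_per_s = avg_qps * PAYLOAD_BYTES / 1e6              # about 11.6 MB/s

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS")
print(f"log {log_gb_per_day:.0f} GB/day, {log_tb_retained:.0f} TB retained")
print(f"inbound {inbound_mb_per_s:.1f} MB/s")
```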

Blueprint

Concise Summary: A microservices architecture centered around a distributed message queue to decouple ingestion from delivery, utilizing workers to interface with 3rd-party providers.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request validation.
Notification Service: Validates requests, fetches user preferences, and pushes events to the queue.
Redis Cache: Stores user settings and device tokens to avoid database hits for every message.
PostgreSQL: System of record for user preferences and notification metadata.
Kafka: High-throughput message bus for buffering and prioritizing notifications.
Notification Workers: Consumers that execute the actual delivery logic and handle retries.
Simplicity Audit: This design avoids complex stream processing or multi-region synchronization for the MVP, focusing instead on reliable queuing and worker scaling.
Architecture Decision Rationale:
Why this architecture?: Message queues provide the necessary "buffer" to protect the system against spikes and provider slowness.
Functional Satisfaction: Covers multi-channel support through polymorphic workers and respects preferences via the metadata check.
Non-functional Satisfaction: Kafka provides the required 100k+ QPS throughput and persistence for reliability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancer (L7) using Round Robin or Least Connections to distribute traffic to the API Gateway.
Security & Perimeter:
API Gateway: Performs JWT validation and extracts user_id.
Rate Limiting: Tiered limits (e.g., 100 requests/sec for marketing bots, unlimited for internal system-critical alerts).
WAF: Standard protection against SQLi and XSS on the ingestion endpoints.

Service

Topology & Scaling: Stateless Notification Services deployed across multiple Availability Zones (AZs). Auto-scaling based on CPU and Request Count.
API Schema Design:
POST /v1/notifications
Protocol: REST/JSON.
Request: { "user_id": "123", "type": "ORDER_CONFIRM", "priority": "high", "content": { "title": "...", "body": "..." } }
Idempotency: Header X-Idempotency-Key required.
Resilience & Reliability:
Circuit Breaker: If Twilio's 5xx error rate exceeds a threshold, the SMS Worker stops calling it and either fails over to a secondary provider or parks the messages (retry topic or DLQ) until the circuit closes.
Timeouts: Aggressive timeouts (2s) for 3rd-party calls to prevent worker exhaustion.
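
A minimal sketch of the circuit-breaker behaviour described above, assuming a simple consecutive-failure threshold rather than an error-rate window. The names CircuitBreaker, CircuitOpenError, and send_sms, as well as the threshold and timeout defaults, are hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("provider circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed half-open trial re-opens the circuit immediately.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def send_sms(primary, secondary, breaker, message):
    """Use the primary provider unless its circuit is open, then fail over."""
    try:
        return breaker.call(primary, message)
    except CircuitOpenError:
        return secondary(message)
```

Errors raised by the primary while the circuit is still closed propagate to the caller, which is where the retry path would pick them up.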

Storage

Access Pattern:
High Read: Fetching user preferences and device tokens.
High Write: Notification delivery logs and status updates.
Database Table Design:
Users: user_id, email, phone_number, created_at.
UserPreferences: user_id, channel_type, is_enabled, updated_at.
DeviceTokens: user_id, platform (iOS/Android), token.
Technical Selection:
PostgreSQL: Chosen for ACID compliance on user settings. Use Partitioning by user_id for scale.
Distribution Logic: Sharding by user_id to ensure all data for a single user resides on one shard, simplifying preference lookups.

Cache

Purpose & Justification: Reduces latency for preference lookups (from ~50ms DB to ~2ms Cache) and protects the DB from 100k QPS.
Key-Value Schema:
pref:{user_id} -> JSON blob of preferences.
tokens:{user_id} -> List of active device tokens.
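TTL: 24 hours with LRU eviction; entries are invalidated on preference updates so that opt-outs take effect without waiting for expiry.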
Technical Selection: Redis (Cluster Mode) for high availability and sub-millisecond latency.
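
The cache-aside lookup for the pref:{user_id} key could be sketched as follows. To keep the example self-contained, TTLCache is a hypothetical in-memory stand-in for the Redis operations involved (get, set-with-expiry, delete), and db_lookup and db_write are placeholder callables for the PostgreSQL access layer.

```python
import json
import time

PREF_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL from the schema above


class TTLCache:
    """In-memory stand-in for a Redis-like get / setex / delete interface."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)


def get_preferences(cache, db_lookup, user_id):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"pref:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    prefs = db_lookup(user_id)
    cache.setex(key, PREF_TTL_SECONDS, json.dumps(prefs))
    return prefs


def update_preferences(cache, db_write, user_id, prefs):
    """Write to the system of record first, then drop the stale cache entry."""
    db_write(user_id, prefs)
    cache.delete(f"pref:{user_id}")
```

Deleting the key on update, rather than rewriting it, keeps the write path simple: the next read repopulates the entry from the database.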

Messaging

Purpose & Decoupling: Kafka acts as the buffer. It decouples the API (ingestion) from the slow 3rd-party networks.
Throughput & Partitioning:
Topics: notif.push, notif.sms, notif.email.
Partition Key: user_id (preserves the order of messages for a single user within a topic, since Kafka only guarantees ordering per partition).
Failure Handling:
Retry Queue: Messages that fail due to transient errors (429, 503) are moved to a retry topic with a delay.
DLQ: Messages that still fail after 5 retries, or that have invalid payloads, are moved to the DLQ.
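
The routing decision described under Failure Handling could be sketched as below, combined with exponential backoff and full jitter for the retry delay. The ".retry" and ".dlq" topic suffixes and the base and maximum delays are assumptions for illustration; only the transient status codes (429, 503) and the limit of 5 retries come from the design above.

```python
import random

MAX_RETRIES = 5
TRANSIENT_STATUS = {429, 503}
BASE_DELAY_SECONDS = 1.0    # hypothetical starting delay
MAX_DELAY_SECONDS = 60.0    # hypothetical cap


def backoff_delay(attempt, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** attempt))
    return rng.uniform(0, ceiling)


def route_failure(topic, status_code, attempt):
    """Return (destination_topic, delay_seconds) for a failed delivery.

    `attempt` is the number of retries already made for this message.
    """
    if status_code not in TRANSIENT_STATUS or attempt >= MAX_RETRIES:
        # Permanent errors and exhausted retries go to the dead letter queue.
        return f"{topic}.dlq", 0.0
    return f"{topic}.retry", backoff_delay(attempt)
```

The jitter spreads retries out in time, so a provider recovering from an outage is not hit by every delayed message at once.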

Data Processing

Processing Model: Stream processing via Notification Workers (Go or Java for high concurrency).
Processing Logic:
Consume message from Kafka.
Re-check the cache for the latest opt-out status, since preferences may have changed while the message was queued.
Format payload using a local template engine.
Call 3rd-party SDK.
On success, update status log (async).
Technical Selection: Custom Go-based workers for low memory footprint and excellent concurrency (Goroutines) for handling blocking I/O.
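
The processing steps above can be sketched in a few lines. Python is used here only for brevity, not as a suggestion to replace the Go workers. All collaborators (is_enabled, render, send, record_status) are injected placeholders, the seen_ids set stands in for a shared deduplication store, and the status strings other than "SENT" are hypothetical.

```python
def process(message, *, channel, is_enabled, render, send, record_status, seen_ids):
    """Handle one message already consumed from the queue; return its status."""
    notif_id = message["notification_id"]

    # Deduplicate redeliveries (at-least-once means the same ID can reappear).
    if notif_id in seen_ids:
        return "DUPLICATE"

    # Re-check preferences: the user may have opted out while the message waited.
    if not is_enabled(message["user_id"], channel):
        record_status(notif_id, "SKIPPED_OPT_OUT")
        return "SKIPPED_OPT_OUT"

    # Format the payload, then call the provider.
    payload = render(message["type"], message["content"])
    send(message["user_id"], payload)  # exceptions propagate to the retry path

    # Only mark as seen and log after the provider call succeeded.
    seen_ids.add(notif_id)
    record_status(notif_id, "SENT")
    return "SENT"
```

Because the ID is recorded only after a successful send, a crash between the provider call and the bookkeeping can still produce a duplicate, which matches the at-least-once guarantee.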

Infrastructure (Optional)

Observability:
Metrics: Track "Delivery Latency" (End-to-end) and "Provider Error Rate".
Logging: Structured logs with trace_id propagated from the API Gateway to the Workers.

Wrap Up

Advanced Topics

Trade-offs:
Consistency vs. Availability: We choose Eventual Consistency for delivery logs to ensure the system remains available for sending notifications.
At-least-once Delivery: We accept that a user might rarely receive a duplicate notification if a worker crashes after sending to a provider but before committing the Kafka offset.
Bottleneck Analysis:
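Hot Partitions: A large fan-out (e.g., a celebrity "push all") can overload a single Kafka partition if its messages share one key. Optimization: Publish large-scale marketing blasts to a "Broadcast Topic" with no partition key (Round Robin) so they spread across partitions, accepting the loss of per-user ordering for those messages.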
Security:
PII (Phone/Email) must be encrypted using AES-256 before being stored in PostgreSQL.
Workers assume IAM roles with short-lived credentials to read provider secrets from HashiCorp Vault or AWS Secrets Manager.
Optimization:
Batching: Email providers often support batch APIs (e.g., up to 1,000 recipients in one call). Workers should aggregate small messages and flush when the batch is full or a short time limit expires, so batching does not add unbounded latency.
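
A minimal sketch of the batching idea, assuming a size-or-time flush policy. EmailBatcher, its parameters, and the send_batch callable are hypothetical names; the actual batch limit depends on the provider.

```python
import time


class EmailBatcher:
    """Collects messages and hands them to `send_batch` in groups."""

    def __init__(self, send_batch, max_size=1000, max_wait_seconds=1.0,
                 clock=time.monotonic):
        self.send_batch = send_batch
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self.clock = clock
        self.pending = []
        self.first_at = None  # arrival time of the oldest pending message

    def add(self, message):
        if not self.pending:
            self.first_at = self.clock()
        self.pending.append(message)
        if len(self.pending) >= self.max_size:
            self.flush()

    def tick(self):
        """Call periodically: flush a partial batch that has waited too long."""
        if self.pending and self.clock() - self.first_at >= self.max_wait_seconds:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.send_batch(batch)
```

The time limit bounds the extra latency a message can pick up while waiting for its batch to fill, which matters during quiet periods.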