The Question

Scalable Cross-Channel Notification System

Design a high-throughput notification system supporting Push, SMS, and Email for 100 million daily active users. The system must handle 1 billion notifications daily, respect user-specific preference settings, and ensure reliable delivery despite unreliable 3rd-party provider APIs. Address how you would handle traffic spikes (e.g., breaking news) and ensure at-least-once delivery semantics while minimizing latency.

Kafka

Redis

PostgreSQL

API Gateway

Circuit Breaker

gRPC

Docker

Kubernetes

AES-256

Questions & Insights

Clarifying Questions

Scale & Volume: What is the expected daily active user count and the peak notification volume (e.g., a breaking news event)?

Assumption: 100M DAU, 1 billion notifications per day, peak QPS on the order of 100k.

Channels: Which delivery channels are required for the MVP?

Assumption: Mobile Push (iOS/Android), SMS, and Email.

Latency & Reliability: What are the latency requirements and the acceptable delivery guarantee?

Assumption: Near real-time (< 1s for 99th percentile push), At-least-once delivery guarantee.

Personalization: Do we need to handle user preferences (opt-outs, quiet hours)?

Assumption: Yes, the system must respect user-defined channel preferences and global opt-outs.

Priority: Should the system distinguish between transactional (OTP) and marketing (Newsletter) notifications?

Assumption: Yes, high-priority notifications should bypass marketing queues.

Thinking Process

Core Bottleneck: The primary challenge is the external dependency on 3rd-party providers (APNS, FCM, Twilio, SendGrid) which are prone to throttling and intermittent failures.

Strategy:

How do we decouple request ingestion from delivery to handle bursts? Use a persistent Message Queue.

How do we prevent 3rd-party failures from taking down our system? Implement Circuit Breakers and Exponential Backoff.

How do we ensure low latency for user preferences? Use a Cache-aside pattern for user settings.

How do we handle scale? Partition the workers by notification type or channel.

Bonus Points

Idempotency Keys: Use unique notification IDs generated at the edge so that retries can be deduplicated across workers (distributed deduplication).

Provider Agility: Implement an abstraction layer for 3rd-party providers to allow dynamic failover (e.g., if Twilio is down, switch to Vonage).

Smart Throttling: Implement token-bucket rate limiting per user and per provider to comply with downstream SLAs and prevent "spamming" users.

Dead Letter Queues (DLQ): Route "poison pills" and messages with exhausted retries to a DLQ for auditability and manual intervention.
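The token-bucket throttling mentioned above can be sketched as follows. This is a minimal, single-process illustration; the class name, the injectable clock, and the idea of keeping one bucket per user or per provider are assumptions for the example, and a real deployment would likely keep bucket state in a shared store.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec, holding at most `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start full, allowing an initial burst
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A worker would call `allow()` before each provider call and defer or requeue the message when it returns False.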

Design Breakdown

Functional Requirements

Core Use Cases:

Send a notification to a specific user via Push, SMS, or Email.

Manage user notification preferences (enable/disable specific channels).

Track delivery status (Sent, Delivered, Failed).

Scope Control:

In-scope: Backend API, message queuing, preference validation, and 3rd-party integration.

Out-of-scope: Analytics dashboard for marketing, message template creation UI, in-app notification inbox (notification center).

Non-Functional Requirements

Scale: Support 1B messages/day with peak bursts of 10x average load.

Latency: End-to-end delivery under 1 second (99th percentile) for Push notifications.

Availability: 99.99% availability; the system must not lose messages if a downstream provider is down.

Consistency: Eventual consistency for delivery status logs; preference updates must be reflected quickly.

Fault Tolerance: Automatic retries with jitter and circuit breaking for external APIs.

Security: Secure storage of device tokens and PII (Phone/Email) using encryption at rest.

Estimation

Traffic Estimation:

1B notifications / 86,400 s ≈ 11,500 average QPS.

Peak QPS (10x) ≈ 115,000 QPS.

Storage Estimation:

Notification Log: 100 bytes per record. 1B records/day = 100 GB/day.

30-day retention = 3 TB.

Bandwidth Estimation:

Average payload 1 KB. 11.5k QPS * 1 KB ≈ 11.5 MB/s (Inbound).

Outbound to providers is roughly equivalent.
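The arithmetic behind these estimates can be reproduced with a few lines. The constant names are illustrative, and 1 KB is treated as 1,000 bytes for simplicity.

```python
NOTIFICATIONS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 10
LOG_RECORD_BYTES = 100
RETENTION_DAYS = 30
PAYLOAD_BYTES = 1_000  # assumption: 1 KB taken as 1,000 bytes

# Traffic
avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY        # about 11,574
peak_qps = avg_qps * PEAK_MULTIPLIER                     # about 115,741

# Storage
log_gb_per_day = NOTIFICATIONS_PER_DAY * LOG_RECORD_BYTES / 1e9
retention_tb = log_gb_per_day * RETENTION_DAYS / 1e3

# Bandwidth
inbound_mb_per_s = avg_qps * PAYLOAD_BYTES / 1e6         # about 11.6

print(f"avg QPS ~{avg_qps:,.0f}, peak QPS ~{peak_qps:,.0f}")
print(f"logs {log_gb_per_day:.0f} GB/day, {retention_tb:.0f} TB retained")
print(f"inbound ~{inbound_mb_per_s:.1f} MB/s")
```

The exact average is closer to 11,574 QPS; the figures above round it down to 11,500.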

Blueprint

Concise Summary: A microservices architecture centered around a distributed message queue to decouple ingestion from delivery, utilizing workers to interface with 3rd-party providers.

Major Components:

API Gateway: Entry point for authentication, rate limiting, and request validation.

Notification Service: Validates requests, fetches user preferences, and pushes events to the queue.

Redis Cache: Stores user settings and device tokens to avoid database hits for every message.

PostgreSQL: System of record for user preferences and notification metadata.

Kafka: High-throughput message bus for buffering and prioritizing notifications.

Notification Workers: Consumers that execute the actual delivery logic and handle retries.

Simplicity Audit: This design avoids complex stream processing or multi-region synchronization for the MVP, focusing instead on reliable queuing and worker scaling.

Architecture Decision Rationale:

Why this architecture? Message queues provide the buffer that protects the system against traffic spikes and provider slowness.

Functional Satisfaction: Covers multi-channel support through polymorphic workers and respects preferences via the metadata check.

Non-functional Satisfaction: Kafka provides the required 100k+ QPS throughput and persistence for reliability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancer (L7) using Round Robin or Least Connections to distribute traffic to the API Gateway.

Security & Perimeter:

API Gateway: Performs JWT validation and extracts user_id.

Rate Limiting: Tiered limits (e.g., 100 requests/sec for marketing bots, unlimited for internal system-critical alerts).

WAF: Standard protection against SQLi and XSS on the ingestion endpoints.

Service

Topology & Scaling: Stateless Notification Services deployed across multiple Availability Zones (AZs). Auto-scaling based on CPU and Request Count.

API Schema Design:

POST /v1/notifications

Protocol: REST/JSON.

Request: { "user_id": "123", "type": "ORDER_CONFIRM", "priority": "high", "content": { "title": "...", "body": "..." } }

Idempotency: Header X-Idempotency-Key required.
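A minimal sketch of how the endpoint might enforce the idempotency header. The class name, the in-memory dictionary (standing in for a shared store such as Redis), and the 202/200/400 status codes are assumptions for illustration, not part of the specified API.

```python
import uuid
from typing import Dict, Tuple

REQUIRED_FIELDS = ("user_id", "type", "priority", "content")


class NotificationAPI:
    """Hypothetical handler for POST /v1/notifications."""

    def __init__(self) -> None:
        # idempotency key -> notification id; in-memory stand-in for a shared store
        self._seen: Dict[str, str] = {}

    def post_notification(self, headers: dict, body: dict) -> Tuple[int, dict]:
        key = headers.get("X-Idempotency-Key")
        if not key:
            return 400, {"error": "X-Idempotency-Key header is required"}
        missing = [f for f in REQUIRED_FIELDS if f not in body]
        if missing:
            return 400, {"error": "missing fields: " + ", ".join(missing)}
        if key in self._seen:
            # Retried request: return the original ID without enqueuing again.
            return 200, {"notification_id": self._seen[key], "duplicate": True}
        notification_id = str(uuid.uuid4())
        self._seen[key] = notification_id
        # A real service would publish the event to the queue here.
        return 202, {"notification_id": notification_id, "duplicate": False}
```

A client that times out and resends the same request with the same key gets back the original notification ID instead of triggering a second send.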

Resilience & Reliability:

Circuit Breaker: If Twilio's 5xx error rate exceeds a threshold, the SMS Worker stops calling it and routes messages to a secondary provider or a DLQ.

Timeouts: Aggressive timeouts (2s) for 3rd-party calls to prevent worker exhaustion.
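The circuit-breaker behaviour can be sketched as a small state machine. This version is a simplification: it trips on a count of consecutive failures rather than an error rate, and the default threshold, cool-down period, and method names are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then allows a
    trial request once `reset_timeout` seconds have passed (half-open)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call after the cool-down period.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cool-down
```

The SMS Worker would check `allow_request()` before each provider call and, when it returns False, send through the secondary provider or park the message.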

Storage

Access Pattern:

High Read: Fetching user preferences and device tokens.

High Write: Notification delivery logs and status updates.

Database Table Design:

Users: user_id, email, phone_number, created_at.

UserPreferences: user_id, channel_type, is_enabled, updated_at.

DeviceTokens: user_id, platform (iOS/Android), token.

Technical Selection:

PostgreSQL: Chosen for ACID compliance on user settings. Use Partitioning by user_id for scale.

Distribution Logic: Sharding by user_id to ensure all data for a single user resides on one shard, simplifying preference lookups.
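One simple way to map a user_id to a shard is a stable hash modulo the shard count, sketched below. The function name and the choice of SHA-256 are assumptions. Plain modulo hashing remaps most users when the shard count changes, so consistent hashing or a lookup table may be preferable if resharding is expected.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user_id to a shard index in [0, num_shards)."""
    # Use a stable hash: Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends only on user_id, the Users, UserPreferences, and DeviceTokens rows for one user all resolve to the same shard.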

Cache

Purpose & Justification: Reduces latency for preference lookups (from ~50ms DB to ~2ms Cache) and protects the DB from 100k QPS.

Key-Value Schema:

pref:{user_id} -> JSON blob of preferences.

tokens:{user_id} -> List of active device tokens.

TTL: 24 hours with LRU eviction.

Technical Selection: Redis (Cluster Mode) for high availability and sub-millisecond latency.
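The cache-aside lookup for preferences can be sketched as below. A plain dictionary stands in for Redis, and the class name, the loader callback, and the explicit invalidation hook are assumptions for illustration.

```python
import time


class PreferenceCache:
    """Cache-aside: read from the cache, fall back to the DB on a miss."""

    def __init__(self, load_from_db, ttl_seconds: float = 24 * 3600,
                 clock=time.monotonic):
        self._load = load_from_db     # callable: user_id -> preferences dict
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}              # key -> (expires_at, value)

    def get(self, user_id: str) -> dict:
        key = f"pref:{user_id}"
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]           # cache hit
        value = self._load(user_id)   # miss or expired: read the database
        self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, user_id: str) -> None:
        """Call after a preference update so the next read sees fresh data."""
        self._store.pop(f"pref:{user_id}", None)
```

Invalidating on write, rather than waiting for the 24-hour TTL, is one way to keep opt-outs from being served stale.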

Messaging

Purpose & Decoupling: Kafka acts as the buffer. It decouples the API (ingestion) from the slow 3rd-party networks.

Throughput & Partitioning:

Topics: notif.push, notif.sms, notif.email.

Partition Key: user_id (ensures order of messages for a single user).
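As a sketch of the routing described here, the function below builds the (topic, key, value) triple a producer would send. The function name and the JSON encoding are assumptions, and no Kafka client library is used. With a key-hashing partitioner, records that share a key go to the same partition as long as the partition count does not change, which is what preserves per-user ordering.

```python
import json
from typing import Tuple

TOPIC_BY_CHANNEL = {
    "push": "notif.push",
    "sms": "notif.sms",
    "email": "notif.email",
}


def build_record(channel: str, user_id: str, payload: dict) -> Tuple[str, bytes, bytes]:
    """Return (topic, key, value) for a notification event."""
    topic = TOPIC_BY_CHANNEL[channel]       # raises KeyError for unknown channels
    key = user_id.encode("utf-8")           # partition key: user_id
    value = json.dumps(payload, sort_keys=True).encode("utf-8")
    return topic, key, value
```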

Failure Handling:

Retry Queue: Messages that fail due to transient errors (429, 503) are moved to a retry topic with a delay.

DLQ: Messages that fail after 5 retries or have invalid payloads.
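The retry and DLQ routing can be sketched as follows, using exponential backoff with full jitter. The base delay, the cap, and the decision to send every non-transient status straight to the DLQ are assumptions; `attempt` counts the retries already made.

```python
import random

MAX_RETRIES = 5
TRANSIENT_STATUSES = {429, 503}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def route_failure(status_code: int, attempt: int):
    """Return ("retry", delay_seconds) or ("dlq", None) for a failed send."""
    if status_code in TRANSIENT_STATUSES and attempt < MAX_RETRIES:
        return "retry", backoff_delay(attempt)
    # Retries exhausted or a non-transient error (e.g. an invalid payload).
    return "dlq", None
```

The jitter spreads retries out so that a provider recovering from an outage is not hit by every worker at the same instant.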

Data Processing

Processing Model: Stream processing via Notification Workers (Go or Java for high concurrency).

Processing Logic:

Consume message from Kafka.

Re-check the cache for the latest opt-out status, since the user may have opted out while the message was queued.

Format payload using a local template engine.

Call 3rd-party SDK.

On success, update status log (async).
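The steps above can be sketched as a single function with its dependencies passed in. The message fields (`id`, `channel`), the callback signatures, and the `SKIPPED_OPT_OUT` status are hypothetical names introduced for the example.

```python
def process(message: dict, prefs_lookup, render, send, log_status) -> bool:
    """Handle one consumed message; return True if it was sent."""
    user_id = message["user_id"]
    channel = message["channel"]

    # Re-check the latest preferences before sending.
    prefs = prefs_lookup(user_id)
    if not prefs.get(channel, False):
        log_status(message["id"], "SKIPPED_OPT_OUT")
        return False

    # Format the payload, then call the provider.
    payload = render(message["type"], message["content"])
    send(channel, user_id, payload)   # exceptions propagate to the caller

    # Record the outcome (asynchronous in a real worker).
    log_status(message["id"], "SENT")
    return True
```

Letting provider exceptions propagate means the caller can decide between retry and DLQ before committing the Kafka offset.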

Technical Selection: Custom Go-based workers for low memory footprint and excellent concurrency (Goroutines) for handling blocking I/O.

Infrastructure (Optional)

Observability:

Metrics: Track "Delivery Latency" (End-to-end) and "Provider Error Rate".

Logging: Structured logs with trace_id propagated from the API Gateway to the Workers.
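A minimal sketch of structured logging with a propagated trace_id, using only the standard library. The function name and field names other than trace_id are assumptions.

```python
import json
import logging


def log_event(logger: logging.Logger, trace_id: str, event: str, **fields) -> str:
    """Emit one JSON log line carrying the trace_id; return the line."""
    record = {"trace_id": trace_id, "event": event}
    record.update(fields)
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

If the gateway attaches the trace_id to the queued message, a worker can log with the same value, so one query on trace_id shows a notification's full path.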

Wrap Up

Advanced Topics

Trade-offs:

Consistency vs. Availability: We choose Eventual Consistency for delivery logs to ensure the system remains available for sending notifications.

At-least-once Delivery: We accept that a user might rarely receive a duplicate notification if a worker crashes after sending to a provider but before committing the Kafka offset.
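Worker-side deduplication on the notification ID narrows this duplicate window, as sketched below. The class name and the in-memory set are assumptions; a shared store with an expiry would be needed across workers. A crash between the provider call and recording the ID can still produce a duplicate, so this reduces duplicates rather than eliminating them.

```python
class DedupingSender:
    """Skips notification IDs that have already been sent."""

    def __init__(self, send):
        self._send = send          # callable that delivers a payload
        self._sent_ids = set()     # in-memory stand-in for a shared store

    def deliver(self, notification_id: str, payload) -> bool:
        """Send unless this ID was already delivered; return True if sent."""
        if notification_id in self._sent_ids:
            return False           # redelivered message: skip
        self._send(payload)
        self._sent_ids.add(notification_id)
        return True
```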

Bottleneck Analysis:

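Hot Partitions: A celebrity "push all" fans out into millions of messages at once; if those messages share a partition key or are keyed unevenly, a few Kafka partitions absorb most of the load. Optimization: Use a "Broadcast Topic" with no partition key (Round Robin) for large-scale marketing blasts.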

Security:

PII (Phone/Email) must be encrypted using AES-256 before being stored in PostgreSQL.

Workers assume IAM roles that issue short-lived credentials to read provider secrets from HashiCorp Vault or AWS Secrets Manager.

Optimization:

Batching: Email providers often support batch APIs (send to 1000 users in one call). Workers should aggregate small messages before calling the provider.
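A size-based batcher can be sketched as follows. The class name and the `send_batch` callback are assumptions; a real worker would also flush on a timer so that a partially filled batch does not wait indefinitely.

```python
class EmailBatcher:
    """Buffers emails and sends them in batches of up to `max_size`."""

    def __init__(self, send_batch, max_size: int = 1000):
        self._send_batch = send_batch   # callable taking a list of emails
        self._max_size = max_size
        self._buffer = []

    def add(self, email) -> None:
        self._buffer.append(email)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; no-op when the buffer is empty."""
        if self._buffer:
            # Swap the buffer out first so a failing send cannot resend it twice.
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)
```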