The Question

Scalable Multi-Channel Notification System

Design a highly available and scalable notification system capable of delivering millions of messages across Push (iOS/Android), SMS, and Email. The system must handle varying priorities (e.g., OTP vs Marketing), ensure at-least-once delivery, manage user preferences, and gracefully handle third-party provider failures or rate limits. Detail the architectural choices for decoupling, message persistence, and handling high-concurrency peak loads.

Kafka

Redis

PostgreSQL

Cassandra

APNs

FCM

Twilio

SendGrid

JWT

OAuth2

gRPC

Flink

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected volume of notifications per day? Are there specific peak periods (e.g., marketing blasts or breaking news)?

Latency SLAs: Are there different priority levels for notifications? For example, should an OTP (One-Time Password) have a stricter latency target than a weekly newsletter?

Delivery Guarantees: Is "exactly-once" delivery required, or is "at-least-once" acceptable? (Usually, at-least-once is standard for notifications to avoid the overhead of distributed transactions).

Third-Party Providers: Do we have specific vendors (e.g., Twilio for SMS, SendGrid for Email) or should the system be provider-agnostic to allow for failover?

User Preferences: Does the system need to manage user opt-in/opt-out and preferred channels (e.g., "only send SMS if push fails")?

Assumptions:

Scale: 100 million notifications per day.

Latency: High-priority messages (OTP) < 5 seconds; Low-priority < 30 minutes.

Reliability: At-least-once delivery is the primary goal.

Providers: We will use APNs (iOS), FCM (Android), Twilio (SMS), and SendGrid (Email).

Thinking Process

Decoupling is King: How do we prevent a slow third-party provider (like a lagging SMS gateway) from backing up the entire system?

Priority Queueing: How do we ensure a massive marketing blast doesn't delay critical account security alerts?

Idempotency & Deduplication: How do we prevent sending the same notification twice if a worker retries a failed task?

Rate Limiting & Throttling: How do we protect our internal services from overload and our users from notification fatigue?

Bonus Points

Smart Retries with Exponential Backoff: Implementing jittered backoff logic specifically tuned to different error types (e.g., 429 Too Many Requests vs. 500 Internal Server Error).

Provider Failover Strategy: Real-time switching between SMS/Email providers based on health checks and delivery success rates.

Fan-out Optimization: Handling notifications for "Million-follower" accounts using a hybrid push/pull model or specialized shard routing.

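Message Deduplication via Bloom Filters: Using memory-efficient data structures to check for duplicate requests at the entry point for high-velocity streams. Bloom filters can return false positives (never false negatives), so a "seen" result should be confirmed against an exact store before a message is dropped.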
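The jittered backoff mentioned in the first bonus point can be sketched roughly as follows. This is a minimal illustration, not a definitive policy: the set of retryable status codes, the `base` and `cap` values, and the `retry_after` parameter (standing in for a provider's Retry-After hint) are all assumptions made for the example.

```python
import random

# Hypothetical policy: HTTP status codes that are worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt, status, retry_after=None, base=0.5, cap=60.0, rng=random):
    """Return seconds to wait before retry number `attempt` (0-based), or None to give up."""
    if status not in RETRYABLE_STATUSES:
        return None  # e.g. 400/401/404: retrying will not help
    if status == 429 and retry_after is not None:
        # Respect the provider's hint, plus a little jitter so workers do not retry in lockstep.
        return retry_after + rng.uniform(0, base)
    # "Full jitter": pick uniformly between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

A worker would sleep for the returned delay before re-dispatching, and treat `None` as a signal to stop retrying and record the failure.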

Design Breakdown

Functional Requirements

Core Use Cases:

Send push notifications to iOS and Android devices.

Send SMS messages to mobile numbers.

Send emails to user addresses.

Support template-based message generation.

Provide an endpoint for users to manage notification preferences (opt-in/out).

Scope Control:

In-scope: API for ingestion, prioritization, channel-specific workers, and basic delivery tracking.

Out-of-scope: Complex analytics/click-through tracking, advanced user segmentation/targeting (marketing engine), and in-app message inbox UI.

Non-Functional Requirements

Scale: Support 100M+ notifications/day with peaks of roughly 11.5k QPS (10x the average).

Latency: P99 < 5 seconds for high-priority messages.

Availability & Reliability: 99.99% availability; messages should not be lost once accepted by the API.

Consistency: Eventual consistency for user preference updates.

Fault Tolerance: Handle third-party provider outages gracefully via retries and queues.

Security: Secure storage of device tokens and PII (Phone numbers/Emails).

Estimation

Traffic Estimation:

100M notifications / 86,400 seconds ≈ 1,150 average QPS.

Peak QPS (10x) ≈ 11,500 QPS.

Storage Estimation:

Metadata (User ID, Token, Status) ≈ 500 bytes per notification.

100M * 500 bytes = 50 GB/day.

Retain logs for 30 days = 1.5 TB.

Bandwidth Estimation:

Inbound: 11,500 QPS * 1 KB/request ≈ 11.5 MB/s.

Outbound: Similar, plus overhead for provider-specific payloads.
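The back-of-the-envelope numbers above can be reproduced with a few lines of Python. The inputs are the ones stated in the estimation; the variable names are only for illustration, and 1 KB is taken as 1,000 bytes. The exact average works out to about 1,157 QPS, which the estimate rounds down to 1,150.

```python
SECONDS_PER_DAY = 86_400

notifications_per_day = 100_000_000
peak_multiplier = 10
bytes_per_record = 500
retention_days = 30
request_size_bytes = 1_000  # 1 KB, using decimal units

avg_qps = notifications_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * peak_multiplier
storage_per_day_gb = notifications_per_day * bytes_per_record / 1e9
storage_retained_tb = storage_per_day_gb * retention_days / 1e3
inbound_mb_per_s = peak_qps * request_size_bytes / 1e6

print(f"average QPS:       {avg_qps:,.0f}")
print(f"peak QPS:          {peak_qps:,.0f}")
print(f"storage per day:   {storage_per_day_gb:.0f} GB")
print(f"30-day retention:  {storage_retained_tb:.1f} TB")
print(f"inbound bandwidth: {inbound_mb_per_s:.1f} MB/s")
```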

Blueprint

Concise Summary: A microservices-based architecture utilizing a distributed message queue (Kafka) to decouple ingestion from delivery. Workers consume from prioritized topics and dispatch to third-party providers.

Major Components:

Notification API: The entry point for all notification requests, performing validation and authentication.

Task Queue (Kafka): Acts as a buffer and provides prioritization by separating high and low priority traffic into different topics.

Workers: Scalable consumer groups that handle template rendering and provider-specific API calls (APNs, FCM, etc.).

Metadata Store: Persists user preferences, device tokens, and notification status.

Simplicity Audit: This design avoids complex real-time stream processing or custom delivery protocols, relying on industry-standard providers and proven queueing patterns.

Architecture Decision Rationale:

Why this architecture?: Queues are essential because third-party providers have unpredictable latencies and rate limits. Kafka allows for high throughput and replayability if a worker group fails.

Functional Satisfaction: Covers all delivery targets (iOS push, Android push, SMS, Email) through specialized worker logic.

Non-functional Satisfaction: Scalable horizontally; provides fault tolerance through persistent queueing.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: A CDN is not heavily utilized for notification sending, but Geo-DNS routes each request to the API Gateway in the nearest regional data center.

Security & Perimeter:

API Gateway: Handles SSL termination and OAuth2/JWT validation.

Rate Limiting: Implemented at the API Gateway to prevent any single internal service from overwhelming the notification system (e.g., a buggy script sending millions of emails).

Service

Topology & Scaling: Stateless Notification Services deployed in multiple Availability Zones (AZs). Autoscaling is driven by requests per second (RPS) and CPU utilization.

API Schema Design:

POST /v1/notifications

Protocol: REST/JSON

Request:

{ "user_id": "string", "priority": "high/low", "content": { "title": "...", "body": "..." }, "channels": ["push", "sms"] }

Idempotency: Supports X-Idempotency-Key header to prevent duplicate sends on client retries.
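A rough sketch of how the API could honor the X-Idempotency-Key header is shown here. The `IdempotencyStore` class, the `handle_post_notification` function, the `enqueue` callback, the 24-hour TTL, and the response shape are all hypothetical names and values chosen for the example; the in-memory dictionary stands in for a shared store such as Redis.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a shared store such as Redis (assumption)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # idempotency key -> (expiry time, stored response)

    def get(self, key):
        entry = self._seen.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._seen[key]  # entry has expired, treat the key as unseen
            return None
        return response

    def put(self, key, response):
        self._seen[key] = (self._clock() + self._ttl, response)


def handle_post_notification(headers, body, store, enqueue):
    """Enqueue a notification once per idempotency key; replay the response on retries."""
    key = headers.get("X-Idempotency-Key")
    if key is not None:
        cached = store.get(key)
        if cached is not None:
            return cached  # client retry: do not enqueue again
    notification_id = enqueue(body)
    response = {"status": 202, "notification_id": notification_id}
    if key is not None:
        store.put(key, response)
    return response
```

The check and the write are two separate steps in this sketch, so two concurrent requests with the same key could both pass. With a real shared store, an atomic set-if-absent operation would be needed to close that gap.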

Resilience & Reliability:

Circuit Breaker: If Twilio returns 5xx errors consistently, the SMS Worker trips the circuit and potentially routes to a backup provider.
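The circuit breaker behavior described above might look roughly like this. It is a simplified sketch: the threshold, the cool-down period, and the `primary`/`backup` callables are assumptions, and any exception from the primary provider is treated as a failure, whereas a real worker would inspect the status code.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # After the cool-down, let trial requests through (half-open state).
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the circuit


def send_sms(message, primary, backup, breaker):
    """Try the primary provider unless its circuit is open; otherwise use the backup."""
    if breaker.allow_request():
        try:
            result = primary(message)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return backup(message)
```

While the circuit is open, the primary provider is not called at all, which gives it room to recover and keeps worker threads from blocking on timeouts.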

Storage

Access Pattern:

Read-heavy for user preferences and device tokens (looked up for every notification).

Write-heavy for delivery logs.

Database Table Design:

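UserDevices: (user_id, device_token) as composite PK, platform (ios/android), last_active. A user can register several devices, so user_id alone cannot be the primary key.

UserPreferences: (user_id, channel_type) as composite PK, enabled (bool). One row per user per channel.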
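The two tables can be sketched as DDL. The design selects PostgreSQL; SQLite (via Python's built-in `sqlite3`) is used here only so the example runs without a server. The composite keys, the column types, the CHECK constraints, and the `enabled_channels` helper are assumptions for illustration, based on the fact that one user can own several devices and has one preference row per channel.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_devices (
    user_id      TEXT NOT NULL,
    device_token TEXT NOT NULL,
    platform     TEXT NOT NULL CHECK (platform IN ('ios', 'android')),
    last_active  TEXT,
    PRIMARY KEY (user_id, device_token)
);
CREATE TABLE user_preferences (
    user_id      TEXT NOT NULL,
    channel_type TEXT NOT NULL CHECK (channel_type IN ('push', 'sms', 'email')),
    enabled      INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, channel_type)
);
"""


def enabled_channels(conn, user_id):
    """Return the channels a user has opted in to, sorted by name."""
    rows = conn.execute(
        "SELECT channel_type FROM user_preferences "
        "WHERE user_id = ? AND enabled = 1 ORDER BY channel_type",
        (user_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO user_devices VALUES (?, ?, ?, ?)",
    [("u1", "tok-a", "ios", "2024-01-01"), ("u1", "tok-b", "android", "2024-01-02")],
)
conn.executemany(
    "INSERT INTO user_preferences VALUES (?, ?, ?)",
    [("u1", "push", 1), ("u1", "sms", 1), ("u1", "email", 0)],
)
print(enabled_channels(conn, "u1"))
```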

Technical Selection:

PostgreSQL: Stores user preferences and device tokens. Foreign keys with cascading deletes remove a user's tokens and preferences when the account is deleted, so deleted accounts are not messaged.

Cassandra/ClickHouse: Used for Delivery Logs to handle high-volume write throughput for auditing.

Distribution Logic: Sharded by user_id to ensure all data for a single user resides on the same partition.

Cache

Purpose & Justification: Reduce DB load for frequent lookups of device tokens and rate-limit counters.

Key-Value Schema:

Key: token:{user_id}, Value: Set<DeviceToken>, TTL: 24h (refreshed on app launch).

Key: rl:{user_id}:{channel}, Value: Count, TTL: 1h.

Technical Selection: Redis. High performance and supports complex data types like Sets for tokens.
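The `rl:{user_id}:{channel}` counter with a one-hour TTL amounts to a fixed-window rate limiter. The sketch below imitates that pattern with an in-memory dictionary in place of Redis; the `FixedWindowLimiter` class and its `limit` parameter are hypothetical, and a production version would rely on the store's atomic increment-and-expire operations instead.

```python
import time


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds=3600, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._counters = {}  # key -> (count, expiry time)

    def allow(self, user_id, channel):
        """Return True and count the send if the user is under the limit for this channel."""
        key = f"rl:{user_id}:{channel}"
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, now + self.window))
        if now >= expires_at:
            # The window has expired: start a fresh one, as a TTL would.
            count, expires_at = 0, now + self.window
        if count >= self.limit:
            return False
        self._counters[key] = (count + 1, expires_at)
        return True
```

Workers would call `allow` before dispatching low-priority traffic; whether high-priority messages such as OTPs bypass the limit is a policy decision outside this sketch.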

Messaging

Purpose & Decoupling: Decouples the notification request from the slow process of network I/O with third-party vendors.

Event / Topic Schema:

Topics: notification.high_priority, notification.low_priority, notification.sms, notification.email.

Throughput & Partitioning: Partitioned by user_id to maintain message ordering per user (e.g., "Order Placed" must arrive before "Order Delivered").

Technical Selection: Kafka. Chosen for its high-throughput capabilities and durability.
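Routing by priority and partitioning by user_id can be illustrated as below. This is only a sketch of the idea: the partition count is made up, and the MD5-based hash is a stand-in chosen for the example, not the hash function Kafka's own partitioner uses.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical partition count


def topic_for(priority):
    """Pick the topic from the request's priority field."""
    return "notification.high_priority" if priority == "high" else "notification.low_priority"


def partition_for(user_id, num_partitions=NUM_PARTITIONS):
    """Map a user to a stable partition so that user's messages stay in order."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


for user_id, priority in [("user-42", "high"), ("user-42", "high"), ("user-7", "low")]:
    print(user_id, topic_for(priority), partition_for(user_id))
```

Ordering holds only within a single partition of a single topic. Two messages for the same user that land in different topics (for example, one high priority and one low priority) have no ordering guarantee relative to each other.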

Data Processing

Processing Model: Stream processing for real-time delivery tracking and aggregation.

Processing DAG: Provider Webhook -> Kafka (Status Topic) -> Flink/Spark -> Log DB.

Technical Selection: Flink is optional. For the MVP, simple Python/Go Worker Services consuming from Kafka are sufficient; Flink can be introduced later if real-time aggregation is needed.

Infrastructure (Optional)

Observability:

Metrics: Monitor "Queue Depth" (crucial for detecting bottlenecks) and "Provider Error Rates".

Tracing: Jaeger/OpenTelemetry to trace a notification from API call to provider submission.

Wrap Up

Advanced Topics

Trade-offs: We chose at-least-once over exactly-once. Exactly-once delivery would require expensive two-phase commits or heavy distributed locking, which would kill throughput, and it still could not cover the third-party providers, whose APIs sit outside our transaction boundary. We handle duplicates at the consumer level using idempotency keys.

Reliability: If a provider is down, messages remain in Kafka (within its retention window) and are retried once the provider recovers. Messages that still fail after 5 retries go to a "Dead Letter Queue" (DLQ) for manual inspection.

Bottleneck Analysis: The biggest bottleneck is usually the third-party provider's rate limits. We implement Client-side Throttling in our workers to stay within Twilio/FCM quotas.

Security: Device tokens and PII are encrypted at rest (AES-256) and in transit (TLS 1.3).
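The client-side throttling mentioned under Bottleneck Analysis is commonly built on a token bucket. The following is a minimal single-process sketch; the `TokenBucket` class and its rate and capacity parameters are assumptions, and real provider quotas would have to be taken from each provider's documentation.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self, tokens=1):
        """Refill based on elapsed time, then take tokens if enough are available."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A worker would call `try_acquire()` before each provider request and, on `False`, pause or leave the message unacknowledged so it is picked up again later. With several worker instances, the bucket state would need to live in a shared store so the combined rate stays within the quota.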

Distinguishing Insights:

Smart Batching: For low-priority emails, workers can batch messages to SendGrid to reduce the number of API calls and improve throughput.

Adaptive Throttling: The system can dynamically lower the ingestion rate if it detects a global spike in provider latency.
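Smart Batching can be sketched as a small helper that groups queued messages before each provider call. The batch size of 500 and the `send_batch` callback are hypothetical; the actual per-request limit depends on the email provider's API.

```python
from itertools import islice


def batched(messages, batch_size):
    """Yield lists of at most `batch_size` messages, preserving order."""
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def send_in_batches(messages, send_batch, batch_size=500):
    """Send messages in batches and return the number of provider calls made."""
    calls = 0
    for batch in batched(messages, batch_size):
        send_batch(batch)  # hypothetical: one provider API call per batch
        calls += 1
    return calls
```

For example, 1,200 low-priority emails with a batch size of 500 result in three provider calls instead of 1,200.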