The Question

Scalable Multi-Channel Notification Engine

Design a high-throughput notification system capable of delivering millions of messages daily across Push, SMS, and Email. The system must handle transactional spikes (e.g., OTPs) with high priority while managing high-volume marketing blasts. Key requirements include at-least-once delivery guarantees, robust retry mechanisms for 3rd party provider failures, user preference management, and strict idempotency to prevent duplicate notifications. Address how the architecture handles provider outages and ensures low-latency delivery during peak loads.

Kafka

Redis

DynamoDB

Kubernetes

API Gateway

Twilio

SendGrid

FCM

APNs

Questions & Insights

Clarifying Questions

What is the expected scale of the system? (Daily Active Users, Notifications per day/peak).

What are the delivery guarantees? (Is at-least-once delivery acceptable, or is exactly-once required for specific cases like billing?)

Does the system need to support prioritization? (e.g., Should an OTP (One Time Password) bypass a marketing blast?)

How should we handle user preferences and opt-outs? (Are these managed within the system or by an external User Service?)

What is the latency SLA? (Target time from API call to delivery to the third-party provider).

Assumptions for MVP:

Scale: 100M notifications per day, peak 5k QPS.

Reliability: At-least-once delivery.

Priority: High-priority queue for transactional messages (OTPs) and Low-priority for marketing.

Providers: Integration with Twilio (SMS), SendGrid (Email), and FCM/APNs (Push).

Thinking Process

To build a scalable and reliable notification system, we must move from a synchronous request-response model to an asynchronous event-driven architecture.

Decoupling: How do we prevent 3rd party provider latency from taking down our internal services? (Use Message Queues).

Rate Limiting & Throttling: How do we protect our providers and our own downstream workers? (Implement token buckets at the gateway and consumer levels).

Retry Logic: How do we handle transient failures (5xx, timeouts) versus permanent failures (most 4xx) from providers? (Exponential backoff with Dead Letter Queues).

Idempotency: How do we prevent sending the same notification twice during a retry? (Idempotency keys and distributed locking/caching).
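The token-bucket throttling mentioned under Rate Limiting & Throttling could look roughly like the sketch below. It is a single-process illustration only; the class name, parameters, and injectable clock are assumptions made for the example, and a real deployment would keep the bucket state in a shared store so all gateway or consumer instances see the same budget.

```python
import time


class TokenBucket:
    """Single-process token bucket (illustrative; names are hypothetical)."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s      # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.clock = clock                # injectable for testing
        self.tokens = float(capacity)     # start with a full bucket
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits the budget."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A gateway would typically hold one bucket per caller, while a consumer would hold one per provider so that it stays under the provider's published limits.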

Bonus Points

Smart Provider Routing: Implement a "Weighted Round Robin" or "Least-Cost Routing" strategy to switch between multiple providers (e.g., Twilio vs. Plivo) based on health, cost, or deliverability metrics.

Dynamic Content Templating: Decouple message logic from the payload using a template engine (e.g., Handlebars/Mustache) stored in S3, allowing marketing teams to update copy without code deployments.

Feedback Loop Integration: Consume webhooks from providers (Delivered, Bounced, Opened) to update notification status in real-time and automatically blacklist "dead" emails/numbers.

Compliance & Privacy: Implement PII scrubbing in logs and regional data residency (e.g., GDPR requirements for EU users).
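As an illustration of the Smart Provider Routing idea above, here is a minimal sketch of smooth weighted round robin that skips unhealthy providers. The weights, the healthy flag, and the class names are assumptions for the example; how health is measured (error rate, latency, webhook feedback) is left open.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    weight: int           # relative share of traffic (hypothetical values)
    healthy: bool = True  # would be driven by health checks in practice
    current: int = 0      # running score used by the algorithm


class WeightedRouter:
    """Smooth weighted round robin over healthy providers."""

    def __init__(self, providers):
        self.providers = list(providers)

    def pick(self):
        candidates = [p for p in self.providers if p.healthy and p.weight > 0]
        if not candidates:
            raise RuntimeError("no healthy provider available")
        total = sum(p.weight for p in candidates)
        # Each candidate gains its weight; the leader is chosen and penalized.
        for p in candidates:
            p.current += p.weight
        best = max(candidates, key=lambda p: p.current)
        best.current -= total
        return best.name


router = WeightedRouter([Provider("twilio", 3), Provider("plivo", 1)])
picks = [router.pick() for _ in range(4)]
```

Over any four consecutive picks with these weights, three go to the first provider and one to the second; marking a provider unhealthy removes it from rotation immediately.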

Design Breakdown

Functional Requirements

Core Use Cases:

Support Push, Email, and SMS notifications.

Send transactional (high priority) and marketing (low priority) messages.

Allow users to opt-in/out of specific notification categories.

Track delivery status (Sent, Delivered, Failed).

Scope Control:

In-scope: API for triggering notifications, template management, provider integration, and retry logic.

Out-of-scope: In-app notification UI/Inbox, user profile management (assumed external), and deep analytics/ML for marketing optimization.

Non-Functional Requirements

Scale: Support horizontal scaling to handle 10k+ peak QPS (roughly 2x headroom over the estimated 5k peak).

Latency: 99th percentile delivery to providers under 500ms (excluding provider network latency).

Availability & Reliability: 99.99% uptime; persistent storage of notification logs to ensure no data loss.

Consistency: Eventual consistency for delivery status updates.

Fault Tolerance: Circuit breakers for third-party API dependencies.

Estimation

Traffic: 100M notifications/day ≈ 1,200 avg QPS. Peak QPS (4x) ≈ 5,000.

Storage:

Each log entry ≈ 500 bytes.

100M entries/day × 500 bytes × 30 days retention = 1.5 TB.

Bandwidth:

Ingress: 5,000 QPS × 1 KB payload = 5 MB/s.

Egress: Similar bandwidth for outgoing calls to providers.
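The back-of-envelope numbers above can be checked with a few lines of arithmetic. This sketch only restates the stated inputs (100M/day, 4x peak factor, 500-byte log entries, 30-day retention, 1 KB payloads) and uses decimal units; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

notifications_per_day = 100_000_000
avg_qps = notifications_per_day / SECONDS_PER_DAY  # about 1,157, rounded up to ~1,200
peak_qps = 4 * avg_qps                             # about 4,630, rounded up to ~5,000

log_entry_bytes = 500
retention_days = 30
storage_bytes = notifications_per_day * log_entry_bytes * retention_days

payload_bytes = 1_000  # 1 KB in decimal units
ingress_bytes_per_s = 5_000 * payload_bytes

print(f"avg QPS  ~ {avg_qps:,.0f}")
print(f"peak QPS ~ {peak_qps:,.0f}")
print(f"storage  = {storage_bytes / 1e12:.1f} TB")
print(f"ingress  = {ingress_bytes_per_s / 1e6:.0f} MB/s")
```

The exact average is slightly below the rounded 1,200 QPS figure, so the 5,000 QPS peak estimate is mildly conservative.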

Blueprint

Concise Summary: An event-driven architecture using an API Gateway for entry, a distributed message queue (Kafka) for decoupling, and specialized worker clusters for different delivery channels.

Major Components:

Notification API Service: Validates requests, fetches user preferences, and produces events to Kafka.

Distributed Cache (Redis): Stores idempotency keys to prevent duplicate sends and rate-limiting counters to throttle traffic.

Message Queue (Kafka): Acts as the backbone for load leveling and multi-priority queuing.

Delivery Workers: Channel-specific consumers (Push, SMS, Email) that handle the actual provider communication and retries.

Metadata Store (DynamoDB): Persists notification status and user opt-out settings.

Simplicity Audit: This design avoids complex stream processing engines (like Flink) in favor of simple consumer groups, which is sufficient for an MVP and easier to operate.

Architecture Decision Rationale:

Decoupling: Message queues ensure that if SendGrid is slow, SMS delivery is unaffected.

Scalability: Kafka and DynamoDB allow independent scaling of the ingestion and storage layers.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless Notification Service deployed in Kubernetes. Autoscaling based on CPU/Request count.

API Schema Design:

POST /v1/send

Protocol: REST/JSON

Request: { "user_id": "u1", "channel": "SMS", "template_id": "otp_v1", "params": {...}, "idempotency_key": "uuid" }

Idempotency: The idempotency_key is required on every request to prevent duplicates.
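To make the request contract concrete, here is a minimal validation sketch for the POST /v1/send body. The field names come from the request example above; the allowed channel set, the assumption that idempotency_key is a UUID string, and the function and class names are illustrative rather than a fixed part of the design.

```python
from dataclasses import dataclass, field
from uuid import UUID

ALLOWED_CHANNELS = {"PUSH", "SMS", "EMAIL"}  # assumed from the three channels
REQUIRED_FIELDS = ("user_id", "channel", "template_id", "idempotency_key")


@dataclass(frozen=True)
class SendRequest:
    user_id: str
    channel: str
    template_id: str
    idempotency_key: str
    params: dict = field(default_factory=dict)


def parse_send_request(body: dict) -> SendRequest:
    """Validate a decoded JSON body; raise ValueError on bad input."""
    missing = [name for name in REQUIRED_FIELDS if not body.get(name)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    channel = str(body["channel"]).upper()
    if channel not in ALLOWED_CHANNELS:
        raise ValueError(f"unsupported channel: {body['channel']}")
    # Assumption: keys are UUIDs; UUID() raises ValueError if malformed.
    UUID(str(body["idempotency_key"]))
    return SendRequest(
        user_id=str(body["user_id"]),
        channel=channel,
        template_id=str(body["template_id"]),
        idempotency_key=str(body["idempotency_key"]),
        params=dict(body.get("params") or {}),
    )
```

Rejecting malformed requests at the API layer keeps invalid events out of Kafka, where they would otherwise fail repeatedly in the workers.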

Resilience:

Circuit Breaker: Used in workers when calling Twilio/SendGrid to prevent resource exhaustion during provider outages.
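A circuit breaker of the kind described here can be sketched in a few dozen lines. The thresholds, the class names, and the half-open behavior (one trial call after the timeout) are assumptions chosen for illustration; production workers would more likely rely on an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock        # injectable for testing
        self.failures = 0         # consecutive failures seen so far
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Fail fast instead of tying up a worker on a dead provider.
                raise CircuitOpenError("provider circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Each worker would wrap its provider client call, for example breaker.call(send_sms, message), where send_sms is whatever function performs the HTTP request.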

Observability:

Prometheus metrics for "notification_sent_total" and "delivery_latency_ms".

Distributed tracing (Jaeger) to track a notification from the API to the provider.

Storage

Access Pattern: High write (logs), medium read (fetching preferences).

Database Table Design:

UserPreferences Table: user_id (partition key), channel_type (sort key), is_enabled.

NotificationLogs Table: notification_id (PK), user_id, status (PENDING, SENT, FAILED), retry_count, created_at.

Technical Selection: DynamoDB.

Rationale: High write throughput, schema flexibility, and automatic TTL for old logs (to keep costs down).

Distribution Logic: Sharded by user_id, which spreads load across many users and keeps a single user's notification history on one partition for efficient lookups.

Cache

Purpose: Deduplication and Rate Limiting.

Key-Value Schema:

dedup:{idempotency_key} -> bool (TTL 24h).

rate_limit:{user_id}:{channel} -> counter.

Technical Selection: Redis.

Rationale: Low latency for checking idempotency before pushing to Kafka.
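The deduplication check amounts to an atomic "set if absent with expiry". With Redis this maps to SET key value NX EX ttl; the sketch below uses a small in-memory stand-in so it runs without a server. The store class and function names are hypothetical, and the 24h TTL is taken from the key schema above.

```python
import time

DEDUP_TTL_S = 24 * 60 * 60  # 24h, matching the dedup key TTL


class InMemoryDedupStore:
    """Stand-in for Redis that mimics SET key value NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}    # key -> expiry timestamp
        self._clock = clock  # injectable for testing

    def set_if_absent(self, key, ttl_s):
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False     # key still live: this is a duplicate
        self._expiry[key] = now + ttl_s
        return True


def should_enqueue(store, idempotency_key):
    """Return True only for the first request seen with this key."""
    return store.set_if_absent(f"dedup:{idempotency_key}", DEDUP_TTL_S)
```

Because the check and the write happen in one atomic step in Redis, two API instances racing on the same key cannot both enqueue the message.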

Messaging

Purpose: Decoupling and Load Leveling.

Event Schema: Kafka topics partitioned by notification_id.

Topics: push_notifications, sms_notifications, email_notifications.

Dedicated priority_high and priority_low topics for each channel, since Kafka has no built-in message priority.

Failure Handling:

DLQ: If a message fails after 3 retries, it is moved to a notification_dlq for manual inspection.

Technical Selection: Kafka.

Rationale: High throughput and log retention for replayability.
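One way a worker might favor high-priority traffic without starving marketing messages is a simple weighted drain across the two priority streams. The sketch below uses in-memory deques in place of Kafka consumers, and the burst size of 4 is an arbitrary assumption for illustration.

```python
from collections import deque


def drain(high, low, high_burst=4):
    """Yield messages, taking up to high_burst high-priority ones per low one."""
    taken = 0  # high-priority messages taken since the last low-priority one
    while high or low:
        if high and (taken < high_burst or not low):
            taken += 1
            yield high.popleft()
        else:
            taken = 0
            yield low.popleft()


high = deque(["otp1", "otp2", "otp3", "otp4", "otp5"])
low = deque(["promo1", "promo2"])
order = list(drain(high, low, high_burst=2))
```

With a burst of 2, the worker handles two OTPs, then one marketing message, and repeats; a stricter policy would simply never read the low-priority stream while high-priority messages are pending.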

Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for delivery status. A user might not see the "Sent" status in their history immediately, but the delivery speed is prioritized.

Reliability:

Exponential Backoff: Workers retry provider calls with exponentially increasing delays (e.g., 1s, 2s, 4s) before giving up and routing the message to the DLQ.

Multi-Provider Failover: If Twilio consistently returns 5xx errors, the SMS worker can automatically switch to a secondary provider like Nexmo (now Vonage).
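The retry policy can be sketched as a small loop around the provider call. This is an illustration under stated assumptions: send is a hypothetical callable returning an HTTP status code, 429 is treated as transient alongside 5xx, permanent failures are marked FAILED without retrying, and only exhausted retries go to the dead letter queue (represented here by a plain list).

```python
import time


def classify(status_code):
    """Map a provider HTTP status to 'ok', 'transient', or 'permanent'."""
    if status_code == 429 or 500 <= status_code < 600:
        return "transient"   # assumption: throttling is worth retrying
    if 400 <= status_code < 500:
        return "permanent"
    return "ok"


def deliver_with_retry(send, message, dlq, max_retries=3,
                       base_delay_s=1.0, sleep=time.sleep):
    """Try to deliver; back off 1s, 2s, 4s between retries by default."""
    for attempt in range(max_retries + 1):
        kind = classify(send(message))
        if kind == "ok":
            return "SENT"
        if kind == "permanent":
            return "FAILED"          # retrying a bad request will not help
        if attempt == max_retries:
            dlq.append(message)      # retries exhausted: park for inspection
            return "FAILED"
        sleep(base_delay_s * 2 ** attempt)
```

In practice a small random jitter is usually added to each delay so that many workers retrying against the same recovering provider do not synchronize.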

Bottleneck Analysis:

Hot Partitions: If a "System Broadcast" hits 10M users, Kafka partitions might lag. Mitigation: Increase partition count and use random sharding keys for broadcast messages.

Security:

PII Encryption: Encrypt phone numbers and emails at rest in the logs.

TLS: Mandatory for all provider API communication.

Distinguishing Insights: For a staff-level design, I would include a Template Service that fetches templates from a versioned S3 bucket. This allows the system to send template_id in the message queue instead of the full message body, reducing the payload size significantly.
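A minimal sketch of the template lookup follows. It swaps in Python's standard string.Template for a Handlebars/Mustache engine and an in-memory dict for the versioned S3 bucket, so the template text, version numbers, and function name are all hypothetical.

```python
from string import Template

# Stand-in for a versioned template store, keyed by (template_id, version).
TEMPLATES = {
    ("otp_v1", 1): Template(
        "Your verification code is $code. It expires in $ttl_minutes minutes."
    ),
}


def render(template_id, version, params):
    """Render a stored template; raises KeyError on unknown id or missing param."""
    template = TEMPLATES[(template_id, version)]
    return template.substitute(params)


message = render("otp_v1", 1, {"code": "123456", "ttl_minutes": 5})
```

The queue then only carries template_id and params, and the worker renders the final body just before calling the provider.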