The Question

Scalable Notification System Design

Design a global-scale notification system capable of handling 1 billion notifications per day across Push, SMS, and Email. The system must support high-priority alerts (e.g., OTPs) with sub-second latency and bulk marketing messages. Your design should address 3rd-party provider reliability, user preference management, rate limiting to prevent spam, and idempotency guarantees. Discuss how you would handle massive traffic spikes and ensure at-least-once delivery for every notification requested.

Technologies: Kafka, Redis, PostgreSQL, gRPC, FCM, Twilio, SendGrid, DynamoDB, Prometheus, Grafana

Questions & Insights

Clarifying Questions

What types of notifications are supported? (e.g., Push for iOS/Android, SMS, Email, or In-app?)

What is the expected scale? (e.g., Daily Active Users (DAU), total notifications per day, and peak throughput during events like "Flash Sales").

Is delivery guaranteed or best-effort? (e.g., at-least-once delivery vs. strict exactly-once, which is much harder to achieve at scale).

Are there priority levels? (e.g., High-priority OTP vs. Low-priority marketing newsletters).

Are we handling user preferences and opt-outs? (e.g., "Do not disturb" modes or channel-specific unsubscribes).

Assumptions for this design:

Scale: 100M DAU, 1 Billion notifications sent per day.

Channels: Push, SMS, and Email.

Priority: Support for high-priority (Real-time) and low-priority (Bulk) tiers.

Reliability: At-least-once delivery guarantee.

Latency: Sub-second for high-priority alerts.

Thinking Process

How do we decouple ingestion from delivery to handle massive bursts?

Introduce a distributed message queue (Kafka) to act as a buffer between the API and the delivery workers.

How do we prevent 3rd-party provider bottlenecks or outages from crashing our system?

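Implement rate limiting on our side of each provider call (in the workers, before the request leaves our system) and run independent worker pools for each provider (FCM, Twilio, SendGrid) so that one provider's failure stays isolated.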

How do we ensure we don't spam users?

Implement a centralized "Frequency Capping" and "Preference Service" checked at the worker level before final dispatch.

How do we track the lifecycle of a notification?

Use a unique notification_id and a distributed log store to record states: Requested -> Queued -> Dispatched -> Delivered/Failed.
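The lifecycle above can be sketched as a small state machine. This is a minimal illustration, not a prescribed implementation: the `Status` enum, the `ALLOWED` table, and the `advance` helper are hypothetical names, and allowing a jump to Failed from any non-terminal state is an assumption.

```python
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    QUEUED = "queued"
    DISPATCHED = "dispatched"
    DELIVERED = "delivered"
    FAILED = "failed"


# Forward transitions only. DELIVERED and FAILED are terminal.
# Assumption: any non-terminal state may move directly to FAILED.
ALLOWED = {
    Status.REQUESTED: {Status.QUEUED, Status.FAILED},
    Status.QUEUED: {Status.DISPATCHED, Status.FAILED},
    Status.DISPATCHED: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: set(),
    Status.FAILED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Rejecting illegal transitions at write time keeps late or duplicated status events (expected under at-least-once delivery) from moving a notification backwards.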

Bonus Points

Smart Provider Routing: Implement a dynamic routing layer that switches between providers (e.g., Twilio vs. Vonage for SMS) based on real-time success rates and cost-optimization.

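Notification Deduplication: Use Redis-based idempotency keys to prevent duplicate notifications caused by client retries or upstream glitches. A sliding-window Bloom filter is a cheaper alternative, but its false positives would occasionally suppress a legitimate notification, so it suits bulk traffic better than OTPs.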

Geo-localized Templates: Store notification templates in a globally distributed store to minimize latency when rendering notifications for different locales.

Backpressure-aware Scheduling: Implement a feedback loop where worker consumers slow down if 3rd-party provider APIs start returning 429 (Too Many Requests) or high latencies.
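One way to realize the backpressure feedback loop is an additive-increase, multiplicative-decrease (AIMD) controller on each worker pool's send rate. The sketch below is illustrative only: the class name, the default step and backoff values, and the one-second latency threshold are all assumptions.

```python
class AdaptiveRate:
    """AIMD controller for the per-provider send rate (requests per second)."""

    def __init__(self, initial=100.0, floor=1.0, ceiling=1000.0,
                 step=5.0, backoff=0.5):
        self.rate = initial
        self.floor = floor
        self.ceiling = ceiling
        self.step = step        # additive increase on healthy responses
        self.backoff = backoff  # multiplicative decrease on pressure signals

    def on_response(self, status_code: int, latency_s: float,
                    slow_threshold_s: float = 1.0) -> float:
        """Update and return the allowed rate after one provider response."""
        if status_code == 429 or latency_s > slow_threshold_s:
            # Provider is pushing back: cut the rate sharply.
            self.rate = max(self.floor, self.rate * self.backoff)
        else:
            # Healthy response: probe upward slowly.
            self.rate = min(self.ceiling, self.rate + self.step)
        return self.rate
```

Cutting quickly and recovering slowly lets the workers back off within a few responses while unsent messages simply stay in the queue.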

Design Breakdown

Functional Requirements

Core Use Cases:

API to send notifications (one-to-one and one-to-many).

Multi-channel support (Push, SMS, Email).

User preference management (Opt-in/out, DND).

Template management (Placeholders/Variables).

Scope Control:

In-scope: Backend architecture, delivery logic, rate limiting, and 3rd party integration.

Out-of-scope: Building the actual SMTP server or SMS gateway (use 3rd parties), UI/UX for notification center.

Non-Functional Requirements

Scale: Handle 10k+ Average QPS, 50k+ Peak QPS.

Latency: End-to-end delivery < 500ms for OTPs.

Availability & Reliability: 99.99% availability; no loss of accepted notifications (each request is durably persisted in Kafka before the API acknowledges it).

Consistency: Eventual consistency for delivery status updates.

Security & Privacy: Support for PII protection in message payloads; OAuth2 for internal API access.

Estimation

Traffic Estimation:

1 Billion notifications / 86,400 seconds ≈ 11,600 Avg QPS.

Peak QPS (3x Avg) ≈ 35,000 QPS, which sits inside the 50k+ peak target listed under Non-Functional Requirements.

Storage Estimation:

Notification Metadata: 100 bytes/record.

1 Billion records/day * 100 bytes ≈ 100 GB/day.

30-day retention ≈ 3 TB total storage.

Bandwidth Estimation:

Outgoing payload (avg 2KB including templates): 11,600 QPS * 2 KB ≈ 23.2 MB/s.
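The arithmetic above can be reproduced with a few lines of Python. The variable names are illustrative, and decimal units (1 KB = 1,000 bytes) are assumed. The unrounded average is about 11,574 QPS, which the estimate rounds to 11,600.

```python
NOTIFICATIONS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
METADATA_BYTES = 100   # per notification record
PAYLOAD_KB = 2         # average outgoing payload, templates included
RETENTION_DAYS = 30
PEAK_FACTOR = 3

avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY         # about 11,574
peak_qps = PEAK_FACTOR * avg_qps                          # about 34,722

daily_storage_gb = NOTIFICATIONS_PER_DAY * METADATA_BYTES / 1e9   # 100 GB/day
retained_tb = daily_storage_gb * RETENTION_DAYS / 1000            # 3 TB

bandwidth_mb_s = avg_qps * PAYLOAD_KB / 1000              # about 23 MB/s

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS")
print(f"{daily_storage_gb:.0f} GB/day, {retained_tb:.0f} TB retained")
print(f"{bandwidth_mb_s:.1f} MB/s outgoing")
```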

Blueprint

Concise Summary: A microservices-based architecture using a message broker to decouple notification requests from actual execution, ensuring high availability and resilience against slow 3rd-party providers.

Major Components:

Notification API: Entry point for internal services to trigger notifications.

Preference Service: Validates if the user wants to receive the notification.

Kafka (Messaging Layer): Durable buffer to store notifications categorized by priority.

Notification Workers: Consumers that fetch tasks, render templates, and call 3rd-party APIs.

PostgreSQL: Stores user settings and metadata.

Redis: Caches user preferences and handles rate limiting.

Simplicity Audit: This architecture avoids complex batch-processing engines or stream-processing frameworks in the MVP, relying on standard consumer-group patterns for horizontal scaling.

Architecture Decision Rationale:

Why this architecture?: Decoupling via Kafka allows the system to survive 3rd party outages and absorb traffic spikes without failing the caller.

Functional Requirement Satisfaction: Multi-channel is handled by specific worker pools; Preferences are enforced via a dedicated service.

Non-functional Requirement Satisfaction: Scalability is achieved by adding more Kafka partitions and worker instances; Reliability is handled by Kafka’s persistence.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Notification Service: Stateless nodes deployed across multiple Availability Zones (AZs). Scaled based on CPU/Request count.

Worker Fleet: Scaled independently based on Kafka consumer lag metrics.
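A lag-based scaling rule might look like the sketch below. The function name and its parameters are hypothetical. The cap at the partition count reflects how Kafka consumer groups work: consumers beyond the number of partitions sit idle.

```python
import math


def desired_workers(total_lag: int, per_worker_rate: float,
                    target_drain_s: float, num_partitions: int,
                    min_workers: int = 1) -> int:
    """Workers needed to drain `total_lag` messages within `target_drain_s`.

    per_worker_rate is messages per second one worker can dispatch.
    The result is capped at num_partitions, since extra consumers in a
    consumer group receive no partitions.
    """
    needed = math.ceil(total_lag / (per_worker_rate * target_drain_s))
    return max(min_workers, min(needed, num_partitions))
```

For example, 600,000 messages of lag at 500 messages per second per worker and a 60-second drain target yields 20 workers. A far larger backlog is bounded by the partition count, which is why spare partitions need to be provisioned ahead of spikes.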

API Schema Design:

POST /v1/notifications

Protocol: gRPC (internal) or REST.

Request: { user_id, type, priority, template_id, context_data, idempotency_key }

Idempotency: idempotency_key stored in Redis for 24h to prevent duplicate sends.
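The idempotency check amounts to an atomic "claim this key if nobody has" operation with a TTL. The sketch below uses an in-memory dictionary as a stand-in so that it runs without a Redis server. The class and method names are hypothetical. Against real Redis, the equivalent would be a single `SET key value NX EX 86400` command.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # idempotency_key -> expiry timestamp

    def claim(self, key: str) -> bool:
        """Return True if this is the first request seen for `key` within the TTL."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # duplicate: the key is still live
        self._expiry[key] = now + self._ttl
        return True


def handle_request(store: IdempotencyStore, idempotency_key: str) -> str:
    """Accept a request once; answer repeats without enqueuing again."""
    if not store.claim(idempotency_key):
        return "duplicate"
    return "accepted"  # a real API would publish to Kafka here
```

This in-memory version is not thread-safe or shared across API nodes, which is exactly what Redis provides in the design.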

Resilience & Reliability:

Retry Policy: Workers implement exponential backoff with jitter for 5xx errors from 3rd parties.

Circuit Breaker: If a 3rd party provider (e.g., Twilio) has a >50% failure rate, open the circuit and fail fast or route to a secondary provider.
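A minimal circuit breaker for one provider could track the outcome of recent calls and open once the failure rate passes 50%. The sketch below is an assumption-laden illustration: the window size, minimum call count, and cooldown are made-up defaults, and the half-open state is simplified to "close and start counting again after the cooldown".

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window=20, min_calls=10, failure_threshold=0.5,
                 cooldown_s=30.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # True = success, False = failure
        self._min_calls = min_calls
        self._threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a call to the provider may be attempted."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown_s:
            # Simplified half-open: close the circuit and reset the window.
            self._opened_at = None
            self._results.clear()
            return True
        return False  # fail fast, or route to a secondary provider

    def record(self, success: bool) -> None:
        """Record one call outcome and open the circuit if needed."""
        self._results.append(success)
        if len(self._results) >= self._min_calls:
            failure_rate = self._results.count(False) / len(self._results)
            if failure_rate > self._threshold:
                self._opened_at = self._clock()
```

A worker would call `allow()` before each provider request and `record()` after it. When `allow()` returns False, the message can be sent through a secondary provider or left in the queue for later.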

Storage

Access Pattern:

Metadata DB: High write for logs, high read for preference checks.

Database Table Design:

User_Preferences: user_id, channel, category, is_enabled, updated_at. Composite PK (user_id, channel, category), since each user has one row per channel and category.

Notification_Logs: notification_id (PK), user_id, status (Requested/Queued/Dispatched/Delivered/Failed), channel, created_at.

Technical Selection:

PostgreSQL: For User Preferences (Relational integrity, ACID).

DynamoDB/Cassandra (Optional for 10x Scale): For Notification_Logs to handle high-volume writes. For MVP, PostgreSQL with partitioning is sufficient.

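Distribution Logic: Partition Notification_Logs by created_at (daily or monthly range partitions) so that archiving and cleanup become cheap partition drops. PostgreSQL requires the partition key to be part of any primary key on a partitioned table, so the key becomes (notification_id, created_at).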

Cache

Purpose & Justification: Reduce DB load for frequent preference checks and implement rate-limiting counters.

Key-Value Schema:

User Prefs: user:prefs:{user_id} -> JSON blob. TTL 1 hour.

Rate Limit: rate:limit:{user_id}:{channel} -> Counter. TTL 1 minute.

Technical Selection: Redis. High performance, supports atomic increments for rate limiting.
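The rate-limit counter above behaves like a fixed-window limiter. The sketch below imitates it in memory so that it runs without Redis. The class name and the limit value are hypothetical. With real Redis, the same logic is typically an `INCR` on the key followed by an `EXPIRE` when the counter is first created.

```python
import time


class FixedWindowLimiter:
    """In-memory imitation of a Redis counter with a TTL per user and channel."""

    def __init__(self, limit: int, window_s: float = 60.0, clock=time.time):
        self._limit = limit
        self._window_s = window_s
        self._clock = clock
        self._counters = {}  # key -> (count, expires_at)

    def allow(self, user_id: str, channel: str) -> bool:
        """Count one send attempt and report whether it is within the limit."""
        key = f"rate:limit:{user_id}:{channel}"
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if expires_at <= now:
            # Key missing or expired: start a fresh window.
            count, expires_at = 0, now + self._window_s
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self._limit
```

Fixed windows allow short bursts around a window boundary. If that matters, a sliding-window or token-bucket variant is the usual refinement.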

Messaging

Purpose & Decoupling: Acts as a buffer and ensures delivery even if workers or 3rd parties are down.

Event / Topic Schema:

Topics: notification.high_priority, notification.bulk.

Partitioning Key: user_id (Ensures sequential ordering for the same user if needed).
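Keying by user_id works because the partition is derived from a stable hash of the key, so every message for one user lands on the same partition and is consumed in order. The sketch below illustrates the idea with CRC32. This is an assumption for illustration only, since Kafka's own default partitioner uses a different hash function.

```python
import zlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user_id to a partition index with a stable hash.

    Illustrative only: CRC32 stands in for the broker client's real hash.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions
```

Changing the partition count changes the mapping, so per-user ordering is only preserved while the number of partitions stays fixed.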

Failure Handling:

Dead-letter Queue (DLQ): Messages that fail after N retries are moved to notification.dlq for manual inspection or secondary retry logic.
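The retry-then-DLQ flow can be sketched as follows. Everything here is a hypothetical illustration: the `RetryableError` class, the `deliver_with_retries` function, the backoff constants, and a plain list standing in for the notification.dlq topic. The delays use "full jitter", a random value between zero and an exponentially growing ceiling.

```python
import random
import time


class RetryableError(Exception):
    """Raised by `send` for transient failures such as provider 5xx responses."""


def deliver_with_retries(message, send, dlq, max_retries=3,
                         base_s=0.5, cap_s=30.0,
                         sleep=time.sleep, rng=random.random) -> bool:
    """Try send(message); retry transient failures, then dead-letter.

    Returns True on success, False if the message was moved to the DLQ.
    """
    for attempt in range(max_retries + 1):
        try:
            send(message)
            return True
        except RetryableError:
            if attempt == max_retries:
                break  # retries exhausted
            # Full jitter: uniform in [0, min(cap, base * 2^attempt)].
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)
    dlq.append(message)  # stand-in for publishing to notification.dlq
    return False
```

Non-retryable errors (for example an invalid phone number) propagate immediately instead of being retried, which keeps permanently failing messages from burning the retry budget.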

Technical Selection: Kafka. High throughput, persistent storage, and replay capabilities.

Infrastructure (Optional)

Observability:

Metrics: Track delivery_latency, error_rate_by_provider, queue_depth.

Tracing: Jaeger/OpenTelemetry to trace a notification from API call to 3rd party dispatch.

Wrap Up

Advanced Topics

Trade-offs (Availability vs. Consistency): We prioritize availability (AP in CAP terms). If the Preference DB is down, we can fall back to cached preferences or, for critical notifications such as OTPs, default to sending so that they are not lost. This sacrifices strict preference consistency for a short window.
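The fallback order can be made explicit in code. The sketch below is one possible reading, with several assumptions: preferences are a per-user dictionary of channel to boolean, missing channels default to enabled, the database failure surfaces as `ConnectionError`, and only high-priority traffic fails open.

```python
def is_allowed(user_id, channel, priority, cached_prefs, load_prefs_from_db):
    """Decide whether to send, degrading gracefully if the preference DB is down.

    cached_prefs: dict of user_id -> {channel: bool}, standing in for Redis.
    load_prefs_from_db: callable(user_id) -> {channel: bool}; may raise
    ConnectionError when the database is unreachable.
    """
    prefs = cached_prefs.get(user_id)
    if prefs is None:
        try:
            prefs = load_prefs_from_db(user_id)
            cached_prefs[user_id] = prefs
        except ConnectionError:
            # Nothing cached and the DB is down: fail open for critical
            # traffic only, and hold back everything else.
            return priority == "high"
    return prefs.get(channel, True)
```

Failing closed for bulk traffic is the conservative choice here, because a marketing message sent against an opt-out is usually worse than one sent late.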

Bottleneck Analysis:

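Hot Shards: If a celebrity sends a broadcast, millions of notifications land on the same topic at once, and a single worker expanding the whole recipient list becomes the hot spot. Mitigation: Use a "Fan-out" worker pattern where the broadcast is split into chunks of 1,000 users that can be processed in parallel.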

3rd Party Latency: Slow responses from providers can tie up workers. Mitigation: Use asynchronous HTTP clients (e.g., Netty on the JVM or goroutines in Go) and strict timeouts.
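Both mitigations are small in code. The sketch below shows a chunking helper for the fan-out step and a timeout wrapper for provider calls, using Python's asyncio in place of the Netty or Go options named above. The function names and the two-second default timeout are assumptions.

```python
import asyncio
from itertools import islice


def chunked(user_ids, size=1000):
    """Yield lists of at most `size` user IDs from any iterable."""
    it = iter(user_ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


async def call_with_timeout(provider_call, timeout_s=2.0):
    """Await provider_call(), returning None if it exceeds the timeout.

    provider_call is a zero-argument coroutine function, for example a
    wrapper around an async HTTP request to the provider.
    """
    try:
        return await asyncio.wait_for(provider_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # the caller can retry or dead-letter the message
```

Each chunk produced by `chunked` would be published as its own message, so that many workers share one broadcast. The timeout guarantees that a stalled provider connection releases its worker after a bounded wait.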

Security: All payloads containing PII (Phone numbers, emails) should be encrypted at rest in the database and Kafka.