The Question
Design

Scalable Cross-Channel Notification System

Design a high-throughput notification system supporting Push, SMS, and Email for 100 million daily active users. The system must handle 1 billion notifications daily, respect user-specific preference settings, and ensure reliable delivery despite unreliable 3rd-party provider APIs. Address how you would handle traffic spikes (e.g., breaking news) and ensure at-least-once delivery semantics while minimizing latency.
Kafka
Redis
PostgreSQL
API Gateway
Circuit Breaker
gRPC
Docker
Kubernetes
AES-256
Questions & Insights

Clarifying Questions

Scale & Volume: What is the expected daily active user count and the peak notification volume (e.g., a breaking news event)?
Assumption: 100M DAU, 1 Billion notifications per day, peak QPS of 100k.
Channels: Which delivery channels are required for the MVP?
Assumption: Mobile Push (iOS/Android), SMS, and Email.
Latency & Reliability: What are the latency requirements and the acceptable delivery guarantee?
Assumption: Near real-time (< 1s for 99th percentile push), At-least-once delivery guarantee.
Personalization: Do we need to handle user preferences (opt-outs, quiet hours)?
Assumption: Yes, the system must respect user-defined channel preferences and global opt-outs.
Priority: Should the system distinguish between transactional (OTP) and marketing (Newsletter) notifications?
Assumption: Yes, high-priority notifications should bypass marketing queues.

Thinking Process

Core Bottleneck: The primary challenge is the external dependency on 3rd-party providers (APNs, FCM, Twilio, SendGrid), which are prone to throttling and intermittent failures.
Strategy:
How do we decouple request ingestion from delivery to handle bursts? Use a persistent Message Queue.
How do we prevent 3rd-party failures from taking down our system? Implement Circuit Breakers and Exponential Backoff.
How do we ensure low latency for user preferences? Use a Cache-aside pattern for user settings.
How do we handle scale? Partition the workers by notification type or channel.

Bonus Points

Idempotency Keys: Use unique notification IDs generated at the edge to prevent duplicate deliveries during retries (Distributed deduplication).
Provider Agility: Implement an abstraction layer for 3rd-party providers to allow dynamic failover (e.g., if Twilio is down, switch to Vonage).
Smart Throttling: Implement token-bucket rate limiting per user and per provider to comply with downstream SLAs and prevent "spamming" users.
Dead Letter Queues (DLQ): Sophisticated handling of "poison pills" and exhausted retries for auditability and manual intervention.
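
The idempotency-key idea above can be sketched as a small "have I seen this ID?" check. This is a minimal sketch, assuming an in-memory store with a TTL. In the design described here the store would more plausibly be shared (for example Redis, using a set-if-absent write with an expiry). `DedupStore` and `first_seen` are hypothetical names used only for illustration.

```python
import time


class DedupStore:
    """In-memory stand-in for a shared dedup store (hypothetical helper)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # notification_id -> expiry timestamp

    def first_seen(self, notification_id):
        """Return True only the first time an ID is seen within the TTL."""
        now = self._clock()
        expiry = self._seen.get(notification_id)
        if expiry is not None and expiry > now:
            return False  # duplicate: already handled recently
        self._seen[notification_id] = now + self._ttl
        return True
```

A worker would call `first_seen(notification_id)` before contacting the provider and skip the send when it returns False. The TTL bounds memory use, at the cost of allowing a duplicate if a retry arrives after the entry has expired.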
Design Breakdown

Functional Requirements

Core Use Cases:
Send a notification to a specific user via Push, SMS, or Email.
Manage user notification preferences (enable/disable specific channels).
Track delivery status (Sent, Delivered, Failed).
Scope Control:
In-scope: Backend API, message queuing, preference validation, and 3rd-party integration.
Out-of-scope: Analytics dashboard for marketing, message template creation UI, in-app notification inbox (notification center).

Non-Functional Requirements

Scale: Support 1B messages/day with peak bursts of 10x average load.
Latency: End-to-end delivery under 1 second for Push notifications.
Availability: 99.99% availability; the system must not lose messages if a downstream provider is down.
Consistency: Eventual consistency for delivery status logs; preference updates must be reflected quickly.
Fault Tolerance: Automatic retries with jitter and circuit breaking for external APIs.
Security: Secure storage of device tokens and PII (Phone/Email) using encryption at rest.

Estimation

Traffic Estimation:
1B notifications / 86,400s ≈ 11,500 Average QPS.
Peak QPS (10x) ≈ 115,000 QPS.
Storage Estimation:
Notification Log: 100 bytes per record. 1B records/day = 100 GB/day.
30-day retention = 3 TB.
Bandwidth Estimation:
Average payload 1KB. 11.5k QPS * 1KB ≈ 11.5 MB/s (Inbound).
Outbound to providers is roughly equivalent.
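
The arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-the-envelope check using the figures stated above. The variable names are illustrative, and 1 KB is treated as 1,000 bytes.

```python
SECONDS_PER_DAY = 86_400

notifications_per_day = 1_000_000_000
avg_qps = notifications_per_day / SECONDS_PER_DAY  # about 11,574
peak_qps = 10 * avg_qps                            # about 115,740

log_record_bytes = 100
log_gb_per_day = notifications_per_day * log_record_bytes / 1e9  # 100 GB
log_tb_30_days = log_gb_per_day * 30 / 1000                      # 3 TB

payload_bytes = 1_000  # assumption: 1 KB = 1,000 bytes
inbound_mb_per_s = avg_qps * payload_bytes / 1e6  # about 11.6 MB/s
```

The unrounded results (roughly 11,574 QPS and 11.6 MB/s) sit slightly above the rounded figures quoted in the text, which is well within the precision this kind of estimate needs.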

Blueprint

Concise Summary: A microservices architecture centered around a distributed message queue to decouple ingestion from delivery, utilizing workers to interface with 3rd-party providers.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request validation.
Notification Service: Validates requests, fetches user preferences, and pushes events to the queue.
Redis Cache: Stores user settings and device tokens to avoid database hits for every message.
PostgreSQL: System of record for user preferences and notification metadata.
Kafka: High-throughput message bus for buffering and prioritizing notifications.
Notification Workers: Consumers that execute the actual delivery logic and handle retries.
Simplicity Audit: This design avoids complex stream processing or multi-region synchronization for the MVP, focusing instead on reliable queuing and worker scaling.
Architecture Decision Rationale:
Why this architecture?: Message queues provide the necessary "buffer" to protect the system against spikes and provider slowness.
Functional Satisfaction: Covers multi-channel support through polymorphic workers and respects preferences via the metadata check.
Non-functional Satisfaction: Kafka provides the required 100k+ QPS throughput and persistence for reliability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global Load Balancer (L7) using Round Robin or Least Connections to distribute traffic to the API Gateway.
Security & Perimeter:
API Gateway: Performs JWT validation and extracts user_id.
Rate Limiting: Tiered limits (e.g., 100 requests/sec for marketing bots, unlimited for internal system-critical alerts).
WAF: Standard protection against SQLi and XSS on the ingestion endpoints.
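
The tiered rate limit above could be enforced with a token bucket, the technique named under Bonus Points. The following is a minimal single-process sketch. The class name, parameters, and injectable clock are assumptions for illustration, and a production gateway would typically keep the bucket state in a shared store rather than in local memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `TokenBucket(rate_per_sec=100, capacity=100)`, a marketing client can send a burst of 100 requests and is then held to 100 requests per second. An "unlimited" tier can simply bypass the check.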

Service

Topology & Scaling: Stateless Notification Services deployed across multiple Availability Zones (AZs). Auto-scaling based on CPU and Request Count.
API Schema Design:
POST /v1/notifications
Protocol: REST/JSON.
Request: { "user_id": "123", "type": "ORDER_CONFIRM", "priority": "high", "content": { "title": "...", "body": "..." } }
Idempotency: Header X-Idempotency-Key required.
Resilience & Reliability:
Circuit Breaker: If Twilio returns 5xx errors above a threshold, the SMS Worker stops sending and redirects to a DLQ or secondary provider.
Timeouts: Aggressive timeouts (2s) for 3rd-party calls to prevent worker exhaustion.
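
A circuit breaker with provider failover might look roughly like the sketch below. It makes one simplifying assumption that differs from the text: the breaker trips on consecutive failures rather than on a 5xx error rate. `CircuitBreaker`, `send_with_failover`, and the `(name, breaker, send_fn)` tuple shape are hypothetical names, not part of any provider SDK.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Half-open: let requests through once the cool-down has passed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cool-down


def send_with_failover(message, providers):
    """Try each (name, breaker, send_fn) in order; return the name that succeeded."""
    for name, breaker, send_fn in providers:
        if not breaker.allow_request():
            continue  # circuit open: skip this provider
        try:
            send_fn(message)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name
    return None  # caller routes the message to a retry topic or the DLQ
```

Here `send_fn` is expected to enforce its own timeout (2s in the text) and raise on failure, so a slow provider counts against the breaker instead of tying up the worker.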

Storage

Access Pattern:
High Read: Fetching user preferences and device tokens.
High Write: Notification delivery logs and status updates.
Database Table Design:
Users: user_id, email, phone_number, created_at.
UserPreferences: user_id, channel_type, is_enabled, updated_at.
DeviceTokens: user_id, platform (iOS/Android), token.
Technical Selection:
PostgreSQL: Chosen for ACID compliance on user settings. Use Partitioning by user_id for scale.
Distribution Logic: Sharding by user_id to ensure all data for a single user resides on one shard, simplifying preference lookups.
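
One simple way to route a user to a shard is a stable hash of user_id taken modulo the shard count. This is a sketch under that assumption. `shard_for` is a hypothetical helper, and the text does not say which sharding scheme is actually used.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Map a user_id string to a shard index using a stable hash.

    Python's built-in hash() is salted per process for strings, so it is
    not suitable here; SHA-256 gives the same result on every host.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing remaps most users when the number of shards changes, so a design that expects to add shards often may prefer consistent hashing or a lookup table instead.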

Cache

Purpose & Justification: Reduces latency for preference lookups (from ~50ms DB to ~2ms Cache) and protects the DB from 100k QPS.
Key-Value Schema:
pref:{user_id} -> JSON blob of preferences.
tokens:{user_id} -> List of active device tokens.
TTL: 24 hours with LRU eviction.
Technical Selection: Redis (Cluster Mode) for high availability and sub-millisecond latency.
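
The cache-aside read path for the `pref:{user_id}` key can be sketched as follows. The cache object here is a plain in-memory stand-in with `get`, `set`, and `delete` methods, not a real Redis client, and `PreferenceReader`, `DictCache`, and `load_from_db` are hypothetical names.

```python
import json


class DictCache:
    """Tiny in-memory stand-in for Redis; the TTL is accepted but not enforced."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ttl):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class PreferenceReader:
    def __init__(self, cache, load_from_db, ttl_seconds=24 * 3600):
        self._cache = cache
        self._load_from_db = load_from_db  # callable: user_id -> dict
        self._ttl = ttl_seconds

    def get(self, user_id):
        key = f"pref:{user_id}"
        cached = self._cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit
        prefs = self._load_from_db(user_id)  # cache miss: read the database
        self._cache.set(key, json.dumps(prefs), self._ttl)
        return prefs

    def invalidate(self, user_id):
        """Drop the cached entry so the next read reloads from the database."""
        self._cache.delete(f"pref:{user_id}")
```

Calling `invalidate` on every preference update is one way to reconcile the 24-hour TTL with the requirement that preference changes be reflected quickly.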

Messaging

Purpose & Decoupling: Kafka acts as the buffer. It decouples the API (ingestion) from the slow 3rd-party networks.
Throughput & Partitioning:
Topics: notif.push, notif.sms, notif.email.
Partition Key: user_id (ensures order of messages for a single user).
Failure Handling:
Retry Queue: Messages that fail due to transient errors (429, 503) are moved to a retry topic with a delay.
DLQ: Messages that fail after 5 retries or have invalid payloads.
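
The retry and DLQ rules above can be expressed as a small routing function. This sketch assumes full-jitter exponential backoff for the retry delay and sends any non-transient status straight to the DLQ. The function names, the base delay, and the cap are illustrative assumptions.

```python
import random

MAX_RETRIES = 5                # DLQ after 5 retries, per the rule above
TRANSIENT_STATUS = {429, 503}  # transient errors, per the rule above


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def route_failure(status_code, attempt):
    """Decide what to do with a failed delivery.

    `attempt` is the number of retries already made (0 for the first failure).
    Returns ("retry", delay_seconds) or ("dlq", None).
    """
    if status_code in TRANSIENT_STATUS and attempt < MAX_RETRIES:
        return ("retry", backoff_delay(attempt))
    return ("dlq", None)
```

The jitter spreads retries out in time, so a provider that has just recovered is not hit by every delayed message at once.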

Data Processing

Processing Model: Stream processing via Notification Workers (Go or Java for high concurrency).
Processing Logic:
Consume message from Kafka.
Double-check cache for most recent opt-out status.
Format payload using a local template engine.
Call 3rd-party SDK.
On success, update status log (async).
Technical Selection: Custom Go-based workers for low memory footprint and excellent concurrency (Goroutines) for handling blocking I/O.
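
The processing steps above can be sketched as a single function with its dependencies passed in. Python is used here purely for illustration (the text selects Go for the real workers). The message fields `id` and `channel`, the callback names, and the `SKIPPED_OPT_OUT` status are assumptions that do not appear in the original design.

```python
def process_message(msg, is_opted_out, render, send, log_status):
    """Handle one message already consumed from the queue (step 1).

    Returns the status that was recorded for the notification.
    """
    user_id, channel = msg["user_id"], msg["channel"]

    # Step 2: re-check the latest opt-out state just before sending.
    if is_opted_out(user_id, channel):
        log_status(msg["id"], "SKIPPED_OPT_OUT")
        return "SKIPPED_OPT_OUT"

    # Step 3: format the payload from the notification type and content.
    payload = render(msg["type"], msg["content"])

    # Step 4: call the provider; exceptions propagate to the retry path.
    send(channel, user_id, payload)

    # Step 5: record success (asynchronous in the design, synchronous here).
    log_status(msg["id"], "SENT")
    return "SENT"
```

Passing `is_opted_out`, `render`, `send`, and `log_status` as callables keeps the step order testable without Kafka, Redis, or a provider SDK.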

Infrastructure (Optional)

Observability:
Metrics: Track "Delivery Latency" (End-to-end) and "Provider Error Rate".
Logging: Structured logs with trace_id propagated from the API Gateway to the Workers.
Wrap Up

Advanced Topics

Trade-offs:
Consistency vs. Availability: We choose Eventual Consistency for delivery logs to ensure the system remains available for sending notifications.
At-least-once Delivery: We accept that a user might rarely receive a duplicate notification if a worker crashes after sending to a provider but before committing the Kafka offset.
Bottleneck Analysis:
Hot Partitions: A celebrity "push all" whose messages share a single partition key could overload one Kafka partition. Optimization: Use a "Broadcast Topic" with no partition key (Round Robin) for large-scale marketing blasts.
Security:
PII (Phone/Email) must be encrypted using AES-256 before being stored in PostgreSQL.
Workers use short-lived IAM roles to access provider secrets in HashiCorp Vault or AWS Secrets Manager.
Optimization:
Batching: Email providers often support batch APIs (send to 1000 users in one call). Workers should aggregate small messages before calling the provider.
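
Grouping messages for a batch API call comes down to slicing the pending list into chunks. This is a minimal sketch. The 1000-message limit is the example figure from the text, and `batches` is a hypothetical helper.

```python
def batches(messages, max_size=1000):
    """Yield consecutive lists of at most `max_size` messages."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(messages), max_size):
        yield messages[start:start + max_size]
```

A real worker would likely also flush a partial batch after a short time limit, so that low-volume periods do not delay delivery while waiting for a batch to fill.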