The Question
Design

Scalable Multi-Channel Notification System

Design a highly available and scalable notification system capable of delivering millions of messages across Push (iOS/Android), SMS, and Email. The system must handle varying priorities (e.g., OTP vs Marketing), ensure at-least-once delivery, manage user preferences, and gracefully handle third-party provider failures or rate limits. Detail the architectural choices for decoupling, message persistence, and handling high-concurrency peak loads.
Kafka
Redis
PostgreSQL
Cassandra
APNs
FCM
Twilio
SendGrid
JWT
OAuth2
gRPC
Flink
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected volume of notifications per day? Are there specific peak periods (e.g., marketing blasts or breaking news)?
Latency SLAs: Are there different priority levels for notifications? For example, should an OTP (One-Time Password) have a stricter latency target than a weekly newsletter?
Delivery Guarantees: Is "exactly-once" delivery required, or is "at-least-once" acceptable? (Usually, at-least-once is standard for notifications to avoid the overhead of distributed transactions).
Third-Party Providers: Do we have specific vendors (e.g., Twilio for SMS, SendGrid for Email) or should the system be provider-agnostic to allow for failover?
User Preferences: Does the system need to manage user opt-in/opt-out and preferred channels (e.g., "only send SMS if push fails")?
Assumptions:
Scale: 100 million notifications per day.
Latency: High-priority messages (OTP) < 5 seconds; Low-priority < 30 minutes.
Reliability: At-least-once delivery is the primary goal.
Providers: We will use APNs (iOS), FCM (Android), Twilio (SMS), and SendGrid (Email).

Thinking Process

Decoupling is King: How do we prevent a slow third-party provider (like a lagging SMS gateway) from backing up the entire system?
Priority Queueing: How do we ensure a massive marketing blast doesn't delay critical account security alerts?
Idempotency & Deduplication: How do we prevent sending the same notification twice if a worker retries a failed task?
Rate Limiting & Throttling: How do we protect both our internal services and our users from notification fatigue?

Bonus Points

Smart Retries with Exponential Backoff: Implementing jittered backoff logic specifically tuned to different error types (e.g., 429 Too Many Requests vs. 500 Internal Server Error).
Provider Failover Strategy: Real-time switching between SMS/Email providers based on health checks and delivery success rates.
Fan-out Optimization: Handling notifications for "Million-follower" accounts using a hybrid push/pull model or specialized shard routing.
Message Deduplication via Bloom Filters: Using memory-efficient data structures to check for duplicate requests at the entry point for high-velocity streams.
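As an illustration of the jittered backoff idea above, here is a minimal Python sketch. The policy table, its base and cap values, and the function name backoff_delay are assumptions made for this example and are not taken from any provider's documentation. It uses "full jitter": the delay is drawn uniformly between zero and an exponentially growing ceiling, and the ceiling grows faster for 429 responses than for 500 responses.

```python
import random

# Hypothetical policy table: base delay and cap (in seconds) per status code.
BACKOFF_POLICY = {
    429: {"base": 2.0, "cap": 300.0},  # rate limited: back off harder
    500: {"base": 0.5, "cap": 60.0},   # transient server error: retry sooner
}
DEFAULT_POLICY = {"base": 1.0, "cap": 120.0}


def backoff_delay(status_code, attempt, retry_after=None, rng=random):
    """Return the seconds to wait before retry number `attempt` (1-based)."""
    # If the provider told us how long to wait, honor that value.
    if retry_after is not None:
        return float(retry_after)
    policy = BACKOFF_POLICY.get(status_code, DEFAULT_POLICY)
    # Exponential ceiling: base * 2^(attempt - 1), clamped to the cap.
    ceiling = min(policy["cap"], policy["base"] * (2 ** (attempt - 1)))
    # Full jitter: pick uniformly in [0, ceiling] so retries do not synchronize.
    return rng.uniform(0.0, ceiling)
```

A worker would sleep for backoff_delay(status, attempt) before re-enqueueing or retrying the send; the randomization keeps many workers from retrying against a recovering provider at the same instant.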
Design Breakdown

Functional Requirements

Core Use Cases:
Send push notifications to iOS and Android devices.
Send SMS messages to mobile numbers.
Send emails to user addresses.
Support template-based message generation.
Provide an endpoint for users to manage notification preferences (opt-in/out).
Scope Control:
In-scope: API for ingestion, prioritization, channel-specific workers, and basic delivery tracking.
Out-of-scope: Complex analytics/click-through tracking, advanced user segmentation/targeting (marketing engine), and in-app message inbox UI.

Non-Functional Requirements

Scale: Support 100M+ notifications/day with peaks above 10k QPS (about 11.5k QPS at a 10x peak factor).
Latency: P99 < 5 seconds for high-priority messages.
Availability & Reliability: 99.99% availability; messages should not be lost once accepted by the API.
Consistency: Eventual consistency for user preference updates.
Fault Tolerance: Handle third-party provider outages gracefully via retries and queues.
Security: Secure storage of device tokens and PII (Phone numbers/Emails).

Estimation

Traffic Estimation:
100M notifications / 86,400 seconds ≈ 1,150 average QPS.
Peak QPS (10x) ≈ 11,500 QPS.
Storage Estimation:
Metadata (User ID, Token, Status) ≈ 500 bytes per notification.
100M * 500 bytes = 50 GB/day.
Retain logs for 30 days = 1.5 TB.
Bandwidth Estimation:
Inbound: 11,500 QPS * 1 KB/request ≈ 11.5 MB/s.
Outbound: Similar, plus overhead for provider-specific payloads.
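
The arithmetic above can be checked with a few lines of Python. The constant names are made up for this sketch, and decimal units are assumed (1 KB = 1,000 bytes, 1 GB = 10^9 bytes). The unrounded average is closer to 1,157 QPS; the estimate rounds it down to 1,150 for convenience.

```python
NOTIFICATIONS_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 10
BYTES_PER_RECORD = 500
RETENTION_DAYS = 30
REQUEST_SIZE_BYTES = 1_000  # 1 KB, decimal units assumed

# Traffic: average and peak queries per second.
avg_qps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

# Storage: metadata written per day and kept for the retention window.
daily_storage_gb = NOTIFICATIONS_PER_DAY * BYTES_PER_RECORD / 1e9
retained_tb = daily_storage_gb * RETENTION_DAYS / 1e3

# Bandwidth: inbound bytes per second at peak.
peak_inbound_mb_s = peak_qps * REQUEST_SIZE_BYTES / 1e6

print(f"avg QPS      ~ {avg_qps:,.0f}")
print(f"peak QPS     ~ {peak_qps:,.0f}")
print(f"storage/day  = {daily_storage_gb:.0f} GB")
print(f"30-day logs  = {retained_tb:.1f} TB")
print(f"peak inbound ~ {peak_inbound_mb_s:.1f} MB/s")
```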

Blueprint

Concise Summary: A microservices-based architecture utilizing a distributed message queue (Kafka) to decouple ingestion from delivery. Workers consume from prioritized topics and dispatch to third-party providers.
Major Components:
Notification API: The entry point for all notification requests, performing validation and authentication.
Task Queue (Kafka): Acts as a buffer and provides prioritization by separating high and low priority traffic into different topics.
Workers: Scalable consumer groups that handle template rendering and provider-specific API calls (APNs, FCM, etc.).
Metadata Store: Persists user preferences, device tokens, and notification status.
Simplicity Audit: This design avoids complex real-time stream processing or custom delivery protocols, relying on industry-standard providers and proven queueing patterns.
Architecture Decision Rationale:
Why this architecture?: Queues are essential because third-party providers have unpredictable latencies and rate limits. Kafka allows for high throughput and replayability if a worker group fails.
Functional Satisfaction: Covers all 4 channels (iOS, Android, SMS, Email) through specialized worker logic.
Non-functional Satisfaction: Scalable horizontally; provides fault tolerance through persistent queueing.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: A CDN is not heavily utilized for notification sending, but Geo-DNS routes requests to the API Gateway in the nearest regional data center.
Security & Perimeter:
API Gateway: Handles SSL termination and OAuth2/JWT validation.
Rate Limiting: Implemented at the API Gateway to prevent any single internal service from overwhelming the notification system (e.g., a buggy script sending millions of emails).

Service

Topology & Scaling: Stateless Notification Services deployed in multiple Availability Zones (AZs). Autoscaling is driven by Requests-per-Second (RPS) and CPU utilization.
API Schema Design:
POST /v1/notifications
Protocol: REST/JSON
Request: { "user_id": "string", "priority": "high/low", "content": { "title": "...", "body": "..." }, "channels": ["push", "sms"] }
Idempotency: Supports X-Idempotency-Key header to prevent duplicate sends on client retries.
Resilience & Reliability:
Circuit Breaker: If Twilio returns 5xx errors consistently, the SMS Worker trips the circuit and potentially routes to a backup provider.
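
The circuit breaker behavior described above could look roughly like the following Python sketch. The class, the thresholds, and the send_sms helper are hypothetical; the primary and backup arguments stand in for provider client calls (for example a Twilio wrapper and a second vendor) and are assumed to raise an exception on a 5xx response.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: once the timeout has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trip (or re-trip) the circuit and restart the timeout.
            self.opened_at = self.clock()


def send_sms(message, primary, backup, breaker):
    """Use the primary provider unless its circuit is open; otherwise use the backup."""
    if breaker.allow_request():
        try:
            result = primary(message)
            breaker.record_success()
            return result
        except Exception:
            # In this sketch any exception counts as a provider failure.
            breaker.record_failure()
    return backup(message)
```

While the circuit is open the primary provider is not called at all, which gives it time to recover and keeps worker threads from blocking on timeouts.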

Storage

Access Pattern:
Heavily Read-optimized for user preferences and device tokens (looked up for every notification).
Heavily Write-optimized for delivery logs.
Database Table Design:
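UserDevices: (user_id, device_token) as composite PK, platform (ios/android), last_active. A user can register several devices, so user_id alone cannot be the key.
UserPreferences: (user_id, channel_type) as composite PK, enabled (bool). One row per user per channel.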
Technical Selection:
PostgreSQL: Stores user preferences and device tokens. Relational integrity ensures users aren't messaged on deleted accounts.
Cassandra/ClickHouse: Used for Delivery Logs to handle high-volume write throughput for auditing.
Distribution Logic: Sharded by user_id to ensure all data for a single user resides on the same partition.
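
To make the table design concrete, here is a small runnable sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. It assumes composite primary keys (user_id plus device_token, and user_id plus channel_type) so that one user can register several devices and hold one preference row per channel. The helper names, and the choice to treat a missing preference row as opted in, are assumptions for illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_devices (
    user_id      TEXT NOT NULL,
    device_token TEXT NOT NULL,
    platform     TEXT NOT NULL CHECK (platform IN ('ios', 'android')),
    last_active  TEXT,
    PRIMARY KEY (user_id, device_token)
);
CREATE TABLE user_preferences (
    user_id      TEXT NOT NULL,
    channel_type TEXT NOT NULL,
    enabled      INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, channel_type)
);
"""


def open_store():
    """Create an in-memory database with both tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def push_targets(conn, user_id):
    """Return (platform, token) pairs, or [] if the user disabled push."""
    row = conn.execute(
        "SELECT enabled FROM user_preferences "
        "WHERE user_id = ? AND channel_type = 'push'",
        (user_id,),
    ).fetchone()
    # Assumption: no preference row means the user has not opted out.
    if row is not None and not row[0]:
        return []
    return conn.execute(
        "SELECT platform, device_token FROM user_devices "
        "WHERE user_id = ? ORDER BY device_token",
        (user_id,),
    ).fetchall()
```

Both lookups filter on user_id, which is why sharding by user_id keeps a notification's preference check and token lookup on a single partition.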

Cache

Purpose & Justification: Reduce DB load for frequent lookups of device tokens and rate-limit counters.
Key-Value Schema:
Key: token:{user_id}, Value: List<DeviceTokens>, TTL: 24h (updated on app launch).
Key: rl:{user_id}:{channel}, Value: Count, TTL: 1h.
Technical Selection: Redis. High performance and supports complex data types like Sets for tokens.
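
The rl:{user_id}:{channel} counter can be sketched as a fixed-window rate limiter. The version below keeps the counters in an in-memory dict so that it runs without a Redis server; in Redis the same pattern is typically an INCR on the key followed by an EXPIRE when the key is first created. The class name and the limit values are hypothetical.

```python
import time


class FixedWindowLimiter:
    """Per-user, per-channel counter that resets when its TTL expires."""

    def __init__(self, limit, window_seconds=3600, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._store = {}  # key -> (count, expires_at); stands in for Redis

    def allow(self, user_id, channel):
        key = f"rl:{user_id}:{channel}"
        now = self.clock()
        count, expires_at = self._store.get(key, (0, 0.0))
        if now >= expires_at:
            # Key is missing or expired: start a fresh window with a new TTL.
            count, expires_at = 0, now + self.window
        count += 1
        self._store[key] = (count, expires_at)
        return count <= self.limit
```

A worker would call allow(user_id, channel) before dispatching and drop or defer the notification when it returns False. A fixed window can admit up to twice the limit across a window boundary; a sliding window or token bucket avoids that at the cost of more state.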

Messaging

Purpose & Decoupling: Decouples the notification request from the slow process of network I/O with third-party vendors.
Event / Topic Schema:
Topics: notification.high_priority, notification.low_priority, notification.sms, notification.email.
Throughput & Partitioning: Partitioned by user_id to maintain message ordering per user (e.g., "Order Placed" must arrive before "Order Delivered").
Technical Selection: Kafka. Chosen for its high-throughput capabilities and durability.
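
A minimal sketch of the routing decision: pick the topic from the request priority, and pick the partition from a stable hash of user_id so that one user's messages stay in order. The partition count and function names are assumptions. CRC32 is used here only because it is deterministic across processes (Python's built-in hash() is salted per process); a real Kafka client applies its own partitioner when user_id is supplied as the message key.

```python
import zlib

NUM_PARTITIONS = 12  # hypothetical partition count per topic

TOPIC_BY_PRIORITY = {
    "high": "notification.high_priority",
    "low": "notification.low_priority",
}


def topic_for(priority):
    """Map a request priority to its Kafka topic name."""
    try:
        return TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}")


def partition_for(user_id, num_partitions=NUM_PARTITIONS):
    """Stable mapping: the same user always lands on the same partition."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


def route(request):
    """Return the (topic, partition) pair for a notification request."""
    return topic_for(request["priority"]), partition_for(request["user_id"])
```

Because high- and low-priority traffic sit in separate topics, each can have its own consumer group and scaling policy, so a marketing blast only lengthens the low-priority backlog.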

Data Processing

Processing Model: Stream processing for real-time delivery tracking and aggregation.
Processing DAG: Provider Webhook -> Kafka (Status Topic) -> Flink/Spark -> Log DB.
Technical Selection: Flink, though it is optional. For the MVP, simple worker services (Python/Go) consuming from Kafka are sufficient.

Infrastructure (Optional)

Observability:
Metrics: Monitor "Queue Depth" (crucial for detecting bottlenecks) and "Provider Error Rates".
Tracing: Jaeger/OpenTelemetry to trace a notification from API call to provider submission.
Wrap Up

Advanced Topics

Trade-offs: We chose At-least-once over Exactly-once. Exactly-once requires expensive two-phase commits or heavy distributed locking, which would severely reduce throughput. We handle duplicates at the consumer level using idempotency keys.
Reliability: If a provider is down, messages stay in Kafka. We can add a "Dead Letter Queue" (DLQ) for messages that fail after 5 retries for manual inspection.
Bottleneck Analysis: The biggest bottleneck is usually the third-party provider's rate limits. We implement Client-side Throttling in our workers to stay within Twilio/FCM quotas.
Security: Device tokens and PII are encrypted at rest (AES-256) and in transit (TLS 1.3).
Distinguishing Insights:
Smart Batching: For low-priority emails, workers can batch messages to SendGrid to reduce the number of API calls and improve throughput.
Adaptive Throttling: The system can dynamically lower the ingestion rate if it detects a global spike in provider latency.
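
To tie together the at-least-once trade-off and the DLQ rule, here is a small sketch of a consumer loop that skips messages whose idempotency key has already been delivered and moves a message to a dead-letter list after 5 failed attempts. The message shape, the process function, and the in-memory seen set are assumptions; a real worker would keep seen keys in a shared store such as Redis with a TTL and would wait between attempts.

```python
MAX_ATTEMPTS = 5  # after this many failures the message goes to the DLQ


def process(messages, send, seen=None):
    """Deliver each idempotency key once; return (delivered_keys, dead_letters)."""
    seen = set() if seen is None else seen
    delivered, dead_letters = [], []
    for msg in messages:
        key = msg["idempotency_key"]
        if key in seen:
            continue  # duplicate caused by a retry or redelivery
        for _attempt in range(MAX_ATTEMPTS):
            try:
                send(msg)
            except Exception:
                continue  # try again; a real worker would back off here
            seen.add(key)
            delivered.append(key)
            break
        else:
            # The loop finished without a successful send.
            dead_letters.append(msg)
    return delivered, dead_letters
```

A crash between send(msg) and seen.add(key) can still produce a duplicate on restart, which is exactly the at-least-once behavior the design accepts.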