The Question
Design

Scalable Notification System Design

Design a high-volume notification system capable of delivering 100M+ messages daily across Push, Email, and SMS. The system must handle high-priority transactional alerts (like OTPs) alongside lower-priority marketing campaigns, respect user opt-out preferences, and maintain high availability even during third-party provider outages. Address trade-offs between delivery latency, cost, and reliability.
Kafka
Redis
PostgreSQL
DynamoDB
Twilio
SendGrid
FCM
APNS
gRPC
Kubernetes
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected volume of notifications per day, and what is the peak QPS?
Delivery Guarantees: Is "at-least-once" delivery sufficient, or do we require "exactly-once" (which is significantly harder with 3rd party providers)?
Latency SLAs: Are there strict requirements for delivery time (e.g., OTPs must arrive within 10 seconds)?
User Preferences: Should the system handle opt-in/opt-out settings and quiet hours?
Prioritization: Do we need to support different priority levels (e.g., transactional OTPs vs. marketing promos)?
Assumptions for MVP:
Scale: 100 million notifications per day.
Latency: Near real-time (< 5 seconds for 99% of notifications).
Channels: Push (FCM/APNS), Email (SendGrid/SES), SMS (Twilio).
Reliability: At-least-once delivery guarantee.
Prioritization: High-priority (OTP) vs. Low-priority (Marketing).

Thinking Process

How do we handle massive spikes without crashing downstream providers? Use a message queue to buffer requests and implement rate limiting per channel.
How do we ensure notifications aren't lost if a worker fails? Implement persistent queuing (Kafka/SQS) and a retry mechanism with exponential backoff.
How do we manage user channel preferences and templates? Centralize user settings in a high-read-optimized database and use a template engine to separate content from logic.
How do we track the status of a notification across various vendors? Implement a state machine and a feedback loop (webhooks) to track "Sent -> Delivered -> Opened" status.
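The retry mechanism with exponential backoff mentioned above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `RetryableError`, `backoff_delay`, `send_with_retry`, and the default base, cap, and attempt count are all assumptions introduced here, and the random "full jitter" is an added choice so that many workers do not retry in lockstep.

```python
import random
import time
from typing import Callable, Optional


class RetryableError(Exception):
    """Hypothetical marker for transient failures (e.g. a provider 5xx)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def send_with_retry(
    send: Callable[[], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Call `send` until it succeeds or attempts run out.

    Returns True on success and False when the caller should hand the
    message to a dead letter queue.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except RetryableError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Full jitter: wait a random time up to the exponential bound.
            sleep(rng.uniform(0, backoff_delay(attempt)))
    return False
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting; a worker would leave both at their defaults.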

Bonus Points

Vendor Agnostic Routing: Implement a "Provider Registry" to dynamically switch between providers (e.g., Twilio to Plivo) if one experiences an outage or higher costs.
Smart Throttling & Batching: Automatically batch non-urgent marketing emails to reduce connection overhead and respect provider rate limits.
Idempotency Keys: Use a unique hash of (UserID + NotificationType + Content) to prevent accidental double-sending within a short window.
Deliverability Analytics: Real-time monitoring of bounce rates and delivery failures to proactively detect if IPs are blacklisted by ISPs.
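As an illustration of the idempotency-key idea above, the sketch below hashes (UserID + NotificationType + Content) with SHA-256 and checks the key against a short time window. The names `idempotency_key` and `DedupWindow`, the 5-minute default window, and the in-memory dictionary (standing in for a shared store with per-key expiry) are assumptions made for this example.

```python
import hashlib
import time
from typing import Dict, Optional


def idempotency_key(user_id: str, notification_type: str, content: str) -> str:
    """Stable hash of the three fields that identify a logical notification."""
    h = hashlib.sha256()
    for field in (user_id, notification_type, content):
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


class DedupWindow:
    """In-memory stand-in for a shared key store with expiry."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self._seen: Dict[str, float] = {}  # key -> expiry timestamp

    def first_time(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if `key` has not been seen within the window, and record it."""
        now = time.time() if now is None else now
        expires_at = self._seen.get(key)
        if expires_at is not None and expires_at > now:
            return False  # duplicate inside the window: skip sending
        self._seen[key] = now + self.window
        return True
```

A worker would call `first_time(idempotency_key(...))` before dispatching and drop the message when it returns False. In a multi-worker deployment the check-and-set would need to be atomic in the shared store, which this single-process sketch does not address.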
Design Breakdown

Functional Requirements

Core Use Cases:
Trigger notifications via internal API.
Support Push, Email, and SMS channels.
Manage and inject variables into message templates.
Respect user opt-out preferences and channel settings.
Scope Control:
In-scope: API for ingestion, queuing, worker delivery, and basic status tracking.
Out-of-scope: In-app notification center UI, complex AI-based scheduling, rich media editing tools.
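Two of the core use cases above, injecting variables into templates and respecting opt-outs, can be sketched with the standard library alone. The template text, the `TEMPLATES` dictionary, and the helper names are hypothetical; treating a channel with no stored preference as opted out is also an assumption, and a real system might default differently for transactional messages.

```python
from string import Template
from typing import Dict, List, Mapping

# Hypothetical in-process template store; real templates would come from the Template Store.
TEMPLATES: Dict[str, Template] = {
    "ORDER_SHIPPED": Template("Hi $name, your order $order_id has shipped."),
}


def render(template_id: str, variables: Mapping[str, str]) -> str:
    """Fill a template; a missing variable raises KeyError rather than sending a broken message."""
    return TEMPLATES[template_id].substitute(variables)


def allowed_channels(requested: List[str], prefs: Mapping[str, bool]) -> List[str]:
    """Keep only channels the user has enabled (assumption: no record means opted out)."""
    return [channel for channel in requested if prefs.get(channel, False)]
```

For example, `render("ORDER_SHIPPED", {"name": "Ada", "order_id": "A-17"})` returns `"Hi Ada, your order A-17 has shipped."`.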

Non-Functional Requirements

Scale: Support roughly 1.2k average QPS with peaks up to about 12k QPS (10x average).
Latency: Ingestion latency < 100ms; end-to-end delivery < 5s for high priority.
Availability & Reliability: 99.99% availability for the ingestion API; at-least-once delivery.
Consistency: Eventual consistency for user preference updates is acceptable.
Security & Privacy: Support PII encryption (e.g., phone numbers/emails) at rest and TLS in transit.

Estimation

Traffic: 100M notifications / 86,400 s ≈ 1,200 avg QPS. Peak (10x) ≈ 12,000 QPS.
Storage:
Metadata: 1KB per notification. 100M * 1KB = 100GB/day.
Retention: 30 days of logs = 3TB.
Bandwidth: 1,200 QPS * 1KB ≈ 1.2MB/s (negligible for modern infra).
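The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. The sketch uses decimal units (1 KB = 1,000 bytes); the variable names are chosen for this example only.

```python
SECONDS_PER_DAY = 86_400

notifications_per_day = 100_000_000
bytes_per_notification = 1_000   # 1 KB of metadata per notification
peak_factor = 10                 # assumed peak-to-average ratio
retention_days = 30

avg_qps = notifications_per_day / SECONDS_PER_DAY          # about 1,157, rounded up to 1,200 in the text
peak_qps = avg_qps * peak_factor                           # about 11,574, rounded to 12,000
daily_storage_gb = notifications_per_day * bytes_per_notification / 1e9   # 100 GB/day
retained_tb = daily_storage_gb * retention_days / 1e3      # 3 TB for 30 days
bandwidth_mb_s = avg_qps * bytes_per_notification / 1e6    # about 1.2 MB/s

print(f"avg {avg_qps:.0f} QPS, peak {peak_qps:.0f} QPS")
print(f"{daily_storage_gb:.0f} GB/day, {retained_tb:.0f} TB retained, {bandwidth_mb_s:.2f} MB/s")
```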

Blueprint

Concise Summary: A microservices-based architecture where an Ingestion API validates requests and pushes them to a distributed message queue (Kafka). Specialized workers consume from the queue, fetch user preferences/templates, and dispatch to 3rd party providers.
Major Components:
Notification API: Validates incoming requests and generates unique Notification IDs.
Kafka (Messaging Layer): Decouples ingestion from delivery and provides persistence for retries.
Notification Workers: Multi-threaded consumers that handle the heavy lifting of calling external APIs.
User & Template Store: Provides user preferences and pre-defined message templates.
Simplicity Audit: This design avoids complex distributed transactions by using an event-driven approach with a durable queue, satisfying the MVP's reliability needs without over-engineering.
Architecture Decision Rationale:
Why this architecture: Message queues are the standard for high-volume notification systems to handle bursty traffic and provide fault tolerance.
Functional Satisfaction: Covers all channels and preferences via a modular worker/service approach.
Non-functional Satisfaction: Scalability is achieved by partitioning Kafka and horizontal scaling of workers.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Internal services call the API via regional Load Balancers.
Security & Perimeter:
API Gateway: Handles AuthN/AuthZ for internal services.
Rate Limiting: Applied at the service level to prevent a single internal "rogue" service from exhausting the notification quota.
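One common way to implement per-caller rate limiting like this is a token bucket per internal service. The sketch below is one possible shape, not the document's prescribed design: `TokenBucket`, `PerCallerLimiter`, and their parameters are hypothetical, and state is kept in process memory, whereas a multi-instance API would need a shared store.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


class PerCallerLimiter:
    """One bucket per calling service, created lazily on first request."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, caller: str) -> bool:
        bucket = self._buckets.get(caller)
        if bucket is None:
            bucket = self._buckets[caller] = TokenBucket(self._rate, self._capacity, self._clock)
        return bucket.try_acquire()
```

Because each caller has its own bucket, a service that exhausts its tokens is rejected without affecting the others.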

Service

Topology & Scaling: Stateless microservices deployed in Kubernetes (K8s). Scale based on CPU and Kafka Consumer Lag.
API Schema Design:
POST /v1/notifications
Protocol: REST/gRPC
Request: { "user_id": "u1", "type": "ORDER_SHIPPED", "priority": "high", "payload": {...} }
Idempotency: idempotency_key header required.
Resilience & Reliability:
Retries: Workers use exponential backoff for 5xx errors from providers.
Circuit Breaker: If a provider such as Twilio is down, open the circuit to stop calling it and route the affected messages to the Dead Letter Queue (DLQ).
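A circuit breaker along these lines can be sketched as a small state holder that workers consult before each provider call. The class below is a simplified illustration under stated assumptions: the name `CircuitBreaker`, the failure threshold, and the reset timeout are hypothetical, and the half-open state lets any call through after the timeout rather than exactly one trial call.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures and allows trial calls after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()
```

A worker would check `allow_request()` before calling the provider, then report the outcome with `record_success()` or `record_failure()`; when the check returns False the message goes to the DLQ instead.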

Storage

Access Pattern:
User Preferences: High-read, low-write.
Notification Logs: High-write, sequential.
Database Table Design:
User Preferences: (user_id, channel) as composite PK, enabled, quiet_hour_start, quiet_hour_end.
Notification Status: notification_id (PK), user_id, channel, status (PENDING, SENT, DELIVERED, FAILED), retry_count.
Technical Selection:
PostgreSQL: For user preferences and templates (Relational consistency needed).
Cassandra/DynamoDB: For notification status/logs to handle high-write throughput and TTL-based expiry.
Distribution Logic: Shard Notification Status DB by user_id to allow efficient retrieval of a user's notification history.
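The status column of the Notification Status table behaves like a small state machine. The sketch below encodes one plausible set of transitions; the transition table and the choice to silently ignore illegal or repeated updates (so that duplicate or out-of-order events cannot move a record backwards) are assumptions of this example rather than something the design specifies.

```python
from enum import Enum
from typing import Dict, Set


class Status(Enum):
    PENDING = "PENDING"
    SENT = "SENT"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"


# Assumed legal transitions; DELIVERED and FAILED are terminal.
ALLOWED: Dict[Status, Set[Status]] = {
    Status.PENDING: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: set(),
    Status.FAILED: set(),
}


def apply_event(current: Status, new: Status) -> Status:
    """Return the next status, keeping `current` when the update is stale or repeated."""
    if new in ALLOWED[current]:
        return new
    return current
```

With at-least-once delivery the same update can arrive more than once, so an idempotent `apply_event` keeps the stored status stable under replays.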

Cache

Purpose & Justification: Reduce DB load for frequently accessed user preferences and message templates.
Key-Value Schema:
Key: pref:{user_id}, Value: JSON (preferences), TTL: 24h.
Key: tmpl:{template_id}, Value: String (body), TTL: 1h.
Technical Selection: Redis. Provides sub-millisecond latency and simple eviction policies (LRU).
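The read path for the `pref:{user_id}` key can be sketched as a cache-aside lookup. To keep the example self-contained it uses a small in-memory `TTLCache` class as a stand-in for Redis; that class, `get_preferences`, and the `load_from_db` callback are hypothetical names, not part of any real client library.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

PREF_TTL_SECONDS = 24 * 60 * 60  # the 24h TTL from the key schema


class TTLCache:
    """Minimal in-memory stand-in for a key-value cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expire lazily on read
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)


def get_preferences(user_id: str, cache: TTLCache,
                    load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"pref:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    prefs = load_from_db(user_id)
    cache.set(key, prefs, PREF_TTL_SECONDS)
    return prefs
```

With a TTL this long, a write path that deletes the key when a user changes preferences would keep opt-outs from lagging by up to a day; that invalidation step is not shown here.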

Messaging

Purpose & Decoupling: Kafka acts as a buffer. Ingestion Service doesn't wait for 3rd party providers.
Throughput & Partitioning:
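Use user_id as the partition key to preserve ordering for a single user within a topic. Ordering is not guaranteed across topics, so an OTP on the high-priority topic and a promo on the low-priority topic are not ordered relative to each other.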
Separate topics for high_priority and low_priority to prevent marketing blasts from delaying OTPs.
Failure Handling: Failed messages after N retries are moved to a Dead Letter Queue (DLQ) for manual inspection or secondary retry logic.
Technical Selection: Kafka. High throughput and high availability.
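The routing rules above (topic by priority, partition by user_id) can be illustrated without a Kafka client. In this sketch `topic_for` and `partition_for` are hypothetical helpers, and CRC32 is used only because it is stable across processes; real Kafka clients apply their own hash to the message key, so in practice passing user_id as the key is enough.

```python
import zlib

# Topic names taken from the partitioning notes; the priority labels are assumed.
TOPICS = {"high": "high_priority", "low": "low_priority"}


def topic_for(priority: str) -> str:
    """Map a request priority to its topic; unknown priorities raise KeyError."""
    return TOPICS[priority]


def partition_for(user_id: str, num_partitions: int) -> int:
    """Deterministically map a user to a partition so that user's messages stay ordered."""
    # Python's built-in hash() is salted per process for strings, so it is avoided here.
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions
```

Every message for the same user lands on the same partition of a given topic, which is what gives per-user ordering within that topic.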

Infrastructure (Optional)

Observability:
Metrics: Track delivery_latency, error_rate_by_provider, and queue_depth.
Distributed Tracing: Trace a notification from the internal trigger through the queue to the final provider call.
Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for delivery status. A user might see a "Sent" status a few seconds before the provider actually confirms delivery.
Reliability: We use at-least-once delivery. This means a user may occasionally receive a duplicate notification if a worker crashes after sending but before acknowledging the message in Kafka.
Bottleneck Analysis: The main bottleneck is often the 3rd Party Provider rate limits. We solve this by implementing per-provider rate limiters in our workers.
Distinguishing Insights:
Smart Selection: If a user is active on the app (checked via a Presence Service), we prioritize Push over SMS to save costs.
Feedback Loop: Use webhooks from SendGrid/Twilio to update the Notification Status DB. This allows us to prove delivery to the business side.
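The Smart Selection rule above can be sketched as a small decision function. The presence lookup is passed in as a callback because the Presence Service interface is not specified; `choose_channel` and the fallback order when the user is inactive are assumptions of this example.

```python
from typing import Callable, List, Optional


def choose_channel(user_id: str, enabled: List[str],
                   is_active: Callable[[str], bool]) -> Optional[str]:
    """Prefer Push for users currently active in the app, otherwise fall back to SMS."""
    if "push" in enabled and is_active(user_id):
        return "push"  # cheaper than SMS and likely to be seen right away
    if "sms" in enabled:
        return "sms"
    if "push" in enabled:
        return "push"  # assumed fallback: inactive user with no SMS channel
    return None        # nothing enabled: do not send
```

The function takes the user's enabled channels as input, so opt-out filtering is expected to have happened before it is called.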