The Question
Design

Scalable Multi-Channel Notification Engine

Design a high-throughput notification system capable of delivering transactional and marketing messages across Push, Email, and SMS. The system must handle millions of daily requests, manage external provider failures gracefully, prioritize time-sensitive messages like OTPs, and ensure users aren't over-notified through rate limiting.
Kafka
Redis
PostgreSQL
Worker Pools
Third-Party APIs
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected peak volume (notifications per second) and daily active users (DAU)?
Assumption: 10 million notifications per day, with peak traffic of 500 requests per second (RPS).
Latency Requirements: Is there a strict SLA for delivery (e.g., OTPs vs. Marketing)?
Assumption: High priority for transactional/OTP (under 10s), best effort for marketing (under 30 mins).
Reliability and Guarantees: Do we need "exactly-once" delivery or is "at-least-once" acceptable?
Assumption: At-least-once delivery is sufficient; the client handles idempotency if needed.
Provider Strategy: Will we use specific third-party gateways (e.g., Twilio, SendGrid, FCM/APNS)?
Assumption: Yes, the system will interface with standard external providers.

Thinking Process

Core Bottleneck: External provider latency and downtime. Third-party APIs are the slowest and most unreliable part of the chain.
Progressive Logical Flow:
How do we prevent the API from blocking on slow external calls? (Introduce Asynchronous Messaging).
How do we handle different priorities (OTP vs. News)? (Introduce Priority Queues).
How do we ensure we don't get banned by providers or spam users? (Introduce Rate Limiting).
How do we track if a notification actually reached the user? (Introduce Callback/Status Tracking).

Bonus Points

Smart Retries with Exponential Backoff & Jitter: Avoid "thundering herd" problems when a provider like SendGrid recovers from an outage.
Provider Agnostic Abstraction: Implement a "Strategy Pattern" for providers to allow seamless failover between Twilio and Vonage if one fails.
Multi-region Idempotency: Using a global unique request_id stored in a distributed cache to prevent duplicate sends during cross-region retries.
Dead Letter Queues (DLQ): Implementing a structured way to handle permanently failed messages for later manual inspection or bulk reprocessing.
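
The retry and dead-letter ideas above can be sketched together. This is a minimal illustration, not a definitive implementation: TransientProviderError, backoff_delay, send_with_retries, and the base/cap values are all hypothetical names and numbers chosen for the example, and a plain Python list stands in for a real dead letter queue.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical error for retryable provider failures (timeouts, 5xx)."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    # Exponential ceiling: base * 2^attempt, clamped to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so recovering clients spread out.
    return rng.uniform(0, ceiling)


def send_with_retries(send, message, max_attempts=5, sleep=time.sleep, dead_letters=None):
    """Call send(message); retry transient failures, then dead-letter the message."""
    for attempt in range(max_attempts):
        try:
            return send(message)
        except TransientProviderError:
            if attempt == max_attempts - 1:
                break  # out of attempts, fall through to the DLQ
            sleep(backoff_delay(attempt))
    if dead_letters is not None:
        dead_letters.append(message)  # kept for manual inspection or bulk replay
    return None
```

Because each worker draws a random delay below the exponential ceiling, workers that failed at the same moment do not all retry at the same moment when the provider comes back.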
Design Breakdown

Functional Requirements

Support three channels: Push (iOS/Android), Email, and SMS.
Provide a single API endpoint for clients to trigger notifications.
Support message templates to keep payloads small.
Track delivery status (Sent, Delivered, Failed).

Non-Functional Requirements

High Availability: The system must be available to accept requests even if a specific channel provider is down.
Scalability: Must handle bursts during marketing events.
Extensibility: Easy to add new channels (e.g., Slack, WhatsApp) in the future.
Low Latency: Transactional notifications (e.g., OTPs) must be queued ahead of bulk traffic and delivered within the 10-second target.

Estimation

Daily Volume: 10M notifications.
Average Payload: 2KB (Metadata + Body).
Storage: 10M * 2KB = 20GB/day. 1 year retention = ~7.3TB.
Compute: 10M / 86,400 seconds ≈ 116 average RPS. Peak at 5x ≈ 580 RPS, consistent with the roughly 500 RPS peak assumed above. A few small worker clusters can handle this.
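
The arithmetic behind these estimates can be reproduced in a few lines. The inputs come from the figures above; decimal units (1 GB = 1,000,000 KB) are an assumption made for this sketch.

```python
DAILY_NOTIFICATIONS = 10_000_000
AVG_PAYLOAD_KB = 2
PEAK_FACTOR = 5
SECONDS_PER_DAY = 86_400

# Storage: notifications per day times payload size, in decimal units.
daily_storage_gb = DAILY_NOTIFICATIONS * AVG_PAYLOAD_KB / 1_000_000
yearly_storage_tb = daily_storage_gb * 365 / 1_000

# Throughput: average over a full day, then scaled by the peak factor.
avg_rps = DAILY_NOTIFICATIONS / SECONDS_PER_DAY
peak_rps = avg_rps * PEAK_FACTOR

print(f"storage: {daily_storage_gb:.0f} GB/day, {yearly_storage_tb:.1f} TB/year")
print(f"throughput: {avg_rps:.0f} avg RPS, {peak_rps:.0f} peak RPS")
```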

Blueprint

Concise Summary: A microservices-based architecture utilizing a message queue to decouple notification requests from actual delivery, ensuring high availability and fault tolerance.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request validation.
Notification Service: Validates requests, fetches templates, and persists the initial record.
Message Queue: Decouples the ingestion from processing and manages priorities.
Channel Workers: Specialized consumers that format messages and call 3rd-party APIs.
PostgreSQL: Stores notification logs, user preferences, and templates.
Simplicity Audit: This architecture uses standard components (Queue/Worker/DB) without introducing complex stream processing or service meshes that aren't required for 500 RPS.
Architecture Decision Rationale:
Why this architecture?: Asynchronous processing via queues is the industry standard for notification systems to handle the inherent unreliability of external providers.
Functional Satisfaction: Covers all channels via specialized workers and tracks status via a shared database.
Non-functional Satisfaction: Queues provide a buffer for scalability; multiple worker instances provide high availability.

High Level Architecture

Sub-system Deep Dive

Service

Topology: Notification Service and Workers are deployed as Docker containers in a K8s cluster, allowing independent scaling based on queue depth.
API Spec:
POST /v1/notifications: Accepts user_id, template_id, priority, and placeholders.
Protocols: REST for ingestion; Webhooks for provider feedback.

Storage

Data Model:
notifications: id (UUID), user_id, channel_type, status (Pending, Sent, Delivered, Failed), created_at.
templates: id, content_body, channel_type.
Database Logic: PostgreSQL is used for ACID compliance on status updates. Indexing on user_id and created_at for history lookups.
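
With at-least-once delivery, provider webhooks can arrive twice or out of order, so status updates benefit from an explicit transition check. The sketch below mirrors the notifications record in memory. The transition table, the Delivered state (taken from the functional requirements), and the names Status, ALLOWED, and Notification.transition are assumptions made for illustration, not part of the schema above.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    SENT = "Sent"
    DELIVERED = "Delivered"
    FAILED = "Failed"


# Assumed lifecycle: Pending -> Sent -> Delivered, with Failed reachable from both.
ALLOWED = {
    Status.PENDING: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: set(),
    Status.FAILED: set(),
}


@dataclass
class Notification:
    user_id: str
    channel_type: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: Status = Status.PENDING
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def transition(self, new_status: Status) -> bool:
        """Apply the update only if it moves the lifecycle forward."""
        if new_status not in ALLOWED[self.status]:
            return False  # duplicate or late callback: ignore it
        self.status = new_status
        return True
```

In PostgreSQL the same guard could be expressed as a conditional UPDATE on the current status, so a stale callback updates zero rows.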

Cache

Details: Redis is used for:
Rate Limiting: Preventing a single user from receiving too many notifications (Fixed window or Token Bucket).
Template Caching: Reducing DB load for frequently used message formats.
TTL: 24 hours for rate-limiting keys; 1 hour for templates.
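
A fixed-window limiter, one of the two options named above, can be sketched as follows. An in-memory dict stands in for Redis here; in Redis the same idea is commonly built from an INCR on a per-user, per-window key plus an expiry. The class name, the limit, and the window length are assumptions chosen for the example.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` notifications per user in each fixed time window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counts = {}  # (user_id, window index) -> count

    def allow(self, user_id):
        window_index = int(self.clock() // self.window)
        # Drop counters from earlier windows (Redis would expire them via TTL).
        self._counts = {k: v for k, v in self._counts.items() if k[1] == window_index}
        key = (user_id, window_index)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.limit
```

A fixed window is simple but lets a user receive up to twice the limit across a window boundary; a token bucket smooths that out at the cost of slightly more state.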

Messaging

Details: RabbitMQ or Kafka.
Topic Structure: Channels are split into topics: notify.push, notify.email, notify.sms.
Priority: Two queues per channel (e.g., notify.email.high and notify.email.bulk) to ensure OTPs aren't stuck behind newsletters.
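Delivery Guarantee: At-least-once. A worker acknowledges a message (or, with Kafka, commits its offset) only after the provider confirms acceptance, so a crash before that point causes a redelivery rather than a loss.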
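
The two-queue priority scheme and the acknowledge-after-success rule can be illustrated with in-memory queues. This is a sketch only: ChannelQueues, publish, and process_next are hypothetical names, and two deques stand in for the broker's high-priority and bulk queues for one channel.

```python
from collections import deque


class ChannelQueues:
    """High-priority and bulk queues for a single channel (e.g., email)."""

    def __init__(self):
        self.high = deque()
        self.bulk = deque()

    def publish(self, message, priority):
        (self.high if priority == "high" else self.bulk).append(message)

    def process_next(self, handler):
        """Handle one message, always draining the high-priority queue first."""
        for queue in (self.high, self.bulk):
            if queue:
                message = queue[0]   # peek: the message stays queued until acked
                handler(message)     # if this raises, the message is not removed
                queue.popleft()      # "ack" only after the handler succeeds
                return message
        return None
```

Strict priority like this can starve the bulk queue during a long burst of high-priority traffic; running separate worker pools per queue is one way to avoid that.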
Wrap Up

Advanced Topics

Monitoring:
Metrics: Queue depth (critical for scaling), Provider error rates (5xx), Latency from ingestion to delivery.
Tools: Prometheus for metrics, Grafana for visualization.
Trade-offs:
Consistency vs. Availability: We choose Availability and Eventual Consistency. A notification might be sent twice (at-least-once) to ensure it is never lost.
Bottlenecks: External provider API limits. We must implement circuit breakers that pause sending to a provider once it starts returning HTTP 429 ("Too Many Requests"), then resume after a cooldown.
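
A circuit breaker of the kind mentioned above can be sketched in a few lines. The threshold, the cooldown, and the choice to count both 429 and 5xx responses as failures are assumptions made for this example, as are the class and method names.

```python
import time


class CircuitBreaker:
    """Stop calling a provider after repeated 429/5xx responses, then retry later."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, status_code):
        if status_code == 429 or status_code >= 500:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
        else:
            self.failures = 0
            self.opened_at = None  # a success closes the circuit
```

A worker would call allow_request() before each provider call and record() with the response status afterwards, leaving messages in the queue while the circuit is open.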
Failure Handling:
Worker Failure: Message stays in queue and is re-delivered to another worker.
Provider Failure: Automatic failover to a secondary provider (e.g., if SendGrid is down, use Amazon SES).
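
Provider failover fits naturally with a strategy-style interface: each provider exposes the same send method, and the caller walks an ordered list. The sketch below uses stub providers rather than real SDK calls; ProviderError, StubProvider, and send_with_failover are hypothetical names introduced for the example.

```python
class ProviderError(Exception):
    """Hypothetical error raised when a provider cannot accept a message."""


class StubProvider:
    """Stand-in for a real provider client; all providers share send(to, body)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.sent = []

    def send(self, to, body):
        if not self.healthy:
            raise ProviderError(f"{self.name} unavailable")
        self.sent.append((to, body))


def send_with_failover(providers, to, body):
    """Try providers in order; return the name of the first one that succeeds."""
    errors = []
    for provider in providers:
        try:
            provider.send(to, body)
            return provider.name
        except ProviderError as exc:
            errors.append(f"{provider.name}: {exc}")  # remember why, try the next
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

For example, send_with_failover([primary, secondary], "a@example.com", "hi") returns the secondary's name when the primary is down. Adding a new provider only requires another object with the same send method.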
Alternatives: Using AWS SNS/SQS as a managed alternative to reduce operational overhead (good for startups).