The Question
DesignReal-time Ad Frequency Capping System
Design a high-scale system to enforce frequency caps for digital advertisements. A frequency cap limits the number of times a specific user sees a specific ad or campaign within a set time window (e.g., no more than 3 views per 24 hours). The system must handle 500,000+ QPS with sub-10ms latency for the capping check during the ad selection process. Consider how to handle sliding windows, data storage for billions of counters, and the trade-offs between strict consistency and system availability.
Redis
Kafka
gRPC
PostgreSQL
Lua Scripting
Sorted Sets
Load Balancer
Questions & Insights
Clarifying Questions
Scale & Performance: What is the expected scale in terms of Daily Active Users (DAU) and Peak QPS? (e.g., 500M DAU, 1M QPS).
Granularity: Are we capping at the Campaign level, Creative level, or Advertiser level?
Window Type: Is this a fixed window (e.g., calendar day) or a sliding window (e.g., last 24 hours)?
Strictness: Is it a "hard cap" (latency-sensitive blocking) or a "soft cap" (eventual consistency where a user might see an ad slightly more than the limit)?
Data Retention: How long do we need to store frequency data? (e.g., daily, weekly, or lifetime).
Assumptions for MVP:
Scale: 100M DAU, 500k Peak QPS.
Granularity: Primarily Campaign-level capping.
Window: Rolling 24-hour window (requires sliding window logic).
Strictness: Soft cap. Low latency (<10ms) is prioritized over 100% strictness.
Retention: Up to 30 days.
Thinking Process
Bottleneck: The main challenge is the high-frequency "Read-Check-Modify" cycle. Every ad request requires checking multiple counters (user-campaign, user-advertiser) before the auction starts.
Strategy Questions:
How do we minimize the latency of the "check" during the critical path of an ad auction?
How do we handle massive write volumes (impressions) without bottlenecking the database?
Which data structure optimizes for sliding window counters while staying memory-efficient?
How do we ensure the system scales horizontally as the number of campaigns and users grows?
Bonus Points
Probabilistic Data Structures: Using Heavy Keepers or Count-Min Sketches for high-volume, low-precision capping to save 90% memory for long-tail users.
Locality-Sensitive Hashing/Sharding: Sharding Redis by
user_id to ensure all frequency data for a specific user resides on the same node, enabling atomic multi-key checks.Write-Behind Pattern: Decoupling the "Increment" from the "Check" using a high-throughput messaging bus (Kafka) to ensure impression logging doesn't slow down the ad server.
Edge Capping: Implementing local cache TTLs on the Ad Server to avoid network calls for recently capped campaigns (negative caching).
Design Breakdown
Functional Requirements
Core Use Cases:
Check if a user has exceeded the frequency cap for a specific campaign before serving an ad.
Increment the count for a campaign when an impression occurs.
Support various time windows (Hourly, Daily, Weekly).
Scope Control:
In-Scope: Real-time capping logic, high-scale counter storage, async increment processing.
Out-of-Scope: Ad bidding logic, click-through rate (CTR) prediction, billing/budgeting systems.
Non-Functional Requirements
Scale: Must handle 500k+ QPS with billions of user-campaign combinations.
Latency: The frequency check must return in < 5ms to fit within the 100ms total ad-serving budget.
Availability: 99.99% availability. If the frequency service is down, fail-open (show the ad) rather than blocking the ad server.
Consistency: Eventual consistency is acceptable (latency of < 2 seconds for a count to propagate).
Fault Tolerance: Use sharded Redis with replication to handle node failures.
Estimation
Traffic: 500k QPS. Each request checks ~10 candidate campaigns = 5M checks/sec.
Storage: 100M users x 10 active campaign counters/user = 1B counters.
Memory: Each counter (Redis Sorted Set entry) is ~50 bytes. 1B x 50 bytes = 50 GB. With replication and overhead, ~150 GB RAM.
Bandwidth: 500k QPS * 1KB/req = 500 MB/s.
Blueprint
Concise Summary: A low-latency frequency capping service using a Redis-based sliding window for real-time checks and Kafka-based asynchronous updates to maintain high throughput.
Major Components:
Ad Server: Performs the candidate selection and queries the Cap Service for filtering.
Frequency Cap Cache (Redis): Stores user-campaign impression timestamps using Sorted Sets for sliding window accuracy.
Message Bus (Kafka): Buffers impression events to decouple writes from the ad delivery path.
Cap Updater Service: Consumes Kafka events and updates Redis counters.
Simplicity Audit: This design avoids complex stream processors (like Flink) for the MVP, using a simple consumer to update Redis, which handles the sliding window logic natively via
ZREMRANGEBYSCORE.Architecture Decision Rationale:
Redis is chosen for its sub-millisecond latency and built-in sorted set support for sliding windows.
Kafka ensures that even during traffic spikes, impression logging does not impact the latency of the ad server.
Functional Satisfaction: Meets the need for rolling 24-hour windows and campaign-level granularity.
Non-functional Satisfaction: High availability through Redis Sentinel/Cluster and high throughput via Kafka.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling
Ad Server: Stateless, scales on CPU/QPS. Deployed across multiple availability zones (AZs).
Cap Updater: Stateless consumer group. Scaled based on Kafka consumer lag.
API Schema Design
Internal Check (gRPC):
CheckCap(user_id, campaign_ids[]) returns Map<campaign_id, is_capped>.Idempotency: Kafka offsets ensure at-least-once delivery; Redis updates are idempotent (adding a unique timestamped member to a Sorted Set).
Resilience & Reliability
Fail-Open: If Redis is unreachable, the Cap Service returns "not capped" for all campaigns to ensure revenue isn't lost.
Timeouts: Aggressive 5ms timeout for Redis lookups.
Storage
Access Pattern
High Read/Write: Read-heavy on the Ad Server path, Write-heavy on the Impression path.
Database Table Design (Campaign Meta)
Campaigns Table (SQL):
id (PK), advertiser_id, cap_limit, window_seconds.Technical Selection
PostgreSQL: Stores the configuration of caps for campaigns. Cached in-memory by the Ad Server.
Distribution Logic
Campaign metadata is small enough to be fully cached in the memory of each Ad Server instance, refreshed every 60 seconds.
Cache
Purpose & Justification: Redis stores the real-time "state" of the user's history to allow for instant capping decisions.
Key-Value Schema:
Key:
fcap:{user_id}:{campaign_id}Data Structure: Sorted Set (ZSET).
Value:
impression_id (unique string).Score:
timestamp.Check Logic:
ZCOUNT key (now - window) now. If count >= limit, cap it.Update Logic:
ZADD key timestamp impression_id + ZREMRANGEBYSCORE key 0 (now - window) + EXPIRE key window.Technical Selection: Redis Cluster. Use User_ID hashing to shard data across nodes, ensuring all campaigns for one user are on the same shard for potential pipelining.
Failure Handling: Standard master-slave replication.
Messaging
Purpose & Decoupling: Decouples the user-facing ad-serving path from the counter increment logic.
Event Schema:
{"user_id": "u123", "campaign_id": "c456", "timestamp": 1700000000, "imp_id": "uuid"}.Throughput & Partitioning: Partition by
user_id to ensure sequential updates for the same user-campaign pair, preventing race conditions in the sorted set.Technical Selection: Kafka. High durability and replayability for debugging.
Data Processing
Processing Model: Simple stream consumption (Python/Go/Java) to read from Kafka and execute Redis commands.
Technical Selection: Custom worker service. Flink/Spark is overkill for simple counter increments.
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency. A user might see an ad a few extra times if the Kafka consumer lags. This is a standard trade-off in Ads to maintain low latency.
Reliability: Fail-open design ensures that a failure in the capping system doesn't crash the entire Ad Server.
Optimization (Memory): For long windows (e.g., 30 days), Sorted Sets can grow large.
Alternative*: If 100% accuracy isn't needed, use Sliding Window Log with Count-Min Sketch** for historical data and only use Redis for the last 24 hours.
Optimization (CPU): Use Redis Lua Scripts to combine
ZADD, ZCOUNT, and ZREMRANGEBYSCORE into a single atomic round trip to reduce network overhead.Security: The system is internal-only; however, the
impression_id must be cryptographically signed in the ad response to prevent "impression stuffing" (users manually hitting the URL to cap ads they don't like).