The Question

Real-time Ad Frequency Capping System

Design a high-scale system to enforce frequency caps for digital advertisements. A frequency cap limits the number of times a specific user sees a specific ad or campaign within a set time window (e.g., no more than 3 views per 24 hours). The system must handle 500,000+ QPS with sub-10ms latency for the capping check during the ad selection process. Consider how to handle sliding windows, data storage for billions of counters, and the trade-offs between strict consistency and system availability.

Redis

Kafka

gRPC

PostgreSQL

Lua Scripting

Sorted Sets

Load Balancer

Questions & Insights

Clarifying Questions

Scale & Performance: What is the expected scale in terms of Daily Active Users (DAU) and Peak QPS? (e.g., 500M DAU, 1M QPS).

Granularity: Are we capping at the Campaign level, Creative level, or Advertiser level?

Window Type: Is this a fixed window (e.g., calendar day) or a sliding window (e.g., last 24 hours)?

Strictness: Is it a "hard cap" (latency-sensitive blocking) or a "soft cap" (eventual consistency where a user might see an ad slightly more than the limit)?

Data Retention: How long do we need to store frequency data? (e.g., daily, weekly, or lifetime).

Assumptions for MVP:

Scale: 100M DAU, 500k Peak QPS.

Granularity: Primarily Campaign-level capping.

Window: Rolling 24-hour window (requires sliding window logic).

Strictness: Soft cap. Low latency (<10ms) is prioritized over 100% strictness.

Retention: Up to 30 days.

Thinking Process

Bottleneck: The main challenge is the high-frequency "Read-Check-Modify" cycle. Every ad request requires checking multiple counters (user-campaign, user-advertiser) before the auction starts.

Strategy Questions:

How do we minimize the latency of the "check" during the critical path of an ad auction?

How do we handle massive write volumes (impressions) without bottlenecking the database?

Which data structure optimizes for sliding window counters while staying memory-efficient?

How do we ensure the system scales horizontally as the number of campaigns and users grows?

Bonus Points

Probabilistic Data Structures: Using Heavy Keepers or Count-Min Sketches for high-volume, low-precision capping to save 90% memory for long-tail users.

Locality-Sensitive Hashing/Sharding: Sharding Redis by user_id to ensure all frequency data for a specific user resides on the same node, enabling atomic multi-key checks.

Write-Behind Pattern: Decoupling the "Increment" from the "Check" using a high-throughput messaging bus (Kafka) to ensure impression logging doesn't slow down the ad server.

Edge Capping: Implementing local cache TTLs on the Ad Server to avoid network calls for recently capped campaigns (negative caching).

Design Breakdown

Functional Requirements

Core Use Cases:

Check if a user has exceeded the frequency cap for a specific campaign before serving an ad.

Increment the count for a campaign when an impression occurs.

Support various time windows (Hourly, Daily, Weekly).

Scope Control:

In-Scope: Real-time capping logic, high-scale counter storage, async increment processing.

Out-of-Scope: Ad bidding logic, click-through rate (CTR) prediction, billing/budgeting systems.

Non-Functional Requirements

Scale: Must handle 500k+ QPS with billions of user-campaign combinations.

Latency: The frequency check must return in < 5ms to fit within the 100ms total ad-serving budget.

Availability: 99.99% availability. If the frequency service is down, fail-open (show the ad) rather than blocking the ad server.

Consistency: Eventual consistency is acceptable (latency of < 2 seconds for a count to propagate).

Fault Tolerance: Use sharded Redis with replication to handle node failures.

Estimation

Traffic: 500k QPS. Each request checks ~10 candidate campaigns = 5M checks/sec.

Storage: 100M users x 10 active campaign counters/user = 1B counters.

Memory: Each counter (Redis Sorted Set entry) is ~50 bytes. 1B x 50 bytes = 50 GB. With replication and overhead, ~150 GB RAM.

Bandwidth: 500k QPS * 1KB/req = 500 MB/s.

Blueprint

Concise Summary: A low-latency frequency capping service using a Redis-based sliding window for real-time checks and Kafka-based asynchronous updates to maintain high throughput.

Major Components:

Ad Server: Performs the candidate selection and queries the Cap Service for filtering.

Frequency Cap Cache (Redis): Stores user-campaign impression timestamps using Sorted Sets for sliding window accuracy.

Message Bus (Kafka): Buffers impression events to decouple writes from the ad delivery path.

Cap Updater Service: Consumes Kafka events and updates Redis counters.

Simplicity Audit: This design avoids complex stream processors (like Flink) for the MVP, using a simple consumer to update Redis, which handles the sliding window logic natively via ZREMRANGEBYSCORE.

Architecture Decision Rationale:

Redis is chosen for its sub-millisecond latency and built-in sorted set support for sliding windows.

Kafka ensures that even during traffic spikes, impression logging does not impact the latency of the ad server.

Functional Satisfaction: Meets the need for rolling 24-hour windows and campaign-level granularity.

Non-functional Satisfaction: High availability through Redis Sentinel/Cluster and high throughput via Kafka.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling

Ad Server: Stateless, scales on CPU/QPS. Deployed across multiple availability zones (AZs).

Cap Updater: Stateless consumer group. Scaled based on Kafka consumer lag.

API Schema Design

Internal Check (gRPC): CheckCap(user_id, campaign_ids[]) returns Map<campaign_id, is_capped>.

Idempotency: Kafka offsets ensure at-least-once delivery; Redis updates are idempotent (adding a unique timestamped member to a Sorted Set).

Resilience & Reliability

Fail-Open: If Redis is unreachable, the Cap Service returns "not capped" for all campaigns to ensure revenue isn't lost.

Timeouts: Aggressive 5ms timeout for Redis lookups.

Storage

Access Pattern

High Read/Write: Read-heavy on the Ad Server path, Write-heavy on the Impression path.

Database Table Design (Campaign Meta)

Campaigns Table (SQL): id (PK), advertiser_id, cap_limit, window_seconds.

Technical Selection

PostgreSQL: Stores the configuration of caps for campaigns. Cached in-memory by the Ad Server.

Distribution Logic

Campaign metadata is small enough to be fully cached in the memory of each Ad Server instance, refreshed every 60 seconds.

Cache

Purpose & Justification: Redis stores the real-time "state" of the user's history to allow for instant capping decisions.

Key-Value Schema:

Key: fcap:{user_id}:{campaign_id}

Data Structure: Sorted Set (ZSET).

Value: impression_id (unique string).

Score: timestamp.

Check Logic: ZCOUNT key (now - window) now. If count >= limit, cap it.

Update Logic: ZADD key timestamp impression_id + ZREMRANGEBYSCORE key 0 (now - window) + EXPIRE key window.

Technical Selection: Redis Cluster. Use User_ID hashing to shard data across nodes, ensuring all campaigns for one user are on the same shard for potential pipelining.

Failure Handling: Standard master-slave replication.

Messaging

Purpose & Decoupling: Decouples the user-facing ad-serving path from the counter increment logic.

Event Schema: {"user_id": "u123", "campaign_id": "c456", "timestamp": 1700000000, "imp_id": "uuid"}.

Throughput & Partitioning: Partition by user_id to ensure sequential updates for the same user-campaign pair, preventing race conditions in the sorted set.

Technical Selection: Kafka. High durability and replayability for debugging.

Data Processing

Processing Model: Simple stream consumption (Python/Go/Java) to read from Kafka and execute Redis commands.

Technical Selection: Custom worker service. Flink/Spark is overkill for simple counter increments.

Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency. A user might see an ad a few extra times if the Kafka consumer lags. This is a standard trade-off in Ads to maintain low latency.

Reliability: Fail-open design ensures that a failure in the capping system doesn't crash the entire Ad Server.

Optimization (Memory): For long windows (e.g., 30 days), Sorted Sets can grow large.

Alternative*: If 100% accuracy isn't needed, use Sliding Window Log with Count-Min Sketch** for historical data and only use Redis for the last 24 hours.

Optimization (CPU): Use Redis Lua Scripts to combine ZADD, ZCOUNT, and ZREMRANGEBYSCORE into a single atomic round trip to reduce network overhead.

Security: The system is internal-only; however, the impression_id must be cryptographically signed in the ad response to prevent "impression stuffing" (users manually hitting the URL to cap ads they don't like).