The Question

Scalable A/B Testing and Experimentation Platform

Design an experimentation system capable of supporting millions of concurrent users. The system should allow product managers to define experiments with specific targeting rules (geo, device, etc.) and enable developers to retrieve variation assignments with sub-millisecond latency. Address the challenges of bucket consistency (ensuring a user stays in their assigned group), mutually exclusive experiments, and high-volume event ingestion for real-time result analysis. Explain your choice of evaluation logic (client-side vs server-side) and how you handle the massive scale of conversion event data.

PostgreSQL

Redis

Kafka

ClickHouse

MurmurHash3

Flink

CDC

gRPC

Questions & Insights

Clarifying Questions

What is the scale of the system? (e.g., Number of DAU, total number of concurrent experiments, and expected QPS for the evaluation endpoint).

What is the latency budget for the evaluation? (Does the client call the server every time, or is it a local evaluation via an SDK?).

What are the targeting requirements? (Simple user ID hashing vs. complex attributes like geography, device type, or custom user segments).

How do we handle mutually exclusive experiments? (Can a user be in multiple experiments at once, or do we need "Layers" to prevent interference?).

Is the analysis of results (e.g., P-values, confidence intervals) in-scope?

Assumptions for this design:

Scale: 100M DAU, 10k QPS peak for evaluation (assuming SDK-side caching or periodic fetching).

Latency: Evaluation must be < 50ms if server-side; < 1ms if local SDK evaluation.

Targeting: Support for User ID, Device, and Geo.

Consistency: A user must see the same variation for the duration of the experiment (Sticky Bucketing).

MVP Focus: Centralized management, deterministic hashing for evaluation, and raw event collection.

Thinking Process

Core Bottleneck: Minimizing the impact of the "Experiment Call" on application performance while ensuring assignment consistency.

Key Questions for the flow:

How do we map a User ID to a bucket without storing every single mapping? (Answer: Deterministic Hashing).

How do we ensure two experiments don't overlap if they modify the same UI element? (Answer: Experiment Layers/Domains).

How do we propagate configuration changes quickly across a global fleet? (Answer: Pub/Sub + Local Caching).

How do we collect results without overwhelming the primary database? (Answer: Async Event Streaming).

Bonus Points

Deterministic Hashing with Salt: Using hash(UserId + ExperimentId + Salt) % 100 to ensure uniform distribution and avoid "lucky" user clusters across experiments.

Configuration Sidecars: Deploying the evaluation logic as a sidecar or a local library (SDK) that pulls a "manifest" file, eliminating the network hop for every feature flag check.

Bayesian vs. Frequentist Analysis: Support for real-time Bayesian updating for early "winning" detection to stop underperforming experiments faster (Multi-armed Bandit).

Override/QA Mechanics: Specialized headers or cookie-based overrides to allow developers to force themselves into specific variations for testing.

Design Breakdown

Functional Requirements

Core Use Cases:

PMs/Devs can create, start, and stop experiments via a UI.

Applications can query for a user's variation (Control vs. Treatment).

Track events (clicks, conversions) associated with specific experiment buckets.

Scope Control:

In-scope: Experiment management, variation assignment, and event ingestion.

Out-of-scope: Complex automated statistical analysis (MVP will provide raw data for SQL queries).

Non-Functional Requirements

Scale: Support for 100M+ users and thousands of active experiments.

Latency: Evaluation must not block the critical path of the host application.

Availability: 99.99% availability. If the service is down, fall back to "Control" (fail-open).

Consistency: High "Sticky" consistency for assigned variations.

Security: RBAC for experiment management; obfuscation of experiment names in public APIs.

Estimation

Traffic: 100M DAU. If users fetch config once per session, and average 3 sessions/day: 300M requests/day ≈ 3,500 Average QPS. Peak QPS ≈ 10,000.

Storage:

Metadata: 1,000 active experiments * 10KB/exp = 10MB (Tiny).

Events: 100M users * 10 events/day = 1B events/day. At 200 bytes per event ≈ 200GB/day.

Bandwidth: 10k QPS * 1KB response ≈ 10MB/s (Manageable).

Blueprint

Concise Summary: A centralized Management Service for experiment definitions, a high-performance Evaluation API (or SDK) using deterministic hashing, and a Kafka-based ingestion pipeline for tracking metrics.

Major Components:

Management UI/Service: Administrative interface for defining experiment segments, weights, and layers.

Evaluation Service: Stateless API that calculates variation assignments based on user context and experiment manifests.

Experiment Store (Postgres): Source of truth for experiment configurations and targeting rules.

Config Cache (Redis): Low-latency storage for "Active" experiment manifests to be used by the Evaluation Service.

Event Ingestor (Kafka): High-throughput buffer for capturing exposure and conversion events for later analysis.

Simplicity Audit: We use deterministic hashing instead of a "Mapping Table" (User <-> Variation) to save Petabytes of storage and enable O(1) evaluation.

Architecture Decision Rationale:

Why this architecture?: Decoupling management from evaluation ensures that even if the Admin UI is down, experiments continue to run.

Functional Satisfaction: Meets all lifecycle needs from creation to bucketing.

Non-functional Satisfaction: Redis and deterministic logic ensure sub-millisecond assignment, while Kafka handles the massive write load of event tracking.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Evaluation configs (manifests) are cached at the Edge or via CDN for SDK-based clients to fetch periodically (TTL 1-5 mins).

Security: API Gateway handles JWT validation for the Admin Dashboard and rate-limiting for the Evaluation API to prevent DDoS.

Service

Topology & Scaling: Evaluation Service is stateless and deployed across multiple Availability Zones (AZs). It scales horizontally based on CPU/Request count.

API Schema Design:

GET /v1/assignments?user_id=123&context={geo:US}

POST /v1/events (JSON payload with user_id, experiment_id, variation_id, event_type).

Resilience: If the Evaluation Service fails, the client SDK uses local "default" values (Control). Evaluation logic uses MurmurHash3 for fast, uniform distribution.

Storage

Access Pattern:

Experiment DB: High read (by Management Service), Low write (Admin updates).

Analytics DB: High write (Append-only), High read (Complex aggregations).

Database Table Design:

Experiments: id (PK), name, status (DRAFT/RUNNING/PAUSED), layer_id, start_time, end_time.

Variations: id (PK), experiment_id, name, weight (0-10000), config_json.

Technical Selection:

PostgreSQL: For experiment metadata (Relational integrity needed).

ClickHouse: For analytics (Columnar storage is superior for aggregating billions of rows for P-value calculation).

Cache

Purpose: Store the "Global Manifest" (a serialized list of all currently active experiments and their targeting rules).

Key-Value Schema: key: active_experiments_v1, value: List<ExperimentDefinition>.

Technical Selection: Redis. Used as a read-through cache for the Evaluation Service to avoid hitting Postgres on every request.

Messaging

Purpose: Decouple the logging of "Exposure Events" (User X saw Variation Y) from the user's request path.

Event Schema: timestamp, user_id, experiment_id, variation_id, event_name, properties.

Technical Selection: Kafka. Essential for handling the 1B+ events/day volume. Partitions are keyed by user_id to ensure chronological ordering of events for a single user.

Data Processing

Processing Model: Mini-batch processing.

Processing DAG: Consumes from Kafka -> De-duplicates events -> Aggregates (Sum of clicks / Sum of users) -> Writes to ClickHouse.

Technical Selection: Flink or a simple Kafka Streams app. It handles the "Join" between Exposure events and Conversion events.

Wrap Up

Advanced Topics

Trade-offs: We choose Deterministic Hashing over Database Lookups. Sacrifice: Harder to manually move a single user to a specific bucket without adding "Override" logic. Gain: Infinite scalability.

Reliability: Use a "Kill Switch" for every experiment. If an experiment causes a 500-error spike, the Management Service updates the Redis manifest to set weights to 0% for Treatment, propagating in seconds.

Bottleneck Analysis: The "Hot Key" in Redis for the global manifest. Solution: Local in-memory cache within the Evaluation Service instances (refreshed every 30s) to reduce Redis pressure.

Security: PII (User IDs) should be hashed/anonymized before being stored in the Analytics DB to comply with GDPR/CCPA.

Advanced Optimization: Layering (Overlapping Experiments). Divide the 0-100% traffic range into "Slots" (e.g., 1000 slots). Experiments in the same "Layer" share the same slots and are mutually exclusive. Experiments in different "Layers" use independent hashing to ensure orthogonality.