The Question

A/B Testing Platform Design

Design a scalable A/B testing system that manages up to 1,000 concurrent experiments and handles end-user allocation. The system must allow administrators to toggle experiments between active/inactive states and enable employees to query these states. A critical requirement is an automated daily audit report that calculates the total number of active/inactive experiments and tracks the frequency and details of all state transitions within a 24-hour window. Focus on how you would achieve low-latency allocation for millions of users while ensuring reliable audit logging for reporting.

PostgreSQL

MurmurHash

In-Memory Caching

REST

Batch Processing

CDC

RBAC

Questions & Insights

Clarifying Questions

Scale of End-Users: While there are 1,000 experiments, how many end-users are being allocated to these experiments? (Assumption: 10M DAU, requiring high-throughput allocation).

Allocation Logic: Is the allocation dynamic based on user attributes, or a simple percentage-based rollout? (Assumption: Support for both percentage rollouts and basic attribute targeting).

Reporting Depth: Does the "report" only cover experiment state changes, or does it also include performance metrics like click-through rates? (Assumption: Focus primarily on the requested state-change audit and experiment status counts, with extensibility for metrics).

Latency Requirements: What is the target latency for the allocation decision? (Assumption: Under 50ms for server-side or local evaluation via SDK).

Thinking Process

Deterministic Allocation: How can we assign millions of users to 1,000 experiments without storing every single user-to-experiment mapping? (Solution: Use deterministic hashing of user_id + experiment_id).

Configuration Distribution: How do we ensure a change to an experiment's "active" status reaches the whole fleet within seconds? (Solution: Use a distributed cache or a localized configuration sidecar).

Audit Persistence: How do we track every state change for the end-of-day report? (Solution: An "Experiment Audit Log" table that records every status transition with a timestamp).

Reporting Efficiency: How do we generate the daily summary without scanning billions of user events? (Solution: A scheduled batch job that aggregates metadata from the Audit Log and Experiment table).

Bonus Points

Sticky Bucketing: Implementing MurmurHash3 with a seed to ensure a user stays in the same bucket even if other experiments are added or removed.

Conflict Detection: Logic to prevent overlapping experiments on the same user segment (Mutually Exclusive Experiments) using "Layers" or "Domains".

CDC for Reporting: Using Change Data Capture (CDC) from the primary database to a data warehouse to ensure the reporting engine never impacts the production management performance.
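
The "Layers" idea from Conflict Detection can be pictured as follows: each layer owns 100 buckets, a user hashes into exactly one bucket per layer, and experiments in the same layer claim disjoint bucket ranges. This is a minimal sketch under assumptions: the names Layer, claim, and experiment_for are hypothetical, and SHA-256 from the standard library stands in for MurmurHash3.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


def bucket_for(user_id: str, salt: str, buckets: int = 100) -> int:
    """Map (salt, user) to a stable bucket. SHA-256 is a stand-in hash here."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


@dataclass
class Layer:
    """Hypothetical layer: experiments inside it never share a bucket."""
    name: str
    ranges: Dict[str, range] = field(default_factory=dict)

    def claim(self, experiment_id: str, start: int, stop: int) -> None:
        new = range(start, stop)
        for other_id, taken in self.ranges.items():
            # Two half-open ranges overlap when each starts before the other ends.
            if new.start < taken.stop and taken.start < new.stop:
                raise ValueError(f"{experiment_id} overlaps {other_id}")
        self.ranges[experiment_id] = new

    def experiment_for(self, user_id: str) -> Optional[str]:
        # The layer name salts the hash, so layers bucket users independently.
        bucket = bucket_for(user_id, self.name)
        for experiment_id, taken in self.ranges.items():
            if bucket in taken:
                return experiment_id
        return None  # bucket not claimed by any experiment in this layer


checkout = Layer("checkout")
checkout.claim("new_button", 0, 50)   # buckets 0-49
checkout.claim("new_copy", 50, 80)    # buckets 50-79, 80-99 stay unassigned
print(checkout.experiment_for("user-42"))
```

Because a user lands in one bucket per layer, two experiments in the same layer can never both apply to that user, while experiments in different layers remain independent.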

Design Breakdown

Functional Requirements

Core Use Cases:

Admins can Create, Read, Update, and Delete (CRUD) up to 1,000 experiments.

Admins can toggle experiments between Active and Inactive.

The system allocates users to active experiments based on defined weights.

Employees can query the current state of any experiment.

Daily report generation: Total active/inactive experiments, count of state changes, and specific transition details.

Scope Control:

In-scope: Experiment management, deterministic allocation logic, and state-change reporting.

Out-of-scope: Complex statistical analysis (p-values, confidence intervals), real-time user clickstream ingestion.

Non-Functional Requirements

Scale: Support 1,000 concurrent experiments and 10,000+ RPS for allocation queries, leaving headroom over the estimated 2k-3k QPS peak.

Latency: Allocation decisions must be < 50ms.

Availability: 99.9% availability for the management UI; 99.99% for the allocation engine (must fail open or use cached defaults).

Consistency: Strong consistency for experiment management; eventual consistency (~seconds) for state updates reaching the allocation engine.

Security: RBAC for experiment management (who can toggle "Active").

Estimation

Traffic: 10M DAU. If each user triggers 5 allocation checks/day = 50M requests/day ≈ 600 QPS average (Peak 2k-3k QPS).

Storage (Metadata): 1,000 experiments * 10KB/exp ≈ 10MB. Very small.

Storage (Audit Logs): Assuming 1,000 experiments change state twice a day = 2,000 rows/day. Negligible storage.

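Bandwidth: A full 10MB config sync to 100 app nodes = 1GB per fleet-wide refresh. At a 30-second polling interval (2,880 refreshes/day) that would be roughly 2.9TB/day, so in practice nodes would fetch only the fields needed for allocation, or only the experiments changed since the last poll.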
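
The estimates above can be reproduced with a few lines of arithmetic. This sketch only restates the document's assumptions (10M DAU, 5 checks per user, 10KB per experiment, 100 app nodes, a 30-second poll interval) in decimal units; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Traffic
DAU = 10_000_000
CHECKS_PER_USER_PER_DAY = 5
requests_per_day = DAU * CHECKS_PER_USER_PER_DAY
average_qps = requests_per_day / SECONDS_PER_DAY

# Metadata storage (decimal units: 1KB = 1,000 bytes)
EXPERIMENTS = 1_000
CONFIG_BYTES_PER_EXPERIMENT = 10_000
metadata_bytes = EXPERIMENTS * CONFIG_BYTES_PER_EXPERIMENT

# Audit log growth, assuming two state changes per experiment per day
audit_rows_per_day = EXPERIMENTS * 2

# Config sync, assuming every node pulls the full payload on every poll
APP_NODES = 100
POLL_INTERVAL_SECONDS = 30
bytes_per_refresh = metadata_bytes * APP_NODES
refreshes_per_day = SECONDS_PER_DAY // POLL_INTERVAL_SECONDS
bytes_per_day_full_polling = bytes_per_refresh * refreshes_per_day

print(f"requests/day:        {requests_per_day:,}")
print(f"average QPS:         {average_qps:.0f}")
print(f"metadata size:       {metadata_bytes / 1e6:.0f} MB")
print(f"audit rows/day:      {audit_rows_per_day:,}")
print(f"per fleet refresh:   {bytes_per_refresh / 1e9:.0f} GB")
print(f"full polling, daily: {bytes_per_day_full_polling / 1e12:.2f} TB")
```

The last figure is the reason a full-payload poll every 30 seconds is worth avoiding: the per-refresh cost is small, but it repeats 2,880 times a day.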

Blueprint

Concise Summary: A management service allows admins to define experiment metadata in a relational database. An allocation service (or SDK) uses deterministic hashing to assign users to experiments locally, while a daily batch job processes audit logs for reporting.

Major Components:

Experiment Service: Handles CRUD operations and stores experiment definitions.

Metadata DB (PostgreSQL): Stores experiment state and a dedicated audit table for state changes.

Allocation Engine: Lightweight service (or client SDK) that evaluates if a user is in an experiment using hash-based bucketing.

Reporting Job: A daily worker that aggregates state transitions from the audit logs.

Simplicity Audit: This design avoids complex event-streaming platforms (Kafka) because the scale of experiment state changes (1,000 experiments) is low enough for a simple relational database audit table.

Architecture Decision Rationale:

Why this?: Deterministic hashing eliminates the need for a massive "UserAssignment" table, making the system horizontally scalable and low-latency.

Functional Satisfaction: Covers management, allocation, and the specific "state change" reporting requirements.

Non-functional Satisfaction: High availability via stateless allocation nodes and low latency via in-memory caching of the 1,000 experiment configs.

High Level Architecture

Sub-system Deep Dive

Service

Experiment Management Service:

Topology: Stateless REST service deployed in Multi-AZ.

API Schema:

POST /experiments: Create experiment (Name, Description, Status, Traffic%).

PATCH /experiments/{id}/status: Set the status to ACTIVE or INACTIVE.

GET /experiments: List all experiments for employees.

Idempotency: Use client_request_id for state changes to prevent double-toggling.
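
One way the client_request_id check could work is to remember the result of each request and replay it on a retry. The sketch below is an in-memory illustration only; ExperimentStore, ToggleResult, and set_status are hypothetical names, and a real service would persist the request IDs alongside the experiment data.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToggleResult:
    experiment_id: str
    status: str
    changed: bool


@dataclass
class ExperimentStore:
    """Hypothetical in-memory stand-in for the Experiment Service's storage."""
    statuses: Dict[str, str] = field(default_factory=dict)
    seen_requests: Dict[str, ToggleResult] = field(default_factory=dict)

    def set_status(self, experiment_id: str, new_status: str,
                   client_request_id: str) -> ToggleResult:
        if new_status not in ("ACTIVE", "INACTIVE"):
            raise ValueError(f"unknown status: {new_status}")
        # A retried request returns the stored result instead of re-applying.
        if client_request_id in self.seen_requests:
            return self.seen_requests[client_request_id]
        old_status = self.statuses[experiment_id]  # KeyError if unknown
        self.statuses[experiment_id] = new_status
        result = ToggleResult(experiment_id, new_status,
                              changed=(old_status != new_status))
        self.seen_requests[client_request_id] = result
        return result


store = ExperimentStore(statuses={"exp-1": "INACTIVE"})
first = store.set_status("exp-1", "ACTIVE", client_request_id="req-123")
retry = store.set_status("exp-1", "ACTIVE", client_request_id="req-123")
print(first.changed, retry is first)
```

Sending the target status (rather than a bare "flip" command) also helps: repeating "set to ACTIVE" cannot accidentally turn the experiment back off.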

Allocation Service:

Logic: Bucket = MurmurHash3(user_id + experiment_id) % 100.

If Bucket < experiment.traffic_percent and experiment.status == 'ACTIVE', the user is in the "Treatment" group; otherwise the user gets the default (control) experience.

Sync: Pulls the full list of 1,000 experiments from the DB every 30 seconds and stores them in an in-memory Hashmap.
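
The bucketing rule above can be sketched in a few lines. To stay dependency-free, this example includes a pure-Python port of the public MurmurHash3 x86 32-bit algorithm; production code would more likely use a vetted library. Two details are assumptions rather than part of the design: the ":" separator between user_id and experiment_id (added so that different ID pairs cannot concatenate to the same string) and the dictionary shape used for an experiment.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit), for illustration only."""
    mask = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def mix_k(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    h = seed & mask
    full = len(data) - len(data) % 4
    for i in range(0, full, 4):             # 4-byte little-endian blocks
        h ^= mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask
    if full < len(data):                    # 1 to 3 trailing bytes
        h ^= mix_k(int.from_bytes(data[full:], "little"))

    h ^= len(data)                          # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def bucket(user_id: str, experiment_id: str) -> int:
    """Stable bucket in [0, 100) for this user in this experiment."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    return murmur3_32(key) % 100


def in_treatment(user_id: str, experiment: dict) -> bool:
    """True when the experiment is active and the user's bucket is in range."""
    if experiment["status"] != "ACTIVE":
        return False
    return bucket(user_id, experiment["id"]) < experiment["traffic_percent"]


experiment = {"id": "exp-42", "status": "ACTIVE", "traffic_percent": 30}
print(in_treatment("user-1001", experiment))
```

Because the experiment ID is part of the hash input, a user's bucket in one experiment is unaffected by other experiments being added or removed, and no per-user assignment needs to be stored.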

Storage

Access Pattern: Heavy read for allocation config (cached), low write for admin management. High reliability required for audit logs.

Database Table Design:

Table: experiments

id (UUID, PK)

name (String)

status (Enum: ACTIVE, INACTIVE)

traffic_percent (Int)

updated_at (Timestamp)

Table: experiment_audit_logs

id (BigInt, PK)

experiment_id (FK)

old_status (Enum)

new_status (Enum)

changed_by (UserID)

created_at (Timestamp, Index for Reporting)

Technical Selection: PostgreSQL.

Rationale: Handles relational integrity for 1,000 records effortlessly. ACID compliance is critical for ensuring the audit log is written whenever a status changes.
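
To illustrate why a single transaction matters here, the sketch below updates the experiment row and appends the audit row together, so neither can be committed without the other. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; the column types are SQLite approximations of the table design above, and set_status is a hypothetical helper.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE experiments (
    id              TEXT PRIMARY KEY,
    name            TEXT NOT NULL,
    status          TEXT NOT NULL CHECK (status IN ('ACTIVE', 'INACTIVE')),
    traffic_percent INTEGER NOT NULL CHECK (traffic_percent BETWEEN 0 AND 100),
    updated_at      TEXT NOT NULL
);
CREATE TABLE experiment_audit_logs (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id TEXT NOT NULL REFERENCES experiments(id),
    old_status    TEXT NOT NULL,
    new_status    TEXT NOT NULL,
    changed_by    TEXT NOT NULL,
    created_at    TEXT NOT NULL
);
CREATE INDEX idx_audit_created_at ON experiment_audit_logs (created_at);
"""


def set_status(conn: sqlite3.Connection, experiment_id: str,
               new_status: str, changed_by: str) -> bool:
    """Change status and write the audit row atomically. Returns True if changed."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back if any statement raises
        row = conn.execute(
            "SELECT status FROM experiments WHERE id = ?", (experiment_id,)
        ).fetchone()
        if row is None:
            raise KeyError(experiment_id)
        old_status = row[0]
        if old_status == new_status:
            return False  # no transition, so nothing to audit
        conn.execute(
            "UPDATE experiments SET status = ?, updated_at = ? WHERE id = ?",
            (new_status, now, experiment_id),
        )
        conn.execute(
            "INSERT INTO experiment_audit_logs "
            "(experiment_id, old_status, new_status, changed_by, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (experiment_id, old_status, new_status, changed_by, now),
        )
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO experiments VALUES (?, ?, ?, ?, ?)",
    ("exp-1", "New checkout", "INACTIVE", 20, "2024-01-01T00:00:00+00:00"),
)
conn.commit()
changed = set_status(conn, "exp-1", "ACTIVE", changed_by="alice")
print(changed)
```

With concurrent admins on PostgreSQL, the initial read would typically also lock the row (for example with SELECT ... FOR UPDATE) so that old_status in the audit row is accurate.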

Cache

Purpose: To avoid DB hits for every user allocation request.

Implementation: Local In-Memory Cache (Guava or Caffeine).

Refresh Strategy: Poll the DB every 30-60s. With only 1,000 experiments, the fields needed for allocation (id, status, traffic_percent) total well under 1MB, making frequent polling very cheap.
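
A polling cache of this kind can be sketched as below. Guava and Caffeine are Java libraries; to keep one language across examples this sketch uses a plain Python dict instead, and the class name ExperimentConfigCache and its loader callback are hypothetical. On a failed refresh it keeps serving the last snapshot it loaded successfully.

```python
import threading
from typing import Callable, Dict, Optional


class ExperimentConfigCache:
    """Holds the last successfully loaded experiment configs in memory."""

    def __init__(self, loader: Callable[[], Dict[str, dict]],
                 interval_seconds: float = 30.0):
        self._loader = loader              # e.g. a function that queries the DB
        self._interval = interval_seconds
        self._snapshot: Dict[str, dict] = {}
        self._stop = threading.Event()

    def refresh(self) -> bool:
        """Reload all configs; on failure keep serving the previous snapshot."""
        try:
            fresh = self._loader()
        except Exception:
            return False
        # Swapping one reference means readers never see a half-built map.
        self._snapshot = fresh
        return True

    def get(self, experiment_id: str) -> Optional[dict]:
        return self._snapshot.get(experiment_id)

    def start(self) -> threading.Thread:
        """Refresh in the background until stop() is called."""
        def loop() -> None:
            # wait() returns False on timeout and True once stop() is called.
            while not self._stop.wait(self._interval):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()


# Demo loader that fails on its second call, simulating a DB outage.
calls = {"count": 0}


def flaky_loader() -> Dict[str, dict]:
    calls["count"] += 1
    if calls["count"] == 2:
        raise ConnectionError("database unavailable")
    return {"exp-1": {"status": "ACTIVE", "traffic_percent": 20}}


cache = ExperimentConfigCache(flaky_loader)
first_ok = cache.refresh()    # loads the snapshot
second_ok = cache.refresh()   # fails, previous snapshot is kept
print(first_ok, second_ok, cache.get("exp-1"))
```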

Data Processing

Reporting Model: Batch Processing.

Processing DAG:

Step 1: Scan experiment_audit_logs where created_at is between T-24h and T.

Step 2: Aggregate count of transitions per experiment_id.

Step 3: Count total Active vs Inactive from experiments table.

Step 4: Format into PDF/CSV and send via Email/Slack.

Technical Selection: Simple Cron Job (Python/Go).

Rationale: The data volume (1,000 experiments) does not justify Spark or Flink.
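
Steps 1 to 3 of the DAG fit in one small function. The sketch below again uses sqlite3 as a stand-in for PostgreSQL, with trimmed-down tables and made-up sample rows; build_daily_report is a hypothetical name, and it assumes created_at is stored as a UTC ISO-8601 string with second precision so that string comparison matches time order. Formatting and delivery (Step 4) are left out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def build_daily_report(conn: sqlite3.Connection, report_end: datetime) -> dict:
    """Summarize the 24 hours before report_end (start inclusive, end exclusive)."""
    window_start = (report_end - timedelta(hours=24)).isoformat(timespec="seconds")
    window_end = report_end.isoformat(timespec="seconds")

    # Step 1: scan the audit log for the window.
    transitions = conn.execute(
        "SELECT experiment_id, old_status, new_status, changed_by, created_at "
        "FROM experiment_audit_logs "
        "WHERE created_at >= ? AND created_at < ? ORDER BY created_at",
        (window_start, window_end),
    ).fetchall()

    # Step 2: count transitions per experiment.
    per_experiment: dict = {}
    for experiment_id, *_ in transitions:
        per_experiment[experiment_id] = per_experiment.get(experiment_id, 0) + 1

    # Step 3: current Active vs Inactive totals.
    status_counts = dict(
        conn.execute("SELECT status, COUNT(*) FROM experiments GROUP BY status")
    )

    return {
        "window": (window_start, window_end),
        "status_counts": status_counts,
        "total_transitions": len(transitions),
        "transitions_per_experiment": per_experiment,
        "transition_details": transitions,
    }


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE experiment_audit_logs (
    id INTEGER PRIMARY KEY, experiment_id TEXT, old_status TEXT,
    new_status TEXT, changed_by TEXT, created_at TEXT);
""")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?)",
    [("exp-1", "ACTIVE"), ("exp-2", "INACTIVE"), ("exp-3", "ACTIVE")],
)
conn.executemany(
    "INSERT INTO experiment_audit_logs "
    "(experiment_id, old_status, new_status, changed_by, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("exp-1", "INACTIVE", "ACTIVE", "alice", "2024-01-01T09:00:00+00:00"),
        ("exp-1", "ACTIVE", "INACTIVE", "bob", "2024-01-01T15:00:00+00:00"),
        ("exp-1", "INACTIVE", "ACTIVE", "alice", "2024-01-01T18:00:00+00:00"),
        ("exp-2", "ACTIVE", "INACTIVE", "bob", "2023-12-30T10:00:00+00:00"),
    ],
)
report = build_daily_report(conn, datetime(2024, 1, 2, tzinfo=timezone.utc))
print(report["status_counts"], report["transitions_per_experiment"])
```

Note that the status totals reflect the moment the job runs, while the transition counts cover the whole window; the index on created_at keeps the Step 1 scan cheap as the audit table grows.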

Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for allocation. When an admin toggles an experiment, it might take ~30s for all allocation nodes to pick up the change. This is acceptable for A/B testing.

Reliability: If the DB goes down, the Allocation Service continues to use its last known cached config (Safe Failover).

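Scalability: The 1,000 experiment limit is a "soft" limit. Each hash evaluation is O(1) and no per-user state is stored, so user traffic scales horizontally. Growing to 100,000 experiments mainly increases the cached config size and the number of experiments evaluated per request, which stays manageable if each request checks only the experiments relevant to it.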

Security: Employee access to query experiments is governed by standard OIDC/IAM integration at the API Gateway.

Optimization: To handle 10x scale in user traffic, move the allocation logic into a Client-Side SDK. The SDK downloads the 1,000 experiment configs, refreshes them periodically, and performs the hashing on the user's device, so the servers only serve config downloads instead of per-request allocation.