The Question
A/B Testing Platform Design
Design a scalable A/B testing system that manages up to 1,000 concurrent experiments and handles end-user allocation. The system must allow administrators to toggle experiments between active/inactive states and enable employees to query these states. A critical requirement is an automated daily audit report that calculates the total number of active/inactive experiments and tracks the frequency and details of all state transitions within a 24-hour window. Focus on how you would achieve low-latency allocation for millions of users while ensuring reliable audit logging for reporting.
PostgreSQL
MurmurHash
In-Memory Caching
REST
Batch Processing
CDC
RBAC
Questions & Insights
Clarifying Questions
Scale of End-Users: While there are 1,000 experiments, how many end-users are being allocated to these experiments? (Assumption: 10M DAU, requiring high-throughput allocation).
Allocation Logic: Is the allocation dynamic based on user attributes, or a simple percentage-based rollout? (Assumption: Support for both percentage rollouts and basic attribute targeting).
Reporting Depth: Does the "report" only cover experiment state changes, or does it also include performance metrics like click-through rates? (Assumption: Focus primarily on the requested state-change audit and experiment status counts, with extensibility for metrics).
Latency Requirements: What is the target latency for the allocation decision? (Assumption: Under 50ms for server-side or local evaluation via SDK).
Thinking Process
Deterministic Allocation: How can we assign millions of users to 1,000 experiments without storing every single user-to-experiment mapping? (Solution: Use deterministic hashing of user_id + experiment_id).
Configuration Distribution: How do we ensure the 1,000 experiments' "active" status is reflected across the fleet within seconds? (Solution: Use a distributed cache or a localized configuration sidecar).
Audit Persistence: How do we track every state change for the end-of-day report? (Solution: An "Experiment Audit Log" table that records every status transition with a timestamp).
Reporting Efficiency: How do we generate the daily summary without scanning billions of user events? (Solution: A scheduled batch job that aggregates metadata from the Audit Log and Experiment table).
Bonus Points
Sticky Bucketing: Implementing MurmurHash3 with a seed to ensure a user stays in the same bucket even if other experiments are added or removed.
Conflict Detection: Logic to prevent overlapping experiments on the same user segment (Mutually Exclusive Experiments) using "Layers" or "Domains".
CDC for Reporting: Using Change Data Capture (CDC) from the primary database to a data warehouse to ensure the reporting engine never impacts the production management performance.
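The "Layers" idea from the Conflict Detection point can be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the slot count of 100, the half-open slot ranges, and the names slot_for, experiment_in_layer, and ranges_overlap are all assumptions, and hashlib.sha256 is used as a standard-library stand-in for MurmurHash3.

```python
import hashlib
from typing import Dict, Optional, Tuple

# Hypothetical model: each layer has 100 slots, and every experiment in the
# layer owns a half-open slot range [lo, hi). A user hashes to exactly one
# slot per layer, so experiments with disjoint ranges never share a user.


def slot_for(user_id: str, layer_id: str, slots: int = 100) -> int:
    # sha256 is a stand-in here; the design itself names MurmurHash3.
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slots


def experiment_in_layer(
    user_id: str, layer_id: str, ranges: Dict[str, Tuple[int, int]]
) -> Optional[str]:
    """Return the single experiment in this layer that owns the user's slot."""
    slot = slot_for(user_id, layer_id)
    for experiment_id, (lo, hi) in ranges.items():
        if lo <= slot < hi:
            return experiment_id
    return None  # the user's slot is not assigned to any experiment


def ranges_overlap(ranges: Dict[str, Tuple[int, int]]) -> bool:
    """Conflict check to run before saving a layer's configuration."""
    spans = sorted(ranges.values())
    return any(prev_hi > lo for (_, prev_hi), (lo, _) in zip(spans, spans[1:]))
```

Running ranges_overlap when an admin edits a layer would reject a configuration such as {"x": (0, 60), "y": (50, 80)} before it reaches the allocation nodes.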
Design Breakdown
Functional Requirements
Core Use Cases:
Admins can Create, Read, Update, and Delete (CRUD) up to 1,000 experiments.
Admins can toggle experiments between Active and Non-Active.
The system allocates users to active experiments based on defined weights.
Employees can query the current state of any experiment.
Daily report generation: Total active/non-active, count of state changes, and specific transition details.
Scope Control:
In-scope: Experiment management, deterministic allocation logic, and state-change reporting.
Out-of-scope: Complex statistical analysis (p-values, confidence intervals), real-time user clickstream ingestion.
Non-Functional Requirements
Scale: Support 1,000 concurrent experiments and 10,000+ RPS for allocation queries (headroom over the estimated 2k-3k QPS peak).
Latency: Allocation decisions must be < 50ms.
Availability: 99.9% availability for the management UI; 99.99% for the allocation engine (must fail open or use cached defaults).
Consistency: Strong consistency for experiment management; eventual consistency (~seconds) for state updates reaching the allocation engine.
Security: RBAC for experiment management (who can toggle "Active").
Estimation
Traffic: 10M DAU. If each user triggers 5 allocation checks/day = 50M requests/day ≈ 600 QPS average (Peak 2k-3k QPS).
Storage (Metadata): 1,000 experiments * 10KB/exp ≈ 10MB. Very small.
Storage (Audit Logs): Assuming 1,000 experiments change state twice a day = 2,000 rows/day. Negligible storage.
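Bandwidth: A full 10MB config snapshot distributed to 100 app nodes = 1GB per sync. That is 1GB/day only if nodes pull the full snapshot once a day; polling every 30s (2,880 syncs/day) would need a version check or deltas to avoid roughly 2.9TB/day.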
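The arithmetic behind these estimates can be checked with a few lines. This is only a back-of-the-envelope sketch using the figures stated above (10M DAU, 5 checks per user, 10KB per experiment, 100 app nodes); the variable names are illustrative and sizes use decimal units.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Traffic
dau = 10_000_000
checks_per_user_per_day = 5
requests_per_day = dau * checks_per_user_per_day   # 50,000,000
avg_qps = requests_per_day / SECONDS_PER_DAY       # roughly 579

# Storage
experiments = 1_000
bytes_per_experiment = 10 * 1_000                  # 10KB
config_bytes = experiments * bytes_per_experiment  # 10MB
audit_rows_per_day = experiments * 2               # two state changes each

# Bandwidth for one full snapshot pushed to every node
app_nodes = 100
bytes_per_full_sync = config_bytes * app_nodes     # 1GB

print(f"requests/day: {requests_per_day:,}")
print(f"average QPS:  {avg_qps:.0f}")
print(f"config size:  {config_bytes / 1e6:.0f} MB")
print(f"audit rows:   {audit_rows_per_day:,} per day")
print(f"full sync:    {bytes_per_full_sync / 1e9:.0f} GB across {app_nodes} nodes")
```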
Blueprint
Concise Summary: A management service allows admins to define experiment metadata in a relational database. An allocation service (or SDK) uses deterministic hashing to assign users to experiments locally, while a daily batch job processes audit logs for reporting.
Major Components:
Experiment Service: Handles CRUD operations and stores experiment definitions.
Metadata DB (PostgreSQL): Stores experiment state and a dedicated audit table for state changes.
Allocation Engine: Lightweight service (or client SDK) that evaluates if a user is in an experiment using hash-based bucketing.
Reporting Job: A daily worker that aggregates state transitions from the audit logs.
Simplicity Audit: This design avoids complex event-streaming platforms (Kafka) because the scale of experiment state changes (1,000 experiments) is low enough for a simple relational database audit table.
Architecture Decision Rationale:
Why this?: Deterministic hashing eliminates the need for a massive "UserAssignment" table, making the system horizontally scalable and low-latency.
Functional Satisfaction: Covers management, allocation, and the specific "state change" reporting requirements.
Non-functional Satisfaction: High availability via stateless allocation nodes and low latency via in-memory caching of the 1,000 experiment configs.
High Level Architecture
Sub-system Deep Dive
Service
Experiment Management Service:
Topology: Stateless REST service deployed in Multi-AZ.
API Schema:
POST /experiments: Create experiment (Name, Description, Status, Traffic%).
PATCH /experiments/{id}/status: Toggle Active/Non-active.
GET /experiments: List all experiments for employees.
Idempotency: Use client_request_id for state changes to prevent double-toggling.
Allocation Service:
Logic:
Bucket = MurmurHash3(user_id + experiment_id) % 100.
If Bucket < experiment.traffic_percent and experiment.status == 'ACTIVE', user is in the "Treatment" group.
Sync: Pulls the full list of 1,000 experiments from the DB every 30 seconds and stores them in an in-memory hash map.
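The bucketing rule above could look roughly like this in Python. It is a sketch under assumptions: murmur3_32 is a pure-Python version of the 32-bit MurmurHash3 variant (a production service would more likely use a native library), the ":" separator between user_id and experiment_id is an added assumption to avoid ambiguous concatenations such as "1" + "23" versus "12" + "3", and the dictionary shape of an experiment is illustrative.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Body: mix each 4-byte little-endian block into the state.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[4 * n_blocks:]
    if tail:
        # Tail: mix the remaining 1 to 3 bytes.
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Map a (user, experiment) pair to a stable bucket in [0, 100)."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    return murmur3_32(key) % 100


def is_in_treatment(user_id: str, experiment: dict) -> bool:
    """Apply the rule: active experiment and bucket below traffic_percent."""
    if experiment["status"] != "ACTIVE":
        return False
    return bucket_for(user_id, experiment["id"]) < experiment["traffic_percent"]
```

Because the bucket depends only on the user and the experiment, no assignment needs to be stored, and raising traffic_percent from 10 to 20 keeps every user who was already in the treatment group.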
Storage
Access Pattern: Heavy read for allocation config (cached), low write for admin management. High reliability required for audit logs.
Database Table Design:
Table: experiments
id (UUID, PK)
name (String)
status (Enum: ACTIVE, INACTIVE)
traffic_percent (Int)
updated_at (Timestamp)
Table: experiment_audit_logs
id (BigInt, PK)
experiment_id (FK)
old_status (Enum)
new_status (Enum)
changed_by (UserID)
created_at (Timestamp, Index for Reporting)
Technical Selection: PostgreSQL.
Rationale: Handles relational integrity for 1,000 records effortlessly. ACID compliance is critical for ensuring the audit log is written whenever a status changes.
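One way to get the "audit row whenever status changes" guarantee is to write both rows in a single transaction. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL so that it runs without a server; the column types are SQLite approximations of the schema above, and the set_status function name is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE experiments (
    id              TEXT PRIMARY KEY,
    name            TEXT NOT NULL,
    status          TEXT NOT NULL CHECK (status IN ('ACTIVE', 'INACTIVE')),
    traffic_percent INTEGER NOT NULL,
    updated_at      TEXT NOT NULL
);
CREATE TABLE experiment_audit_logs (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id TEXT NOT NULL REFERENCES experiments(id),
    old_status    TEXT NOT NULL,
    new_status    TEXT NOT NULL,
    changed_by    TEXT NOT NULL,
    created_at    TEXT NOT NULL
);
CREATE INDEX idx_audit_created_at ON experiment_audit_logs (created_at);
"""


def set_status(conn: sqlite3.Connection, experiment_id: str,
               new_status: str, changed_by: str) -> bool:
    """Change status and write the audit row atomically.

    Returns False (and writes nothing) when the experiment is already in
    the requested state, so repeating the same request is harmless.
    """
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT status FROM experiments WHERE id = ?", (experiment_id,)
        ).fetchone()
        if row is None:
            raise KeyError(experiment_id)
        old_status = row[0]
        if old_status == new_status:
            return False
        conn.execute(
            "UPDATE experiments SET status = ?, updated_at = ? WHERE id = ?",
            (new_status, now, experiment_id),
        )
        conn.execute(
            "INSERT INTO experiment_audit_logs "
            "(experiment_id, old_status, new_status, changed_by, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (experiment_id, old_status, new_status, changed_by, now),
        )
    return True
```

If the audit insert fails, the status update is rolled back with it, so the report can never miss a transition that actually took effect.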
Cache
Purpose: To avoid DB hits for every user allocation request.
Implementation: Local In-Memory Cache (Guava or Caffeine).
Refresh Strategy: Poll the DB every 30-60s. Since there are only 1,000 experiments, the fields the allocator actually needs (id, status, traffic_percent) total well under 1MB, making frequent polling very cheap.
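A polling cache of this kind might be sketched as below. The text names Guava or Caffeine (Java libraries); this Python version only illustrates the refresh behavior. The ExperimentCache class and the fetch_all callable are hypothetical, with fetch_all standing in for whatever query loads all experiment configs from the DB.

```python
import threading
from typing import Callable, Dict, Optional


class ExperimentCache:
    """Holds the latest config snapshot and refreshes it on a timer."""

    def __init__(self, fetch_all: Callable[[], Dict[str, dict]],
                 interval_seconds: float = 30.0) -> None:
        self._fetch_all = fetch_all
        self._interval = interval_seconds
        self._snapshot: Dict[str, dict] = {}
        self._stop = threading.Event()

    def refresh_once(self) -> bool:
        """Load a fresh snapshot; keep the old one if the load fails."""
        try:
            fresh = self._fetch_all()
        except Exception:
            return False  # DB unavailable: keep serving last known config
        self._snapshot = fresh  # swap the whole reference in one step
        return True

    def get(self, experiment_id: str) -> Optional[dict]:
        return self._snapshot.get(experiment_id)

    def start(self) -> threading.Thread:
        """Do an initial load, then poll in a daemon thread."""
        def loop() -> None:
            # wait() returns False on timeout, True once stop() is called
            while not self._stop.wait(self._interval):
                self.refresh_once()

        self.refresh_once()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

Readers always see either the old or the new snapshot in full, never a half-updated map, because the refresh replaces the dictionary rather than mutating it in place.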
Data Processing
Reporting Model: Batch Processing.
Processing DAG:
Step 1: Scan experiment_audit_logs where created_at is between T-24h and T.
Step 2: Aggregate count of transitions per experiment_id.
Step 3: Count total Active vs Inactive from experiments table.
Step 4: Format into PDF/CSV and send via Email/Slack.
Technical Selection: Simple Cron Job (Python/Go).
Rationale: The data volume (1,000 experiments) does not justify Spark or Flink.
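Steps 1 to 3 of the job fit in a single short function. The sketch below again uses sqlite3 as a stand-in for PostgreSQL and assumes created_at is stored as a UTC ISO-8601 string, so string comparison matches time order; build_daily_report and the keys of the returned dictionary are illustrative names. Step 4 (formatting and delivery) is left out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def build_daily_report(conn: sqlite3.Connection, end: datetime) -> dict:
    """Summarize experiment states and the transitions in [end - 24h, end)."""
    start = end - timedelta(hours=24)
    window = (start.isoformat(), end.isoformat())

    # Step 3: current totals per status from the experiments table.
    status_counts = dict(conn.execute(
        "SELECT status, COUNT(*) FROM experiments GROUP BY status"
    ).fetchall())

    # Step 2: number of transitions per experiment inside the window.
    per_experiment = dict(conn.execute(
        "SELECT experiment_id, COUNT(*) FROM experiment_audit_logs "
        "WHERE created_at >= ? AND created_at < ? GROUP BY experiment_id",
        window,
    ).fetchall())

    # Step 1: the individual transitions, oldest first.
    transitions = conn.execute(
        "SELECT experiment_id, old_status, new_status, changed_by, created_at "
        "FROM experiment_audit_logs "
        "WHERE created_at >= ? AND created_at < ? ORDER BY created_at",
        window,
    ).fetchall()

    return {
        "window_start": window[0],
        "window_end": window[1],
        "active": status_counts.get("ACTIVE", 0),
        "inactive": status_counts.get("INACTIVE", 0),
        "total_transitions": len(transitions),
        "transitions_per_experiment": per_experiment,
        "transitions": transitions,
    }
```

A cron entry would call this once a day with end set to the scheduled run time, for example datetime.now(timezone.utc) truncated to midnight.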
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for allocation. When an admin toggles an experiment, it might take ~30s for all allocation nodes to pick up the change. This is acceptable for A/B testing.
Reliability: If the DB goes down, the Allocation Service continues to use its last known cached config (Safe Failover).
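Scalability: The 1,000 experiment limit is a "soft" limit. Deterministic hashing is O(1) per user-experiment check and needs no per-user storage, so allocation cost does not grow with the number of users. Scaling to 100,000 experiments mainly grows the config payload (about 1GB at 10KB/exp), so the sync would need to ship only allocation fields or deltas.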
Security: Employee access to query experiments is governed by standard OIDC/IAM integration at the API Gateway.
Optimization: To handle 10x scale in user traffic, move the allocation logic into a Client-Side SDK. The SDK downloads the 1,000 experiment configs, refreshes them periodically to pick up toggles, and performs the hashing on the user's device, so servers only distribute config instead of answering every allocation request.