The Question
Design

A/B Testing Platform Design

Design a scalable A/B testing system that manages up to 1,000 concurrent experiments and handles end-user allocation. The system must allow administrators to toggle experiments between active/inactive states and enable employees to query these states. A critical requirement is an automated daily audit report that calculates the total number of active/inactive experiments and tracks the frequency and details of all state transitions within a 24-hour window. Focus on how you would achieve low-latency allocation for millions of users while ensuring reliable audit logging for reporting.
PostgreSQL
MurmurHash
In-Memory Caching
REST
Batch Processing
CDC
RBAC
Questions & Insights

Clarifying Questions

Scale of End-Users: While there are 1,000 experiments, how many end-users are being allocated to these experiments? (Assumption: 10M DAU, requiring high-throughput allocation).
Allocation Logic: Is the allocation dynamic based on user attributes, or a simple percentage-based rollout? (Assumption: Support for both percentage rollouts and basic attribute targeting).
Reporting Depth: Does the "report" only cover experiment state changes, or does it also include performance metrics like click-through rates? (Assumption: Focus primarily on the requested state-change audit and experiment status counts, with extensibility for metrics).
Latency Requirements: What is the target latency for the allocation decision? (Assumption: Under 50ms for server-side or local evaluation via SDK).

Thinking Process

Deterministic Allocation: How can we assign millions of users to 1,000 experiments without storing every single user-to-experiment mapping? (Solution: Use deterministic hashing of user_id + experiment_id).
Configuration Distribution: How do we ensure a change to any of the 1,000 experiments' "active" status reaches the whole fleet within seconds? (Solution: Use a distributed cache or a localized configuration sidecar).
Audit Persistence: How do we track every state change for the end-of-day report? (Solution: An "Experiment Audit Log" table that records every status transition with a timestamp).
Reporting Efficiency: How do we generate the daily summary without scanning billions of user events? (Solution: A scheduled batch job that aggregates metadata from the Audit Log and Experiment table).

Bonus Points

Sticky Bucketing: Implementing MurmurHash3 with a seed to ensure a user stays in the same bucket even if other experiments are added or removed.
Conflict Detection: Logic to prevent overlapping experiments on the same user segment (Mutually Exclusive Experiments) using "Layers" or "Domains".
CDC for Reporting: Using Change Data Capture (CDC) from the primary database to a data warehouse to ensure the reporting engine never impacts the production management performance.
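
One way to picture the "Layers" idea from Conflict Detection: each layer divides users into 100 slots, and the experiments in that layer own disjoint slot ranges, so a user can match at most one of them. The sketch below is a hypothetical illustration only. LayerSlot, layer_bucket, validate_layer, and the slot-range representation are assumptions, not part of the design above, and SHA-256 from the standard library stands in for MurmurHash3 to keep the example dependency-free.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass

SLOTS_PER_LAYER = 100  # assumption: percentage-granularity slots


@dataclass(frozen=True)
class LayerSlot:
    """Hypothetical record: an experiment owns slots [start, end) in a layer."""
    experiment_id: str
    start: int  # inclusive
    end: int    # exclusive


def layer_bucket(user_id: str, layer_id: str) -> int:
    """Map a user to one slot of a layer (SHA-256 used as a stand-in hash)."""
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SLOTS_PER_LAYER


def validate_layer(slots: list[LayerSlot]) -> None:
    """Raise ValueError if any range is malformed or two ranges overlap."""
    ordered = sorted(slots, key=lambda s: s.start)
    for slot in ordered:
        if not 0 <= slot.start < slot.end <= SLOTS_PER_LAYER:
            raise ValueError(f"bad slot range for {slot.experiment_id}")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            raise ValueError(f"{prev.experiment_id} overlaps {cur.experiment_id}")


def experiment_for_user(user_id: str, layer_id: str,
                        slots: list[LayerSlot]) -> str | None:
    """Return the single experiment in this layer that owns the user's slot."""
    bucket = layer_bucket(user_id, layer_id)
    for slot in slots:
        if slot.start <= bucket < slot.end:
            return slot.experiment_id
    return None  # the user's slot is unassigned in this layer
```

Because validate_layer rejects overlapping ranges when the layer is configured, mutual exclusion is enforced at write time and the allocation path needs no extra checks.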
Design Breakdown

Functional Requirements

Core Use Cases:
Admins can Create, Read, Update, and Delete (CRUD) up to 1000 experiments.
Admins can toggle experiments between Active and Non-Active.
The system allocates users to active experiments based on defined weights.
Employees can query the current state of any experiment.
Daily report generation: Total active/non-active, count of state changes, and specific transition details.
Scope Control:
In-scope: Experiment management, deterministic allocation logic, and state-change reporting.
Out-of-scope: Complex statistical analysis (p-values, confidence intervals), real-time user clickstream ingestion.

Non-Functional Requirements

Scale: Support 1,000 concurrent experiments and 10,000+ RPS for allocation queries.
Latency: Allocation decisions must be < 50ms.
Availability: 99.9% availability for the management UI; 99.99% for the allocation engine (must fail open or use cached defaults).
Consistency: Strong consistency for experiment management; eventual consistency (~seconds) for state updates reaching the allocation engine.
Security: RBAC for experiment management (who can toggle "Active").

Estimation

Traffic: 10M DAU. If each user triggers 5 allocation checks/day = 50M requests/day ≈ 600 QPS average (Peak 2k-3k QPS).
Storage (Metadata): 1,000 experiments * 10KB/exp ≈ 10MB. Very small.
Storage (Audit Logs): Assuming 1,000 experiments change state twice a day = 2,000 rows/day. Negligible storage.
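Bandwidth: A 10MB config payload distributed to 100 app nodes = 1GB per full fleet-wide sync. That equals 1GB of daily traffic only if each node downloads the full payload about once a day, so periodic polls should skip the download when nothing has changed.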
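
The arithmetic in this section can be reproduced in a few lines of Python. The inputs are the figures stated above plus the 30-second polling interval used later in the design; decimal units (1KB = 1,000 bytes) and the variable names are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Traffic
dau = 10_000_000
checks_per_user_per_day = 5
requests_per_day = dau * checks_per_user_per_day
average_qps = requests_per_day / SECONDS_PER_DAY

# Storage
experiments = 1_000
config_bytes = experiments * 10 * 1_000      # 10KB per experiment
audit_rows_per_day = experiments * 2         # two state changes per experiment

# Config sync
app_nodes = 100
bytes_per_fleet_sync = config_bytes * app_nodes
polls_per_day = SECONDS_PER_DAY // 30        # one poll every 30 seconds
worst_case_bytes_per_day = bytes_per_fleet_sync * polls_per_day

print(f"requests/day:          {requests_per_day:,}")
print(f"average QPS:           {average_qps:.0f}")
print(f"config payload:        {config_bytes / 1e6:.0f} MB")
print(f"audit rows/day:        {audit_rows_per_day:,}")
print(f"one fleet-wide sync:   {bytes_per_fleet_sync / 1e9:.0f} GB")
print(f"full payload per poll: {worst_case_bytes_per_day / 1e12:.2f} TB/day")
```

The average works out to about 579 QPS, which the estimate rounds up to 600. The last line shows why polling should not re-download an unchanged payload: 2,880 polls a day at 1GB per fleet-wide sync would be 2.88TB.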

Blueprint

Concise Summary: A management service allows admins to define experiment metadata in a relational database. An allocation service (or SDK) uses deterministic hashing to assign users to experiments locally, while a daily batch job processes audit logs for reporting.
Major Components:
Experiment Service: Handles CRUD operations and stores experiment definitions.
Metadata DB (PostgreSQL): Stores experiment state and a dedicated audit table for state changes.
Allocation Engine: Lightweight service (or client SDK) that evaluates if a user is in an experiment using hash-based bucketing.
Reporting Job: A daily worker that aggregates state transitions from the audit logs.
Simplicity Audit: This design avoids complex event-streaming platforms (Kafka) because the scale of experiment state changes (1,000 experiments) is low enough for a simple relational database audit table.
Architecture Decision Rationale:
Why this?: Deterministic hashing eliminates the need for a massive "UserAssignment" table, making the system horizontally scalable and low-latency.
Functional Satisfaction: Covers management, allocation, and the specific "state change" reporting requirements.
Non-functional Satisfaction: High availability via stateless allocation nodes and low latency via in-memory caching of the 1,000 experiment configs.

High Level Architecture

Sub-system Deep Dive

Service

Experiment Management Service:
Topology: Stateless REST service deployed in Multi-AZ.
API Schema:
POST /experiments: Create experiment (Name, Description, Status, Traffic%).
PATCH /experiments/{id}/status: Toggle Active/Non-active.
GET /experiments: List all experiments for employees.
Idempotency: Use client_request_id for state changes to prevent double-toggling.
Allocation Service:
Logic: Bucket = MurmurHash3(user_id + experiment_id) % 100.
If Bucket < experiment.traffic_percent and experiment.status == 'ACTIVE', user is in the "Treatment" group.
Sync: Pulls the full list of 1,000 experiments from the DB every 30 seconds and stores them in an in-memory hash map.
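
The bucketing rule above can be sketched as follows. This is a minimal illustration, not a production implementation: MurmurHash3 (32-bit x86 variant) is written out in pure Python so the example has no third-party dependency, where real code would more likely use a library binding. The ":" delimiter between user_id and experiment_id, the dict shape of an experiment, and the function names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, 32-bit x86 variant, returning an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    full = len(data) - (len(data) % 4)
    # Body: mix each complete 4-byte little-endian block.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the 0-3 leftover bytes (an empty tail contributes nothing).
    k = int.from_bytes(data[full:], "little")
    k = (k * c1) & MASK32
    k = _rotl32(k, 15)
    k = (k * c2) & MASK32
    h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket_for(user_id: str, experiment_id: str) -> int:
    """Deterministic bucket in [0, 100); the ':' delimiter is an assumption."""
    return murmur3_32(f"{user_id}:{experiment_id}".encode("utf-8")) % 100


def is_in_treatment(user_id: str, experiment: dict) -> bool:
    """Apply the rule: ACTIVE status and bucket below traffic_percent."""
    if experiment["status"] != "ACTIVE":
        return False
    return bucket_for(user_id, experiment["id"]) < experiment["traffic_percent"]
```

Because the bucket depends only on the user and the experiment, raising traffic_percent from 10 to 20 keeps every user who was already in treatment and adds those in buckets 10 through 19, with no assignment table to update.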

Storage

Access Pattern: Heavy read for allocation config (cached), low write for admin management. High reliability required for audit logs.
Database Table Design:
Table: experiments
id (UUID, PK)
name (String)
status (Enum: ACTIVE, INACTIVE)
traffic_percent (Int)
updated_at (Timestamp)
Table: experiment_audit_logs
id (BigInt, PK)
experiment_id (FK)
old_status (Enum)
new_status (Enum)
changed_by (UserID)
created_at (Timestamp, Index for Reporting)
Technical Selection: PostgreSQL.
Rationale: Handles relational integrity for 1,000 records effortlessly. ACID compliance is critical: updating the status and inserting the audit row in a single transaction ensures that neither can be committed without the other.
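
A minimal sketch of the two tables and a transactional status change follows. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs without a server; the design itself targets PostgreSQL, where the column types and locking (for example SELECT ... FOR UPDATE) would differ. The set_status function and the CHECK constraints are assumptions added for illustration.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE experiments (
    id              TEXT PRIMARY KEY,
    name            TEXT NOT NULL,
    status          TEXT NOT NULL CHECK (status IN ('ACTIVE', 'INACTIVE')),
    traffic_percent INTEGER NOT NULL CHECK (traffic_percent BETWEEN 0 AND 100),
    updated_at      TEXT NOT NULL
);
CREATE TABLE experiment_audit_logs (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    experiment_id TEXT NOT NULL REFERENCES experiments(id),
    old_status    TEXT NOT NULL,
    new_status    TEXT NOT NULL,
    changed_by    TEXT NOT NULL,
    created_at    TEXT NOT NULL
);
CREATE INDEX idx_audit_created_at ON experiment_audit_logs (created_at);
"""


def set_status(conn: sqlite3.Connection, experiment_id: str,
               new_status: str, changed_by: str) -> bool:
    """Set the status and write the audit row atomically.

    Returns False (and writes nothing) if the status is already new_status,
    so repeating the same request does not create duplicate transitions.
    """
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT status FROM experiments WHERE id = ?", (experiment_id,)
        ).fetchone()
        if row is None:
            raise KeyError(experiment_id)
        old_status = row[0]
        if old_status == new_status:
            return False
        conn.execute(
            "UPDATE experiments SET status = ?, updated_at = ? WHERE id = ?",
            (new_status, now, experiment_id),
        )
        conn.execute(
            "INSERT INTO experiment_audit_logs "
            "(experiment_id, old_status, new_status, changed_by, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (experiment_id, old_status, new_status, changed_by, now),
        )
    return True
```

Setting a target status rather than flipping the current one makes the operation naturally safe to retry, which complements the client_request_id approach described for the API.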

Cache

Purpose: To avoid DB hits for every user allocation request.
Implementation: Local In-Memory Cache (Guava or Caffeine).
Refresh Strategy: Poll the DB every 30-60s. The allocation engine only needs each experiment's id, status, and traffic_percent, so for 1,000 experiments the polled payload is well under 1MB, making frequent polling very cheap.
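
The polling cache can be sketched in a few lines. The original names Guava or Caffeine (Java); the version below is a hypothetical Python equivalent using only the standard library, with ExperimentConfigCache, the loader callback, and the dict-based snapshot all being assumptions. On a failed refresh it keeps serving the last snapshot it loaded successfully.

```python
import threading
from typing import Callable, Dict, Optional

Snapshot = Dict[str, dict]  # experiment_id -> config fields


class ExperimentConfigCache:
    """In-memory snapshot of experiment configs, refreshed by polling."""

    def __init__(self, loader: Callable[[], Snapshot],
                 refresh_seconds: float = 30.0) -> None:
        self._loader = loader              # e.g. a function that queries the DB
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._snapshot: Snapshot = {}

    def refresh(self) -> bool:
        """Reload the snapshot; on failure keep the previous one."""
        try:
            fresh = self._loader()
        except Exception:
            return False                   # last known config stays in place
        with self._lock:
            self._snapshot = fresh         # swap the whole map at once
        return True

    def get(self, experiment_id: str) -> Optional[dict]:
        """Return the cached config, or None if the experiment is unknown."""
        with self._lock:
            return self._snapshot.get(experiment_id)

    def run_forever(self, stop: threading.Event) -> None:
        """Poll until stop is set; intended to run in a background thread."""
        while not stop.is_set():
            self.refresh()
            stop.wait(self._refresh_seconds)
```

Replacing the whole map on each refresh, rather than mutating entries in place, means a reader never sees a half-updated configuration.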

Data Processing

Reporting Model: Batch Processing.
Processing DAG:
Step 1: Scan experiment_audit_logs where created_at is between T-24h and T.
Step 2: Aggregate count of transitions per experiment_id.
Step 3: Count total Active vs Inactive from experiments table.
Step 4: Format into PDF/CSV and send via Email/Slack.
Technical Selection: Simple Cron Job (Python/Go).
Rationale: The data volume (1,000 experiments) does not justify Spark or Flink.
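
Steps 1 to 3 of the DAG above can be sketched as a single pure function that the cron job would call after loading the rows. This is an illustration under assumptions: AuditRow and build_daily_report are hypothetical names, the window is treated as half-open [T-24h, T), and rendering to PDF/CSV and delivery (Step 4) are left out.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AuditRow:
    """One row of experiment_audit_logs."""
    experiment_id: str
    old_status: str
    new_status: str
    changed_by: str
    created_at: datetime


def build_daily_report(statuses: dict[str, str], audit_rows: list[AuditRow],
                       window_end: datetime) -> dict:
    """Aggregate current status counts and the last 24 hours of transitions.

    statuses maps experiment_id to its current status (from experiments).
    """
    window_start = window_end - timedelta(hours=24)
    # Step 1: keep only transitions inside [T-24h, T).
    in_window = sorted(
        (r for r in audit_rows if window_start <= r.created_at < window_end),
        key=lambda r: r.created_at,
    )
    # Step 2: transitions per experiment.
    per_experiment = Counter(r.experiment_id for r in in_window)
    # Step 3: current active vs inactive totals.
    status_counts = Counter(statuses.values())
    return {
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "active": status_counts.get("ACTIVE", 0),
        "inactive": status_counts.get("INACTIVE", 0),
        "total_transitions": len(in_window),
        "transitions_per_experiment": dict(per_experiment),
        "transitions": [
            {
                "experiment_id": r.experiment_id,
                "from": r.old_status,
                "to": r.new_status,
                "changed_by": r.changed_by,
                "at": r.created_at.isoformat(),
            }
            for r in in_window
        ],
    }
```

At roughly 2,000 audit rows a day, loading the window into memory is cheap; the index on created_at keeps the Step 1 query from scanning older history.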
Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for allocation. When an admin toggles an experiment, it might take ~30s for all allocation nodes to pick up the change. This is acceptable for A/B testing.
Reliability: If the DB goes down, the Allocation Service continues to use its last known cached config (Safe Failover).
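Scalability: The 1,000 experiment limit is a "soft" limit. Each hash-based allocation check is O(1) and needs no per-user storage, so allocation cost does not grow with the number of users. Scaling to 100,000 experiments mainly increases the config each node must sync and cache (about 1GB at 10KB per experiment), so at that size nodes would fetch only changed or relevant experiments.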
Security: Employee access to query experiments is governed by standard OIDC/IAM integration at the API Gateway.
Optimization: To handle 10x scale in user traffic, move the allocation logic into a Client-Side SDK. The SDK downloads the 1,000 experiment configs and performs the hashing on the user's device, so the servers no longer handle per-request allocation calls and only serve config downloads.