The Question
Design A/B Testing Platform
Design an A/B testing and experimentation platform for a large-scale product. The system should support experiment configuration, traffic splitting and user assignment, real-time event tracking, and statistical significance analysis to drive data-informed product decisions.
PostgreSQL
Redis
REST API
Questions & Insights
Thinking Process
The core challenge is balancing high-frequency experiment evaluation (allocation) with the need for a precise audit trail for state changes to satisfy reporting requirements.
Question 1: How can we ensure 1,000 experiments are evaluated with sub-millisecond latency for millions of users?
Answer: Use deterministic hashing on the client side or at the edge. Instead of storing assignments in a DB, use hash(user_id + experiment_id) % 100 to determine the bucket, based on the configuration stored in a distributed cache.
Question 2: How do we track state changes precisely for the daily report?
Answer: Implement an audit/event log table. Every time an admin toggles an experiment, a record is appended to a change-log table. This allows us to count transitions and derive the current state without complex diffing.
Question 3: How do we ensure the SDK/Evaluation logic has the latest configuration without DDOSing the database?
Answer: Use a push-style configuration update (via Redis Pub/Sub or long polling) or a simple TTL-based pull from a global cache.
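The deterministic hashing from Question 1 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names and the config shape (a `variants` list of name/percentage pairs and a lowercase `status` string) are assumptions made for the example. It also swaps in SHA-256 for the generic `hash(...)`, because Python's built-in `hash()` for strings is randomized per process and would not give stable assignments across machines.

```python
import hashlib
from typing import Optional


def bucket(user_id: str, experiment_id: str, buckets: int = 100) -> int:
    """Map a (user, experiment) pair to a stable bucket in [0, buckets)."""
    # The separator avoids ambiguous concatenations such as "ab"+"c" vs "a"+"bc".
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(user_id: str, experiment: dict) -> Optional[str]:
    """Return the variant name, or None if the user is not in the experiment."""
    if experiment["status"] != "active":
        return None
    b = bucket(user_id, experiment["id"])
    upper = 0
    # Walk cumulative percentage ranges, e.g. [0, 50) control, [50, 100) treatment.
    for name, percent in experiment["variants"]:
        upper += percent
        if b < upper:
            return name
    return None  # bucket lies outside the allocated traffic
```

Because the result depends only on the two IDs and the cached config, any SDK instance or edge node computes the same assignment without a database lookup.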
Bonus Points
Layered Targeting Engine: Implement a "Constraint-Based Targeting" model where experiments are evaluated in a specific order or hierarchy (Mutually Exclusive Groups) to prevent interaction effects between overlapping experiments.
Read-Optimized State Snapshotting: For the reporting worker, instead of scanning the entire history, maintain a "Daily State" table that is updated transactionally with the experiment metadata to make report generation O(1) for count queries.
Edge Distribution: Distribute the experiment configuration to the CDN edge (e.g., Cloudflare Workers) to move the "Allocation" logic as close to the user as possible, removing the round trip to the origin from the assignment path.
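One way to picture the mutually exclusive groups from the first bonus point is a "layer" that owns the bucket space and hands out non-overlapping bucket ranges to its experiments. The sketch below is an assumption about how such a layer might look; the `layer` dictionary, its `slots` field, and the function names are hypothetical and do not come from the design above.

```python
import hashlib
from typing import Optional


def layer_bucket(user_id: str, layer_id: str, buckets: int = 100) -> int:
    """Hash on the layer ID so every experiment in the layer sees the same bucket."""
    digest = hashlib.sha256(f"{layer_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def experiment_for_user(user_id: str, layer: dict) -> Optional[str]:
    """Return the single experiment in this layer that the user belongs to, if any."""
    b = layer_bucket(user_id, layer["id"])
    # Slots are half-open bucket ranges [start, end) that must not overlap.
    for experiment_id, start, end in layer["slots"]:
        if start <= b < end:
            return experiment_id
    return None  # user is in the layer's unallocated traffic
```

Since a user maps to exactly one bucket per layer and the ranges do not overlap, two experiments in the same layer can never both apply to the same user.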
Design Breakdown
Functional Requirements
Admins can Create/Update/Delete up to 1,000 experiments.
Admins can toggle experiments between Active and Inactive.
Internal staff can query the current state of experiments.
System must track every status change (Active -> Inactive and vice versa).
System must generate a daily report:
Current Active vs. Inactive counts.
Number of experiments that changed state.
Total frequency of changes per experiment.
Non-Functional Requirements
High Availability: Experiment evaluation must not fail if the Admin DB is down.
Low Latency: Assignment logic should not add more than 10-20ms to the user request.
Consistency: The daily report must be 100% accurate (Audit integrity).
Scalability: While only 1,000 experiments exist, the number of "Evaluation Requests" could scale to millions per second.
Estimation
Experiment Metadata: 1,000 experiments * 2KB/config = 2MB (Fits easily in any cache/RAM).
State Changes: Even if every experiment changes status 10 times a day, that is only 10,000 events per day.
Storage: 10,000 events * 100 bytes = 1MB/day. Postgres is more than sufficient.
Read Traffic: Internal queries (staff) are low (10-100 QPS). Evaluation traffic (users) can reach 10^5 QPS or more.
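The storage figures above can be checked with a few lines of arithmetic. The per-year number is an extrapolation added here for illustration, and decimal units (1 KB = 1,000 bytes) are assumed.

```python
experiments = 1_000
config_bytes = 2 * 1_000                      # 2 KB per experiment config
metadata_bytes = experiments * config_bytes   # 2,000,000 bytes = 2 MB

changes_per_experiment_per_day = 10
events_per_day = experiments * changes_per_experiment_per_day  # 10,000 events

event_bytes = 100
log_bytes_per_day = events_per_day * event_bytes   # 1,000,000 bytes = 1 MB
log_bytes_per_year = log_bytes_per_day * 365       # 365,000,000 bytes = 365 MB
```

Even a full year of audit history stays well under a gigabyte, which supports the choice of a single Postgres instance.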
Blueprint
Concise Summary: A central Admin Service manages experiment configurations in a relational database while capturing all state transitions in an Audit Log. These configurations are synchronized to a Redis cache for high-speed access by SDKs or internal query tools.
Major Components:
Admin Service: Provides a REST API for experiment CRUD and status toggling.
Postgres DB: Acts as the source of truth for metadata and the append-only log of state changes.
Redis: Serves as a low-latency read cache for current experiment configurations.
Report Worker: A scheduled process that aggregates the audit logs to generate the daily summary.
Simplicity Audit: This architecture avoids complex stream processing (like Kafka/Flink) because the volume of state changes (for 1,000 experiments) is trivial. Standard relational triggers or service-level logging are sufficient.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Admin Service: A stateless Spring Boot or Go microservice. Scale to 2 replicas for HA.
Evaluation Proxy (Optional): If clients cannot use an SDK, a proxy service fetches config from Redis and performs the hashing logic.
API Spec:
POST /experiments: Create a new experiment.
PATCH /experiments/{id}/status: Toggle Active/Inactive (updates DB, appends to status_history, updates Redis).
GET /experiments: List current states (reads from Redis).
Storage
Data Model:
experiments: id (UUID), name, description, status (enum), updated_at.
status_history: id, experiment_id, old_status, new_status, timestamp.
Database Logic:
Use a database transaction when updating status to ensure the experiments table and status_history table stay in sync.
Index on experiment_id and timestamp in the history table for fast reporting.
Cache
Data Structure: Redis Hash experiments_map. Key: experiment_id, Value: JSON string of config.
TTL: No TTL (static config). The Admin Service updates the cache write-through on every change, rather than relying on cache-aside population on read.
Eviction: Least Recently Used (LRU), though 1,000 items will never trigger eviction in a standard Redis instance.
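The status-toggle write path (update the experiments row, append to status_history in the same transaction, then refresh the cache) might look like the sketch below. To keep it runnable without external services, it makes two loud assumptions: `sqlite3` stands in for Postgres, and a plain `dict` stands in for the Redis hash experiments_map. The function names and lowercase status strings are also illustrative.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_schema(db: sqlite3.Connection) -> None:
    db.executescript("""
        CREATE TABLE experiments (
            id TEXT PRIMARY KEY, name TEXT, status TEXT, updated_at TEXT);
        CREATE TABLE status_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT, experiment_id TEXT,
            old_status TEXT, new_status TEXT, timestamp TEXT);
        CREATE INDEX idx_history_exp_ts
            ON status_history (experiment_id, timestamp);
    """)


def set_status(db: sqlite3.Connection, cache: dict,
               experiment_id: str, new_status: str) -> bool:
    """Toggle an experiment's status. Returns False if nothing changed."""
    now = datetime.now(timezone.utc).isoformat()
    with db:  # commits on success, rolls back if an exception is raised
        row = db.execute(
            "SELECT name, status FROM experiments WHERE id = ?",
            (experiment_id,)).fetchone()
        if row is None:
            raise KeyError(experiment_id)
        name, old_status = row
        if old_status == new_status:
            return False  # no transition, so no audit row
        # In Postgres, SELECT ... FOR UPDATE would guard against concurrent toggles.
        db.execute(
            "UPDATE experiments SET status = ?, updated_at = ? WHERE id = ?",
            (new_status, now, experiment_id))
        db.execute(
            "INSERT INTO status_history "
            "(experiment_id, old_status, new_status, timestamp) "
            "VALUES (?, ?, ?, ?)",
            (experiment_id, old_status, new_status, now))
    # Write-through: refresh the cache only after the transaction has committed.
    cache[experiment_id] = json.dumps(
        {"id": experiment_id, "name": name, "status": new_status})
    return True
```

Updating the cache after the commit means a failed transaction never leaks into Redis; the remaining gap (commit succeeds, cache write fails) would need a retry or periodic resync.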
Data Processing
Component: Report Worker (Cron Job).
Logic:
Runs at 23:59 UTC.
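SELECT experiment_id, count(*) FROM status_history WHERE timestamp >= today GROUP BY experiment_id: Identifies the frequency of changes per experiment.
SELECT count(DISTINCT experiment_id) FROM status_history WHERE timestamp >= today: Identifies how many experiments changed.
SELECT status, count(*) FROM experiments GROUP BY status: Gives the current Active vs. Inactive counts.
Result is stored in a daily_reports table.
Wrap Up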
Advanced Topics
Trade-offs: We chose Strong Consistency for the Admin state (Postgres) over Eventual Consistency. This ensures that the daily report is never "wrong" due to dropped messages, though it slightly increases Admin API latency.
Bottlenecks: The primary bottleneck isn't the 1,000 experiments, but the Redis Read Volume if millions of clients poll for config.
Optimization: Clients should cache the config locally for 1-5 minutes and use an ETag/If-None-Match header to only download updates.
Failure Handling:
Redis Down: If Redis is unavailable, the SDK/Service can fall back to a locally cached version or query Postgres (behind a circuit breaker).
Postgres Down: The Admin UI becomes read-only, but existing experiments continue to run as their state is already in Redis.
Alternatives: Instead of a Daily Report Worker, one could use a Database View to calculate stats in real-time, but a batch job is safer for historical snapshotting.
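As a rough sketch of the batch job favored here, the Report Worker's aggregation from the Data Processing section could be written as below. It assumes `sqlite3` as a stand-in for Postgres, ISO-8601 UTC timestamp strings, lowercase status values, and a hypothetical `daily_report` function name. It uses an explicit half-open window so that events near midnight are counted exactly once.

```python
import sqlite3


def daily_report(db: sqlite3.Connection, day_start: str, day_end: str) -> dict:
    """Aggregate state changes in the window [day_start, day_end)."""
    # Current state counts come from the experiments table at run time.
    status_counts = dict(db.execute(
        "SELECT status, count(*) FROM experiments GROUP BY status").fetchall())
    # Per-experiment change frequency within the reporting window.
    per_experiment = dict(db.execute(
        "SELECT experiment_id, count(*) FROM status_history "
        "WHERE timestamp >= ? AND timestamp < ? "
        "GROUP BY experiment_id",
        (day_start, day_end)).fetchall())
    return {
        "active": status_counts.get("active", 0),
        "inactive": status_counts.get("inactive", 0),
        "experiments_changed": len(per_experiment),
        "total_changes": sum(per_experiment.values()),
        "changes_per_experiment": per_experiment,
    }
```

Note that the active/inactive counts reflect the moment the job runs, not the end of the window; running the job shortly after midnight for the previous day keeps the two close.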