The Question
DesignScalable Strategy Backtesting Platform
Design an internal platform for quantitative researchers to perform large-scale historical backtesting. The system must allow users to submit custom Python-based strategy code, execute these jobs in parallel across a scalable compute fleet, and ensure strict resource isolation and security for untrusted code execution. Consider how to efficiently manage and serve terabytes of historical market data while providing comprehensive performance reporting and job lifecycle tracking.
S3
Parquet
Docker
Kubernetes Jobs
PostgreSQL
SQS
Python
gVisor
Questions & Insights
Clarifying Questions
What is the expected scale of the system? (Assumed: ~100 internal researchers, ~5,000 backtests per day, each ranging from minutes to hours in execution time).
What is the granularity and volume of historical market data? (Assumed: Minute-level bars for equities and crypto, stored in compressed Parquet format on Object Storage, totaling ~10TB).
What programming languages must be supported? (Assumed: Python is the primary language due to its dominant ecosystem in quantitative finance like Pandas, NumPy, and PyTorch).
What is the definition of "secure code execution"? (Assumed: Isolation to prevent one researcher's infinite loop or memory leak from crashing the system, and preventing unauthorized access to other researchers' strategies or sensitive credentials).
How are results consumed? (Assumed: A JSON/CSV performance report and a series of time-series charts stored for later retrieval).
Thinking Process
Core Bottleneck: Resource Contention and I/O. Backtesting is both CPU-bound (complex math) and I/O-bound (loading massive historical datasets).
Step 1: Containerized Execution. How do we run arbitrary code safely? We package each backtest into a transient Docker container with strict CPU/Memory limits.
Step 2: Data Locality. How do we prevent the network from becoming the bottleneck? We use a "Pull-once" or "Sidecar" approach where market data is cached locally on the worker node or streamed efficiently using partitioned Parquet files.
Step 3: Asynchronous Orchestration. How do we handle long-running jobs? We use a Task Queue (Producer-Consumer) pattern to decouple the API from the heavy lifting.
Step 4: Result Persistence. How do we store metrics? We separate metadata (job status) in a RDBMS and raw performance logs in Object Storage.
Bonus Points
Vectorized Backtesting Engine: Suggesting the use of libraries like
VectorBT or Numba to execute strategies across the entire time-series at once rather than event-driven loops, improving performance by 10-100x.Data Versioning & Point-in-Time Correctness: Implementing a "Snapshot" mechanism for market data to ensure that a backtest run today produces the exact same results six months from now, even if historical data was corrected (e.g., corporate actions/splits).
Spot Instance Orchestration: Utilizing AWS Spot Instances or GCP Preemptible VMs for worker nodes with a "Checkpoint-and-Resume" logic to reduce compute costs by up to 90%.
Sidecar Data Prefetching: Using a sidecar process in the pod to pre-fetch the next required chunk of market data while the main strategy code is processing the current chunk.
Design Breakdown
Functional Requirements
Researchers can upload Python scripts/notebooks.
System executes the script against specified symbols and date ranges.
System generates a report (Sharpe ratio, Max Drawdown, CAGR, etc.).
Users can track the status of their backtests (Pending, Running, Completed, Failed).
Support for parallel execution of multiple backtests.
Non-Functional Requirements
Isolation: Each backtest must run in a sandbox; no shared state between runs.
Scalability: Ability to scale worker nodes based on queue depth.
Reliability: If a worker node fails, the job should be retried or marked failed gracefully.
Observability: Logs from the researcher's code must be captured and presented in the UI.
Estimation
Storage: 10 years of 1-minute data for 5,000 tickers ≈ 2.6 billion rows. Compressed Parquet: ~500 GB - 1 TB.
Compute: If 1 backtest takes 10 minutes on 2 cores, and we want 100 concurrent tests: 200 Cores needed.
Egress/Ingress: Loading 1 year of data for one ticker (~500k rows) is ~20MB. A broad backtest might pull 2-5GB of data.
Blueprint
Concise Summary: A microservices-based architecture where an API accepts code uploads, stores them in Object Storage, and queues a job. A pool of containerized workers pulls the code and market data to execute the test.
Major Components:
API Service: Handles user authentication, code uploads, and job metadata management.
Task Queue: Manages the distribution of backtest jobs to available workers.
Worker Pool: Scalable fleet of containers that execute the untrusted code in isolation.
Object Storage: Holds the historical market data (Source of Truth) and the generated reports.
Metadata DB: Stores job status, user info, and pointers to results.
Simplicity Audit: This design avoids complex "Streaming" architectures. It treats backtesting as a batch job, which is the most reliable and easiest-to-debug pattern for researchers.
Architecture Decision Rationale:
Why this architecture?: It provides the strongest security boundary (Docker) and the most flexible scaling (K8s Jobs or Celery workers).
Functional Satisfaction: Covers upload, execution, and reporting via a decoupled async flow.
Non-functional Satisfaction: Scalability is handled by the queue; isolation is handled by the container runtime.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
API Service: Stateless, deployed in Multi-AZ. Scales on CPU/Request count.
Worker Fleet: Deployed on Kubernetes (EKS/GKE). Uses K8s Jobs for one-off execution or a persistent pool of workers (Celery/RQ). Scaling signal is the Queue Depth.
API Schema Design:
POST /v1/backtest: Uploads code (multipart) + params (symbols, dates). Returns job_id.GET /v1/backtest/{id}: Returns status and metadata.GET /v1/backtest/{id}/logs: Streams logs from Object Storage.Idempotency:
client_request_token header to prevent double submission of the same backtest.Resilience & Reliability:
Dead Letter Queue (DLQ): Jobs that crash 3 times are moved to DLQ for manual inspection.
Timeouts: Hard limit (e.g., 2 hours) on container execution to prevent zombie processes.
Observability:
Metrics: Track "Time to Start" (Queue Latency) and "Backtest Duration."
Structured Logging: All
stdout/stderr from the researcher's code is redirected to a log file and uploaded to S3.Security:
gVisor / Kata Containers: For higher security isolation if the internal environment is "zero-trust."
Network Policy: Workers have no outbound internet access except to the Market Data S3 bucket.
Storage
Access Pattern:
Heavy sequential reads of market data (WORM: Write Once Read Many).
Frequent updates to Job Status (OLTP).
Database Table Design (Metadata DB):
backtest_jobs: id (UUID), user_id, status (enum), code_s3_path, result_s3_path, created_at, params (JSONB).Technical Selection:
PostgreSQL: For metadata due to ACID requirements and JSONB support for flexible strategy parameters.
S3 / Object Storage: For market data (Parquet) and result artifacts (HTML/PDF/CSV).
Distribution Logic:
Market Data is partitioned by
symbol and year/month in S3: s3://market-data/equities/AAPL/2023/01.parquet.This allows workers to fetch only the specific files needed for the requested date range.
Cache
Purpose & Justification: Market data is frequently reused (e.g., multiple researchers testing on AAPL). Reading from S3 every time is slow and adds cost.
Implementation:
Local SSD Cache: Worker nodes use a
hostPath volume in Kubernetes to cache Parquet files. Before fetching from S3, the worker checks the local disk.LRU Eviction: Simple script to clear the oldest data when the disk hits 80% capacity.
Failure Handling: If the cache is unavailable, the system falls back to a direct S3 read.
Messaging
Purpose & Decoupling: Decouples the API from the long-running execution.
Event / Topic Schema:
Topic:
backtest_tasks. Payload:
{job_id, s3_code_path, symbols: [], start_date, end_date}.Technical Selection: Redis (using BullMQ or Celery) or AWS SQS.
Rationale: SQS is preferred for MVP due to zero operational overhead and built-in visibility timeouts.
Data Processing
Processing Model: Each worker runs a Python Interpreter inside a Docker container.
Sandbox Configuration:
No-Network: Disable network interfaces in the container.Resource Quotas: cpu: 2, memory: 4Gi.Read-Only Root FS: Prevent researchers from modifying the worker environment.Technical Selection: Docker + Kubernetes Jobs. This is the industry standard for isolated, reproducible batch compute.
Wrap Up
Advanced Topics
Trade-offs (Latency vs. Cost): We choose asynchronous processing. Users don't get instant results, but the system remains stable under load and we save money by not over-provisioning workers.
Bottleneck Analysis: The primary bottleneck is S3 to Worker throughput.
Optimization*: Use S3 Select or DuckDB on the worker to filter data before* loading it into the Python memory space, significantly reducing RAM usage.
Security & Privacy: Even for internal researchers, "Code Injection" into the host is a risk. We use User Namespaces in Docker to ensure the process inside the container does not have root privileges on the host.
Distinguishing Insights:
Market Data API: Instead of researchers writing raw S3 logic, provide a pre-installed Python library (
internal_data_lib) in the worker image. This abstracts away the S3 paths and Parquet loading, ensuring researchers use the most efficient data-loading patterns.