The Question
ML DesignML-Driven Adaptive Rate Limiting
How would you design an intelligent, large-scale rate limiting system that uses machine learning to dynamically adjust request quotas based on user behavior and system health, ensuring high availability while mitigating sophisticated bot attacks?
TCN
MTL
XGBoost
Thompson Sampling
Logistic Regression
Questions & Insights
Clarifying Questions
Business Goal: The primary North Star is Availability vs. False Positive Rate (FPR). We want to maximize legitimate request throughput while keeping system utilization below 80%.
Constraints & Scale:
Scale: 10M+ QPS across global regions.
Latency: The rate-limiting decision (allow/deny) must add <2ms to the critical path. ML inference cannot happen synchronously for every request unless it's an edge-cached score.
Corpus: Hundreds of millions of unique IP/User-ID combinations.
Edge Cases: Zero-day bot attacks, high-volume legitimate "flash crowds," and IP-spoofing/rotation.
Assumptions: I assume a tiered architecture where a fast "Static Layer" (Redis-based) is governed by an "Intelligence Layer" (ML-driven) that updates quotas asynchronously.
Thinking Process
The Bottleneck: Synchronous ML inference is too slow for a gateway. The solution must involve asynchronous parameter updates. The ML model calculates a "Risk Score" or "Dynamic Quota" which is then cached in a fast lookup store (Redis).
Retrieval vs. Ranking: In this context, "Retrieval" is identifying high-risk segments/IPs, and "Ranking" is determining the exact throttling intensity.
Scaling: Use a distributed stream processing engine (Flink) to aggregate features like "Request Count per IP in last 60s" in real-time.
Feedback Loop: We need a way to label data. If a user is challenged (CAPTCHA) and passes, that's a negative label for "malicious."
Elite Bonus Points
Online-Offline Consistency (Lambda Architecture): Using the same feature definitions for real-time aggregation (Flink) and batch historical training (Spark) to prevent training-serving skew.
Counterfactual Evaluation: Estimating what the system load would have been if we hadn't throttled specific traffic, using a small "control" bucket of un-throttled traffic.
Multi-Armed Bandits for Thresholding: Instead of a fixed threshold, use Thompson Sampling to explore the optimal rate limit for a specific user segment that minimizes system cost without hurting engagement.
Differential Privacy: Ensuring that our rate-limiting features (like IP history) are stored and processed in a way that complies with GDPR/CCPA.
Design Breakdown
Functional Reqs
Dynamic Throttling: Users should be allowed or denied based on a combination of fixed rules and an ML risk score.
Tiered Access: Support different quotas for guest vs. authenticated vs. premium users.
Challenge-Response Integration: Integration with CAPTCHA or MFA for "gray-listed" requests.
Non-Functional Reqs
Ultra-Low Latency: <2ms P99 overhead for the "Allow/Deny" decision.
High Availability: If the ML service fails, the system must fail-open to a safe static default.
Eventual Consistency: It is acceptable for a risk score update to propagate within 1-5 seconds.
ML Problem Framing
ML Objective: Predict P(\text{Abusive} | \text{Request\_Stream}).
ML Category: Binary Classification (Abusive vs. Legitimate) or Anomaly Detection.
Input/Output/Label:
Input: User/IP history, request headers, entropy of the request path, system load metrics.
Output: A risk score [0, 1].
Label: 1 if the request sequence matches known bot patterns or results in a failed challenge; 0 if the user passes a CAPTCHA or behaves like a human.
Data Prep & Features
Data Pipeline: Kafka for request logs, Flink for real-time windowed aggregations.
Feature Engineering:
User/Item Features: Account age, historical average QPS, geo-location stability.
Context/Sequence Features: Request velocity (1m, 5m, 1h windows), header order (bots often have static header signatures), Inter-arrival time variance (humans have high variance; bots are often periodic).
Advanced Features: Graph-based signals (Is this IP part of a known botnet cluster?), Embedding of the User-Agent string.
Feature Store: Redis for low-latency online feature retrieval during the scoring phase.
Model Architecture
Model Choice:
Level 1 (Fast): Logistic Regression or simple Heuristics for real-time scoring.
Level 2 (Deep): Temporal Convolutional Networks (TCN) or LSTMs to analyze the sequence of requests. Bots exhibit non-random patterns over time.
Loss Function: Weighted Cross-Entropy (to penalize False Positives more heavily, as blocking a real user is costly).
Architecture: A Multi-Task Learning (MTL) model that predicts both "Bot Probability" and "Future System Load" to determine if we should tighten limits globally.
Training & Serving
Optimization: Distributed training using Adam optimizer. Focus on Precision-Recall AUC.
Model Serving:
Asynchronous Scoring: The model consumes a stream of features and periodically (every few seconds) writes a "Max-QPS" value to a Redis hash for each active User/IP.
Sidecar Inference: The Gateway checks Redis. If no entry exists, it uses a default.
Addressing Challenges:
Position Bias: Not applicable here, but Selection Bias is huge (we only have labels for people we didn't block or those we challenged). Use "exploration" traffic to mitigate this.
System Architecture
Pipeline Deep Dive
Data Pipeline
Ingestion: Every request produces a log event. High-volume events are sampled for training but 100% are sent to Flink for counting.
Storage: S3 stores PB-scale logs partitioned by
date/hour for historical analysis and re-training.Feature Pipeline
Extraction: We use "Slippery Windows" in Flink. For every incoming request, we update the counts for that User-ID/IP for the last 10 seconds, 1 minute, and 10 minutes.
Consistency: To avoid skew, the same Java/Scala code used in Flink is used in Spark to generate features for offline training.
Training Pipeline
Offline Training: We train a new model daily. We use a Time-based split (train on Monday-Friday, validate on Saturday) to ensure the model generalizes to future bot patterns.
Orchestration: Airflow manages the DAG, ensuring the Spark job completes before the training job starts.
Serving Pipeline
Retrieval: When a request hits the Gateway, it does a single
GET to Redis.Ranking (Thresholding): The "Inference Service" runs out-of-band. It pulls features from Redis, calculates the risk score, and calculates:
Allowed_QPS = Base_Quota * (1 - Risk_Score). This value is pushed to the Quota Cache.Evaluation Pipeline
Online Experimentation: Use Shadow Mode where the ML model calculates a limit, but the Gateway still uses the static rule. We compare the "would-be-blocked" traffic against actual abuse.
Interleaved Testing: Not applicable; we use standard A/B buckets based on User-ID.
Monitoring Pipeline
Drift: Monitor the distribution of Risk Scores. If the mean score shifts from 0.1 to 0.4 suddenly, a new botnet might be active or the model is hallucinating.
System Health: If Redis latency for the Quota Cache exceeds 1ms, the Gateway falls back to a static hard-coded limit.
Wrap Up
Advanced Topics
Offline Metrics: AUC-PR (Precision-Recall) is preferred over AUC-ROC because the data is highly imbalanced (few bots vs. many users).
Online Metrics:
Availability: % of successful 2xx responses for legitimate users.
Abuse Rate: % of downstream server errors (5xx) caused by overload.
Failure Modes:
Cache Stampede: If Redis goes down, all Gateways default to a safe-mode static limit.
Over-blocking: If a legitimate "Flash Crowd" (e.g., product launch) occurs, the model might flag it as a bot. Mitigation: Add a "Global System Capacity" feature so the model knows the system can handle the spike.
Responsible AI: Ensure the model doesn't disproportionately block users from specific geographic regions (Bias audit).