The Question
ML DesignIntelligent Adaptive Rate Limiter
Design an intelligent rate-limiting system for a global-scale ML inference API (e.g., Search or Recommendations). The system must differentiate between legitimate high-burst users and malicious scrapers or DDoS actors. Focus on achieving sub-2ms decision latency while handling 500k+ QPS. Detail the data pipelines for real-time feature extraction, the model architecture for risk scoring, and how to minimize false positives through offline/online evaluation strategies. Explain how you would manage the trade-off between system protection and user experience using an ML-driven dynamic thresholding approach.
LightGBM
Flink
Kafka
Redis
Spark
Feast
Envoy
Prometheus
Questions & Insights
Clarifying Questions
Business Goal: Is the goal to prevent malicious DDoS attacks (security), manage costs for expensive ML inference (LLM/Search), or ensure fair usage across multi-tenant clients?
Assumption*: We are building an Intelligent Rate Limiter for a high-scale ML Recommendation API** to prevent scraping of proprietary data and prioritize high-value users during traffic spikes.
Constraints & Scale: What is the traffic volume and the latency budget?
Assumption*: 500k QPS (Query Per Second), 100M daily active users (DAU). The rate-limiting decision must be made in < 2ms** to avoid impacting the end-to-end P99 latency.
Data Freshness: How quickly must we adapt to new attack patterns?
Assumption: We need near real-time adaptation (minutes) for emerging botnets.
Edge Cases: How do we handle cold-start users (new IPs) and ensure we don't block high-value customers who just happen to have high burstiness?
Assumption: We use a tiered fallback system where new users default to a conservative static limit.
Thinking Process
Identify the Bottleneck: A standard "Token Bucket" is static and doesn't distinguish between a loyal user and a bot scraper. The ML challenge is to predict the Intent/Value of a request in micro-seconds.
Retrieval vs. Ranking: This isn't a retrieval problem. It's a high-speed Binary Classification problem (Allow vs. Block/Throttle).
Architecture Trade-off: We cannot call a heavy DNN for every API request. I need a hybrid approach: a Fast-Path (local cache/heuristics) and a Slow-Path (asynchronous ML scoring to update the Fast-Path's parameters).
Scaling: Feature engineering must be done on a stream (Flink) to capture burstiness metrics across distributed nodes.
Elite Bonus Points
Counterfactual Evaluation: How do we know how much revenue we lost by blocking a specific cohort? I would propose "Lasso Logging" (randomly allowing 0.1% of blocked traffic) to collect ground-truth labels for evaluation.
Feature Hashing for High-Cardinality IPs: To handle millions of IPs without exploding the model size, use the Hashing Trick to map IPs into a fixed-length embedding space.
System-Load Aware Thresholding: Instead of a fixed probability threshold (e.g., block if p > 0.5), make the threshold a function of current system CPU/Memory utilization (Dynamic PID control).
Embedding-based Bot Detection: Use Graph Neural Networks (GNNs) offline to identify clusters of "sybil" accounts/IPs and push those clusters to a Bloom Filter in the serving layer.
Design Breakdown
Requirements
Product Goal: Maximize API availability and protect proprietary content from scrapers.
Success Metrics:
Online Metrics: API Availability (Uptime), False Positive Rate (FPR - blocking real users), Scraper detection rate.
Offline Metrics: AUC-ROC for bot detection, Precision @ 99% Recall (to minimize user friction).
Guardrail Metrics: P99 Latency overhead (< 2ms), Cache Hit Rate for rate-limit rules.
System Constraints: 500k QPS, globally distributed, sub-millisecond local state access.
Data Availability: Request logs (IP, User-Agent, Path), User metadata (account age, tier), historical traffic patterns.
ML Problem Framing
ML Task Type: Binary Classification (Intention: Scraper vs. Human) and Regression (Predicted Request Value).
Prediction Target: P(\text{IsAbusive} | \text{Request, User, Context}).
Inputs:
User Features: Account age, historical spend, "Human" score from CAPTCHA.
Context Features: IP Reputation, ASN (Autonomous System Number), User-Agent, Time-of-day.
Behavioral Features: Request frequency (last 1s, 1m, 1h), entropy of request paths (scrapers are often systematic).
Outputs: A "Risk Score" [0, 1] used to adjust the Token Bucket replenishment rate or trigger a CAPTCHA.
ML Challenges: Label Delay (we only know it was a scraper after it has crawled 10k pages) and Adversarial Drift (bots change patterns).
Design Summary & MVP
Concise Summary: A two-tier rate limiter: a distributed Token Bucket (Redis) using dynamically updated limits provided by an asynchronous ML scoring service.
Model Architecture & Selection:
Baseline Model: Static rules + Heuristics (e.g., > 100 requests/sec = Block).
Target Model: LightGBM or Logistic Regression for the risk score.
Choice Rationale: Need extreme interpretability and fast inference. Tree models excel at tabular "frequency" data common in traffic logs.
ML Life Cycle Summary: Logs -> Flink (Windowed Features) -> S3/Feature Store -> LightGBM Training -> Redis (Score Deployment) -> Envoy Sidecar (Enforcement).
Simplicity Audit: Avoids real-time model inference in the request path. Instead, the model calculates scores offline or asynchronously and updates a lookup table (Redis), keeping the hot path O(1).
Architecture Decision Rationale:
Asynchronous Scoring: Decouples ML latency from API response time.
Feature Hashing: Handles the massive cardinality of IP addresses.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Application logs (NGINX/Envoy) and User Event logs.
Data Ingestion: Kafka acts as the backbone. We use a Schema Registry to ensure the Rate Limiter service can parse request attributes consistently.
Data Storage: S3 for long-term storage (Parquet format, partitioned by
date/hour).Data Processing: Flink calculates Sliding Window Aggregates (e.g.,
requests_per_ip_last_60s). This is critical for catching "low and slow" scrapers.Data Quality: De-duplication in Kafka and schema validation to prevent malformed requests from poisoning the model.
Feature Pipeline
Feature Definition:
Velocity: Requests per (1s, 10s, 1m) for (IP, UserID, API Key).
Metadata: Entropy of the
User-Agent string, CIDR reputation.Behavioral: Ratio of 404/403 errors to 200s (scrapers hit missing pages more often).
Online vs. Offline: Flink pushes real-time velocities to Redis. Spark pre-computes long-term user "trust scores" weekly.
Consistency: Use a Feature Store (e.g., Feast) to ensure the same code generates features for both training and serving.
Model Architecture
Problem Formulation: Supervised Binary Classification.
Target Model: LightGBM.
Rationale: It handles tabular data with missing values (e.g., new users missing history) effectively. It's also fast enough to re-score a user profile every 30 seconds as their behavior changes.
Inference Strategy: We do not run the model for every request. We run the model when a user's feature set changes significantly (e.g., every 100 requests) and cache the resulting "Limit Tier" in Redis.
Training Pipeline
Labeling: We use a "Multi-label" approach. Positive labels = Confirmed DDoS/Scraper (from security team). Negative labels = High-value users (verified accounts).
Data Splitting: Time-based split is mandatory. We train on Monday-Friday and test on Saturday to ensure the model generalizes to different traffic cycles.
Handling Imbalance: Use Downsampling on the "Allow" class since 99.9% of traffic is typically legitimate.
Serving Pipeline
Serving Pattern: Asynchronous Tiering.
The API Gateway checks Redis for a
user_limit.If missing, use a "Default" tier.
An async worker periodically re-calculates the
user_limit using the LightGBM model based on recent Flink features.Latency Optimization: The hot path is a single Redis
GET. The ML complexity is pushed out of the request-response loop.Evaluation Pipeline
Offline: AUC and PR-AUC. Focus on Precision at 0.01% FPR. We cannot afford to block legitimate high-value users.
Online: A/B test the ML Rate Limiter against a static one. Measure "Total Scraped Pages" (via honey-pots) vs. "System Latency."
Monitoring Pipeline
Prediction Drift: Monitor the distribution of Risk Scores. If the model suddenly starts scoring everyone as "Risk=0.9", trigger a fallback to static limits.
System Health: Track the latency of the Flink -> Redis -> Gateway path.
Wrap Up
Final Evaluation
Observability: Use "Shadow Mode" where the ML model suggests a block but the system allows it, logging the outcome to verify accuracy before "Enforcement Mode."
Edge Cases: Cold Start IPs get a global average limit. If they behave well for 5 minutes, the ML model upgrades their limit tier.
Trade-offs: We sacrifice "instant" reaction (latency of a few seconds for the ML to update Redis) for request-path performance (sub-ms lookup).
Distinguishing Insight: Adversarial Retraining. Scrapers will adapt. I would implement a "Honey-pot" feedback loop where scrapers that hit "fake" data are automatically labeled as "positive" and fed back into the training pipeline daily.