The Question

Intelligent Adaptive Rate Limiter

Design an intelligent rate-limiting system for a global-scale ML inference API (e.g., Search or Recommendations). The system must differentiate between legitimate high-burst users and malicious scrapers or DDoS actors. Focus on achieving sub-2ms decision latency while handling 500k+ QPS. Detail the data pipelines for real-time feature extraction, the model architecture for risk scoring, and how to minimize false positives through offline/online evaluation strategies. Explain how you would manage the trade-off between system protection and user experience using an ML-driven dynamic thresholding approach.

LightGBM

Flink

Kafka

Redis

Spark

Feast

Envoy

Prometheus

Questions & Insights

Clarifying Questions

Business Goal: Is the goal to prevent malicious DDoS attacks (security), manage costs for expensive ML inference (LLM/Search), or ensure fair usage across multi-tenant clients?

Assumption*: We are building an Intelligent Rate Limiter for a high-scale ML Recommendation API** to prevent scraping of proprietary data and prioritize high-value users during traffic spikes.

Constraints & Scale: What is the traffic volume and the latency budget?

Assumption*: 500k QPS (Query Per Second), 100M daily active users (DAU). The rate-limiting decision must be made in < 2ms** to avoid impacting the end-to-end P99 latency.

Data Freshness: How quickly must we adapt to new attack patterns?

Assumption: We need near real-time adaptation (minutes) for emerging botnets.

Edge Cases: How do we handle cold-start users (new IPs) and ensure we don't block high-value customers who just happen to have high burstiness?

Assumption: We use a tiered fallback system where new users default to a conservative static limit.

Thinking Process

Identify the Bottleneck: A standard "Token Bucket" is static and doesn't distinguish between a loyal user and a bot scraper. The ML challenge is to predict the Intent/Value of a request in micro-seconds.

Retrieval vs. Ranking: This isn't a retrieval problem. It's a high-speed Binary Classification problem (Allow vs. Block/Throttle).

Architecture Trade-off: We cannot call a heavy DNN for every API request. I need a hybrid approach: a Fast-Path (local cache/heuristics) and a Slow-Path (asynchronous ML scoring to update the Fast-Path's parameters).

Scaling: Feature engineering must be done on a stream (Flink) to capture burstiness metrics across distributed nodes.

Elite Bonus Points

Counterfactual Evaluation: How do we know how much revenue we lost by blocking a specific cohort? I would propose "Lasso Logging" (randomly allowing 0.1% of blocked traffic) to collect ground-truth labels for evaluation.

Feature Hashing for High-Cardinality IPs: To handle millions of IPs without exploding the model size, use the Hashing Trick to map IPs into a fixed-length embedding space.

System-Load Aware Thresholding: Instead of a fixed probability threshold (e.g., block if

p > 0.5

), make the threshold a function of current system CPU/Memory utilization (Dynamic PID control).

Embedding-based Bot Detection: Use Graph Neural Networks (GNNs) offline to identify clusters of "sybil" accounts/IPs and push those clusters to a Bloom Filter in the serving layer.

Design Breakdown

Requirements

Product Goal: Maximize API availability and protect proprietary content from scrapers.

Success Metrics:

Online Metrics: API Availability (Uptime), False Positive Rate (FPR - blocking real users), Scraper detection rate.

Offline Metrics: AUC-ROC for bot detection, Precision @ 99% Recall (to minimize user friction).

Guardrail Metrics: P99 Latency overhead (< 2ms), Cache Hit Rate for rate-limit rules.

System Constraints: 500k QPS, globally distributed, sub-millisecond local state access.

Data Availability: Request logs (IP, User-Agent, Path), User metadata (account age, tier), historical traffic patterns.

ML Problem Framing

ML Task Type: Binary Classification (Intention: Scraper vs. Human) and Regression (Predicted Request Value).

Prediction Target:

P(\text{IsAbusive} | \text{Request, User, Context})

Inputs:

User Features: Account age, historical spend, "Human" score from CAPTCHA.

Context Features: IP Reputation, ASN (Autonomous System Number), User-Agent, Time-of-day.

Behavioral Features: Request frequency (last 1s, 1m, 1h), entropy of request paths (scrapers are often systematic).

Outputs: A "Risk Score" [0, 1] used to adjust the Token Bucket replenishment rate or trigger a CAPTCHA.

ML Challenges: Label Delay (we only know it was a scraper after it has crawled 10k pages) and Adversarial Drift (bots change patterns).

Design Summary & MVP

Concise Summary: A two-tier rate limiter: a distributed Token Bucket (Redis) using dynamically updated limits provided by an asynchronous ML scoring service.

Model Architecture & Selection:

Baseline Model: Static rules + Heuristics (e.g., > 100 requests/sec = Block).

Target Model: LightGBM or Logistic Regression for the risk score.

Choice Rationale: Need extreme interpretability and fast inference. Tree models excel at tabular "frequency" data common in traffic logs.

ML Life Cycle Summary: Logs -> Flink (Windowed Features) -> S3/Feature Store -> LightGBM Training -> Redis (Score Deployment) -> Envoy Sidecar (Enforcement).

Simplicity Audit: Avoids real-time model inference in the request path. Instead, the model calculates scores offline or asynchronously and updates a lookup table (Redis), keeping the hot path O(1).

Architecture Decision Rationale:

Asynchronous Scoring: Decouples ML latency from API response time.

Feature Hashing: Handles the massive cardinality of IP addresses.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Application logs (NGINX/Envoy) and User Event logs.

Data Ingestion: Kafka acts as the backbone. We use a Schema Registry to ensure the Rate Limiter service can parse request attributes consistently.

Data Storage: S3 for long-term storage (Parquet format, partitioned by date/hour).

Data Processing: Flink calculates Sliding Window Aggregates (e.g., requests_per_ip_last_60s). This is critical for catching "low and slow" scrapers.

Data Quality: De-duplication in Kafka and schema validation to prevent malformed requests from poisoning the model.

Feature Pipeline

Feature Definition:

Velocity: Requests per (1s, 10s, 1m) for (IP, UserID, API Key).

Metadata: Entropy of the User-Agent string, CIDR reputation.

Behavioral: Ratio of 404/403 errors to 200s (scrapers hit missing pages more often).

Online vs. Offline: Flink pushes real-time velocities to Redis. Spark pre-computes long-term user "trust scores" weekly.

Consistency: Use a Feature Store (e.g., Feast) to ensure the same code generates features for both training and serving.

Model Architecture

Problem Formulation: Supervised Binary Classification.

Target Model: LightGBM.

Rationale: It handles tabular data with missing values (e.g., new users missing history) effectively. It's also fast enough to re-score a user profile every 30 seconds as their behavior changes.

Inference Strategy: We do not run the model for every request. We run the model when a user's feature set changes significantly (e.g., every 100 requests) and cache the resulting "Limit Tier" in Redis.

Training Pipeline

Labeling: We use a "Multi-label" approach. Positive labels = Confirmed DDoS/Scraper (from security team). Negative labels = High-value users (verified accounts).

Data Splitting: Time-based split is mandatory. We train on Monday-Friday and test on Saturday to ensure the model generalizes to different traffic cycles.

Handling Imbalance: Use Downsampling on the "Allow" class since 99.9% of traffic is typically legitimate.

Serving Pipeline

Serving Pattern: Asynchronous Tiering.

The API Gateway checks Redis for a user_limit.

If missing, use a "Default" tier.

An async worker periodically re-calculates the user_limit using the LightGBM model based on recent Flink features.

Latency Optimization: The hot path is a single Redis GET. The ML complexity is pushed out of the request-response loop.

Evaluation Pipeline

Offline: AUC and PR-AUC. Focus on Precision at 0.01% FPR. We cannot afford to block legitimate high-value users.

Online: A/B test the ML Rate Limiter against a static one. Measure "Total Scraped Pages" (via honey-pots) vs. "System Latency."

Monitoring Pipeline

Prediction Drift: Monitor the distribution of Risk Scores. If the model suddenly starts scoring everyone as "Risk=0.9", trigger a fallback to static limits.

System Health: Track the latency of the Flink -> Redis -> Gateway path.

Wrap Up

Final Evaluation

Observability: Use "Shadow Mode" where the ML model suggests a block but the system allows it, logging the outcome to verify accuracy before "Enforcement Mode."

Edge Cases: Cold Start IPs get a global average limit. If they behave well for 5 minutes, the ML model upgrades their limit tier.

Trade-offs: We sacrifice "instant" reaction (latency of a few seconds for the ML to update Redis) for request-path performance (sub-ms lookup).

Distinguishing Insight: Adversarial Retraining. Scrapers will adapt. I would implement a "Honey-pot" feedback loop where scrapers that hit "fake" data are automatically labeled as "positive" and fed back into the training pipeline daily.