The Question
ML Design

Real-Time Online Fraud Detection System

Design a high-scale, real-time fraud detection system for a global payment processor. The system must handle 10k+ QPS with sub-100ms P99 latency while minimizing both financial loss and false positives for legitimate users. Address the challenges of extreme class imbalance (0.1% fraud), significant label delay (30-90 days for chargebacks), and the need for near real-time feature updates (velocity signals). Explain your choice of model architecture, feature engineering strategy (online/offline consistency), and how you would evaluate and monitor the system in an adversarial environment where fraud patterns evolve rapidly.
LightGBM
XGBoost
Kafka
Flink
Redis
Spark
ONNX
SHAP
Feast
Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal to minimize financial loss (high recall) or to minimize user friction/false positives (high precision)? Assumption: We want to maximize Precision at a fixed Recall (e.g., 90%) to ensure we don't block legitimate customers.
Constraints & Scale: What is the peak QPS and the latency budget? Assumption: 10,000 QPS with a P99 latency requirement of <100ms for inline blocking.
Edge Cases: How do we handle "label delay"? (Fraud labels often arrive 30-90 days later via chargebacks). Assumption: We will use a proxy label strategy and incremental retraining.
Data Freshness: How quickly must the system react to a new fraud pattern? Assumption: Near real-time (minutes) for feature updates.

Thinking Process

Identify the Bottleneck: The core challenge in fraud is class imbalance (0.1% fraud) and adversarial evolution (fraudsters change tactics).
Retrieval vs. Ranking: Unlike RecSys, fraud is usually a single-stage high-precision classifier. However, we need a fast path (rules/allow-lists) and a model path.
Feature Engineering is King: Fraud detection lives and dies by "velocity features" (e.g., "how many cards has this IP used in the last hour?").
Scale and Latency: To meet <100ms, we cannot do heavy on-the-fly computation; we need a low-latency Feature Store.

Elite Bonus Points

Handling Delayed and Censored Feedback: Use importance weighting to correct for selection bias, since we only observe true outcomes for transactions we approved (a blocked transaction never produces a chargeback). Pair negative downsampling with recalibration so that the scores remain usable as probabilities.
Human-in-the-loop (HITL): Design a system where high-uncertainty scores are sent to manual review queues, and those labels are fed back into the training loop as "silver standard" labels.
Graph-based Signal Extraction: Even in an MVP, using a "Community ID" or "Sink/Source" score derived offline from a graph of User-IP-Device-Card relationships to detect fraud rings.
Adversarial Robustness: Implementing "adversarial retraining" or monitoring "feature importance drift" to detect when fraudsters have figured out a specific feature (e.g., a specific zip code) and are exploiting it.
Design Breakdown

Requirements

Product Goal: Real-time detection and prevention of fraudulent transactions.
Success Metrics:
Online: Blocked Fraud Volume ($), False Positive Rate (FPR).
Offline: PR-AUC (Precision-Recall Area Under Curve), F1-Score.
Guardrail: P99 Latency < 100ms, System Availability > 99.99%.
System Constraints: 10k QPS, 500M historical transactions, near real-time feature freshness (seconds for velocity features, minutes at worst).
Data Availability: Transaction logs, user profiles, device fingerprints, historical chargeback data.

ML Problem Framing

ML Task Type: Binary Classification.
Prediction Target: P(\text{is\_fraud} | \text{User}, \text{Transaction}, \text{Context}).
Inputs:
User: Account age, verification status, historical spend patterns.
Item (Transaction): Amount, currency, merchant category.
Context: IP Geolocation, Device ID, Time of day.
Velocity: Number of attempts in the last N minutes.
ML Challenges: Extreme class imbalance, non-stationary data (concept drift), and verification latency.

Design Summary & MVP

Concise Summary: A real-time inference service triggered by transaction events, utilizing a hybrid of a Rule Engine for "hard-blocks" and a GBDT model for "probabilistic scoring," backed by a streaming feature pipeline.
Model Architecture & Selection:
Baseline: Heuristic rules (e.g., "Amount > $10k and new device").
Target Model: XGBoost or LightGBM.
Choice Rationale: GBDTs are superior for tabular data with heterogeneous features and handle missing values/outliers natively. They are faster to train and more interpretable than Deep Learning for an MVP.
Simplicity Audit: We avoid Deep Learning or Graph Neural Networks (GNNs) in the first iteration. We focus on high-quality velocity features stored in Redis.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time transaction events via Kafka, merchant metadata from RDS, and user profiles from DynamoDB.
Data Ingestion: Kafka acts as the backbone. We use "At-least-once" semantics; deduplication happens at the processing layer.
Data Storage: S3 (Parquet) for long-term storage and historical training. Partitioned by event_date and event_hour for efficient retrieval.
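The deduplication and partitioning steps above can be sketched as follows. This is a minimal illustration, not a production design: the class and function names are hypothetical, the seen-ID set lives in memory (a real stream processor would use keyed state with a TTL), and the path layout assumes Hive-style event_date/event_hour partitions in UTC.

```python
from datetime import datetime, timezone


def partition_path(base: str, event_ts: float) -> str:
    """Build a Hive-style partition path from an epoch timestamp (UTC)."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return f"{base}/event_date={dt:%Y-%m-%d}/event_hour={dt:%H}"


class Deduplicator:
    """Drops events whose ID was already seen.

    Assumption: every event carries a unique, stable ID. The set grows
    without bound here; real deployments would expire old IDs.
    """

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, event_id: str) -> bool:
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True
```

With at-least-once delivery, a replayed event is filtered by is_new before it reaches the aggregations, so velocity counts are not inflated by redelivery.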

Feature Pipeline

Velocity Features: Using Flink to maintain sliding window aggregations (e.g., count_tx_1h, sum_amt_24h) and sinking them into Redis for low-latency lookups (typically around a millisecond, including the network round trip).
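Feature Store: Use a tool like Feast. It provides a single registry of feature definitions and point-in-time-correct retrieval for training, while serving reads the same features from the online store. Feast does not compute the features itself, so the batch (Spark) and streaming (Flink) transformations must still share logic, or be tested for parity, to limit Training/Serving skew.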
Categorical Encoding: Use Target Encoding or Frequency Encoding for Merchant IDs, as one-hot encoding would lead to a massive, sparse feature space. Target encodings must be fit on training data only (ideally out-of-fold) to avoid label leakage.
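To make the velocity features concrete, here is a minimal single-process sketch of a sliding-window aggregation. It is not Flink code; the class name and method names are hypothetical, and it assumes events for a given key arrive in timestamp order. In the design above the same logic would run as keyed state in the stream processor, with results written to Redis.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-key event count and amount sum over a trailing time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._events = defaultdict(deque)  # key -> deque of (ts, amount)
        self._sums = defaultdict(float)    # key -> running sum of amounts

    def _evict(self, key: str, now: float) -> None:
        """Remove events that have fallen out of the window ending at `now`."""
        q = self._events[key]
        while q and q[0][0] <= now - self.window:
            _, amount = q.popleft()
            self._sums[key] -= amount

    def add(self, key: str, ts: float, amount: float) -> None:
        self._evict(key, ts)
        self._events[key].append((ts, amount))
        self._sums[key] += amount

    def features(self, key: str, now: float) -> dict:
        """Return the window aggregates as of time `now`."""
        self._evict(key, now)
        return {"count": len(self._events[key]), "sum_amt": self._sums[key]}


counter = SlidingWindowCounter(window_seconds=3600)
counter.add("card_1", ts=0, amount=10.0)
counter.add("card_1", ts=1800, amount=20.0)
print(counter.features("card_1", now=3000))
```

Keeping a running sum avoids rescanning the window on every lookup, which matters when the serving path has a tight latency budget.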

Model Architecture

Problem Formulation: Supervised binary classification.
Chosen Model: LightGBM.
Rationale:
Gradient Boosting captures non-linear interactions between features such as "Amount" and "Time of day", which linear models cannot do without hand-built feature crosses.
LightGBM's "Histogram-based" learning is extremely memory efficient for large datasets.
Calibration: Reweighting the positive class or downsampling negatives inflates the predicted fraud probabilities. We use Platt Scaling or Isotonic Regression, fit on a held-out set with the natural class ratio, to recalibrate the scores into true probabilities.

Training Pipeline

Label Construction: Combine "Immediate Labels" (Rules) and "Delayed Labels" (Chargebacks).
Data Splitting: Time-based split (e.g., Train on Months 1-5, Test on Month 6). A random split would lead to data leakage because user behavior is correlated over time, so the model would effectively train on the future of the transactions it is evaluated on.
Imbalance Handling: Instead of SMOTE (which creates synthetic data that can be noisy), we use the scale_pos_weight parameter in LightGBM to up-weight errors on the minority (fraud) class.
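The split and weighting choices above can be sketched without any ML library. The helper names and the row layout (dicts with "ts" and "label" keys) are assumptions for illustration; the negative-to-positive ratio is a common starting value for scale_pos_weight, not a tuned one.

```python
def time_split(rows, cutoff_ts):
    """Split rows into train (before cutoff) and test (at or after cutoff)."""
    train = [r for r in rows if r["ts"] < cutoff_ts]
    test = [r for r in rows if r["ts"] >= cutoff_ts]
    return train, test


def suggested_scale_pos_weight(labels):
    """Return n_negative / n_positive, computed on training labels only."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples in the training set")
    return neg / pos


rows = [{"ts": t, "label": 1 if t % 4 == 0 else 0} for t in range(10)]
train, test = time_split(rows, cutoff_ts=8)
weight = suggested_scale_pos_weight([r["label"] for r in train])
# The weight would then be passed to the trainer, for example in a
# LightGBM params dict: {"objective": "binary", "scale_pos_weight": weight}
print(len(train), len(test), weight)
```

Computing the weight from the training window only keeps information from the test period out of the training configuration.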

Serving Pipeline

Hybrid Approach:
Rules Engine: Fast path (e.g., "Card on Blacklist"). Returns in <5ms.
ML Model: If rules pass, call LightGBM.
Optimization: Use Treelite to compile the tree model into native code, or export it to ONNX and serve it with ONNX Runtime, for low-latency inference.
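The hybrid flow above might look like the following sketch. Everything here is hypothetical: the Decision type, the thresholds, and the score_fn callable, which stands in for the compiled LightGBM model. The "review" band is one way to route uncertain scores to a manual queue.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Set


@dataclass(frozen=True)
class Decision:
    action: str                    # "block", "review", or "allow"
    reason: str                    # which path produced the decision
    score: Optional[float] = None  # model score, if the model was called


def decide(
    tx: Mapping,
    blocklist: Set[str],
    score_fn: Callable[[Mapping], float],
    block_at: float = 0.9,
    review_at: float = 0.5,
) -> Decision:
    """Rules first (fast path), then the model; thresholds are illustrative."""
    if tx["card_id"] in blocklist:
        return Decision("block", "rule:card_on_blocklist")
    score = score_fn(tx)
    if score >= block_at:
        return Decision("block", "model", score)
    if score >= review_at:
        return Decision("review", "model", score)
    return Decision("allow", "model", score)


# Stand-in scorer: flags large amounts. A real service would call the model.
toy_score = lambda tx: 0.95 if tx["amount"] > 10_000 else 0.1
print(decide({"card_id": "c1", "amount": 50}, {"c9"}, toy_score))
```

Returning the reason alongside the action keeps rule hits and model hits distinguishable in logs, which helps when labels arrive weeks later.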

Evaluation Pipeline

Offline: Precision-Recall Curve. We care about Precision at a fixed Recall (e.g., 90%, the operating point assumed in the clarifying questions).
Online: Shadow Mode. Run the new model alongside the old one. Log what the new model would have blocked and wait for the chargeback window (30-90 days) to close to verify accuracy before "promoting" the model to "Active" mode.
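The offline metric, precision at a fixed recall, can be computed directly from scored examples. The sketch below is a plain-Python illustration with a hypothetical function name; it scans thresholds from the highest score down, treats tied scores as one threshold, and returns the best precision among thresholds that reach the target recall.

```python
def precision_at_recall(labels, scores, target_recall):
    """Best precision over all score thresholds with recall >= target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(precision_at_recall(labels, scores, target_recall=1.0))
```

The same function can be applied to shadow-mode logs once the delayed labels have arrived, so the offline and online comparisons use one definition of the metric.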

Monitoring Pipeline

Data Drift: Monitor the distribution of input features (e.g., average transaction amount). If it shifts (e.g., due to inflation or a holiday season), the model may need retraining.
Prediction Drift: Monitor the % of transactions scored > 0.5. A sudden spike might indicate a new fraud attack or a broken feature.
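One common way to quantify the drift described above is the Population Stability Index (PSI), which compares the binned distribution of a feature or score between a reference window and a recent window. PSI is an assumption here, not something the design prescribes; the function name, bin edges, and any alert threshold are illustrative.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples.

    `edges` are ascending bin boundaries; values below the first edge fall
    in bin 0 and values at or above the last edge fall in the final bin.
    Empty bins are floored at `eps` to keep the logarithm finite.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_frac = fractions(expected)
    act_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


reference = [10, 20, 30, 40, 50, 60]
recent = [40, 50, 60, 70, 80, 90]
print(psi(reference, recent, edges=[25, 55]))
```

The same function works on model scores (for example, with an edge at 0.5), which covers the prediction-drift check as well as input-feature drift.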
Wrap Up

Final Evaluation

Trade-offs: We choose Interpretability over marginal gains in AUC. In fraud, when a legitimate user is blocked, Customer Support needs to know why (e.g., SHAP values: "Transaction amount was 10x your average").
Cold Start: For new users with no history, we rely on "Global Velocity" features (e.g., "Is this IP address currently attacking the platform?") and metadata-based rules.
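Adversarial: Fraudsters will eventually "crack" the model. We implement an Automated Retraining Trigger based on daily proxy metrics (score distribution, manual-review outcomes, early chargebacks), since confirmed labels lag by 30-90 days.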