The Question

Real-time Payment Fraud Detection System

Design a high-scale, low-latency payment risk scoring system for a global fintech platform. The system must process 10k+ QPS with a P99 latency under 50ms. Address the end-to-end ML lifecycle: specifically, how you handle real-time feature engineering (velocity features), extreme class imbalance, and the inherent 30-90 day label delay (delayed feedback loop). Your design should include robust data and monitoring pipelines to detect adversarial drift and ensure system reliability via fallback mechanisms.

LightGBM

Kafka

Flink

Redis

Spark

ONNX

Treelite

Protobuf

Prometheus

Questions & Insights

Clarifying Questions

Business Goal: What is the primary North Star metric?

Answer: Minimize the Fraud Loss Rate (basis points of GMV) while keeping the False Positive Rate (FPR) below a threshold to minimize user friction (insult rate).

Constraints & Scale: What is the throughput and latency budget?

Answer: 10,000 QPS at peak; P99 latency must be < 50ms for the entire scoring path to avoid payment timeouts.

Data Freshness: How quickly must features reflect new transactions?

Answer: High freshness is critical. Features like "number of transactions in the last 10 minutes" must be updated in near real-time (< 1s).

Label Delay: How long does it take to get ground truth?

Answer: Confirmed fraud (chargebacks) can take 30–90 days, though "verified fraud" from manual reviews may arrive in hours/days.

Assumptions:

We have a corpus of millions of historical transactions.

Fraud is a rare event (< 1% of transactions).

We have access to user profile data, real-time transaction metadata, and device fingerprints.

Thinking Process

Identify the Bottleneck: In fraud detection, the bottleneck isn't usually the model complexity, but feature engineering—specifically calculating real-time aggregations (velocity features) at scale without exceeding latency budgets.

Retrieval vs. Ranking: Unlike RecSys, this is a pure classification/ranking problem. We don't "retrieve" fraud; we score an incoming event against historical patterns.

Scale Strategy: Use a Lambda architecture or a high-performance streaming engine (Flink) to feed a low-latency Feature Store (Redis).

Model Choice: GBDTs (LightGBM/XGBoost) are the industry standard for tabular fraud data. They handle missing values natively, capture non-linear relationships, and are easier to interpret than Deep Learning, which makes them a good fit for an MVP.

Elite Bonus Points

Delayed Feedback Loops: Implement a "Warm-start" or "Incremental training" strategy to handle the 30-90 day label delay by training on proxy labels (e.g., rule-based triggers or manual reviews) until confirmed chargebacks arrive.

Graph-based Signal Extraction: Identify "Fraud Rings" by creating features derived from a real-time Identity Graph (e.g., 5 different accounts using the same device ID or credit card).

Adversarial Drift Monitoring: Since fraudsters actively change tactics, we monitor the Population Stability Index (PSI) of the top-decile scores to detect attack shifts before chargebacks actually roll in.

Decision Engine Orchestration: Separate the "ML Score" from the "Business Logic." The ML model provides a probability, but a separate rules engine (Drools/Python) applies thresholds and "Hard Blocks" for regulatory compliance.
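The separation between score and business logic can be sketched as follows. This is a minimal illustration, not the production rules engine: the thresholds (0.30 and 0.80), the `Transaction` fields, and the placeholder country code "XX" are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds and block list, chosen only for illustration.
CHALLENGE_THRESHOLD = 0.30
DECLINE_THRESHOLD = 0.80
BLOCKED_COUNTRIES = {"XX"}  # placeholder code standing in for a compliance hard block


@dataclass
class Transaction:
    amount: float
    country_code: str


def decide(txn: Transaction, ml_score: float) -> str:
    """Apply business rules on top of the model probability."""
    # Hard blocks run first and ignore the ML score entirely.
    if txn.country_code in BLOCKED_COUNTRIES:
        return "DECLINE"
    if ml_score >= DECLINE_THRESHOLD:
        return "DECLINE"
    if ml_score >= CHALLENGE_THRESHOLD:
        return "CHALLENGE"
    return "APPROVE"
```

Keeping the thresholds outside the model lets the risk team retune them without retraining or redeploying the model artifact.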

Design Breakdown

Requirements

Product Goal: Assign a risk score (0-1000) to every transaction to decide: Approve, Challenge (MFA), or Decline.

Success Metrics:

Online: Fraud Loss Rate, Transaction Approval Rate.

Offline: Precision-Recall AUC (PR-AUC), False Positive Rate at a fixed 95% Recall.

Guardrail Metrics: P99 Latency < 50ms, Model Score Drift (KL Divergence).

System Constraints: 10k QPS, high availability (99.99%), sub-second feature freshness.

Data Availability: User account age, transaction amount, IP geolocation, device fingerprint, historical chargeback logs.

ML Problem Framing

ML Task Type: Binary Classification.

Prediction Target:

P(\text{is\_fraud} = 1 \mid \text{Transaction, User, Context})

Inputs:

User: Account age, historical fraud flags, identity verification status.

Item (Transaction): Amount, currency, merchant category code (MCC).

Context: Device ID, IP-to-Geo, time-of-day, "Velocity" (e.g., 1-hour transaction count).

Outputs: A calibrated probability score.

ML Challenges: Extreme class imbalance (99.9% legit), concept drift (adversarial), and high cost of false positives.

Design Summary & MVP

Concise Summary: A real-time scoring system using a LightGBM model served via a REST API, powered by a Flink-based streaming feature pipeline and a Redis feature store.

Model Architecture & Selection:

Baseline Model: Logistic Regression on raw features.

Target Model: LightGBM (Gradient Boosted Decision Trees).

Choice Rationale: LightGBM handles tabular data, missing values, and categorical features (like Merchant ID) more efficiently than NNs for an MVP. It also trains faster and scores with lower CPU inference latency than a comparable neural network.

ML Life Cycle Summary: Raw logs are ingested via Kafka; Flink computes windowed aggregates; Features are stored in Redis; Model is trained on historical S3 logs using SageMaker/Ray; Serving is done via an optimized C++/Go wrapper around the model artifact.

Simplicity Audit: We avoid complex Graph Neural Networks (GNNs) or LSTMs for the MVP. Aggregated features (count/sum) capture most of the temporal signal without the complexity of RNNs.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Real-time transaction requests, user profile DBs, and third-party device fingerprinting signals.

Data Ingestion: Kafka acts as the backbone. We use Protobuf for schema enforcement so that the serving and training paths consume the same message structure.

Data Storage: S3/HDFS for historical logs. Data is partitioned by event_date and event_hour in Parquet format for efficient analytical queries.

Data Processing: Flink processes the Kafka stream to generate "Velocity" features. It handles out-of-order events using watermarks.

Data Quality: We implement Great Expectations checks on the ingestion layer to catch malformed IP addresses or null transaction amounts before they hit the feature store.
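The kind of check described above can be sketched in plain Python. This is an illustrative stand-in for a Great Expectations suite, not its API, and the event field names (`amount`, `ip`) are assumptions.

```python
import ipaddress


def validate_event(event: dict) -> list:
    """Return a list of data-quality violations for one raw event."""
    errors = []

    amount = event.get("amount")
    if amount is None:
        errors.append("amount is null")
    elif not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")

    try:
        # ip_address accepts both IPv4 and IPv6 and raises ValueError otherwise.
        ipaddress.ip_address(str(event.get("ip", "")))
    except ValueError:
        errors.append("malformed ip")

    return errors
```

Events with a non-empty error list would be routed to a dead-letter topic or dropped before they can corrupt feature-store aggregates.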

Feature Pipeline

Feature Definition:

Velocity: count_trans_1h, sum_amount_24h (per User/Device/Card).

Categorical: merchant_id, device_os, country_code.

Cross features: user_id + merchant_category.

Online vs Offline:

Online: Flink updates Redis with rolling window counts.

Offline: Spark recreates these same windows from S3 logs to ensure Point-in-Time Correctness during training (preventing data leakage).

Feature Store: Redis serves as the online store (< 2ms lookups). We use a unified Feature Registry to define the logic once and generate both Flink SQL and Spark SQL.
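To make the rolling-window logic concrete, here is a small in-memory sketch of a velocity feature such as count_trans_1h. It is a simplification of what Flink and Redis would do together: it assumes events arrive in timestamp order (no watermark handling), and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """In-memory rolling count and sum per key over a fixed time window."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> deque of (timestamp, amount)
        self.sums = defaultdict(float)    # key -> running sum of amounts in window

    def _evict(self, key: str, now: float) -> None:
        # Drop events that are at least one full window older than `now`.
        q = self.events[key]
        while q and q[0][0] <= now - self.window:
            _, amount = q.popleft()
            self.sums[key] -= amount

    def add(self, key: str, ts: float, amount: float) -> None:
        self._evict(key, ts)
        self.events[key].append((ts, amount))
        self.sums[key] += amount

    def features(self, key: str, now: float) -> dict:
        self._evict(key, now)
        return {"count": len(self.events[key]), "sum_amount": self.sums[key]}
```

The same window definition has to be replayed offline at each transaction's own timestamp, which is what keeps training features point-in-time correct.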

Model Architecture

Problem Formulation: Supervised binary classification.

Candidate Model Families:

Logistic Regression: Too simple, misses non-linear interactions.

Deep Learning (MLP): Requires extensive scaling/normalization, harder to interpret.

LightGBM (Winner): Best-in-class for tabular data; native support for categorical features; highly optimized for CPU inference.

Architecture Design:

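Input Layer: Continuous features (left unscaled, since tree splits are invariant to monotonic transformations) + Categorical (integer-encoded and declared as categorical to LightGBM).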

Core: Forest of 500–1000 trees with leaf-wise growth.

Optimization: Use Early Stopping on a validation metric (e.g., PR-AUC) to prevent overfitting. Constrain tree count and tree size (e.g., num_leaves, max_depth) to keep inference latency low.

Training Pipeline

Labeling: We use a "Silver Label" approach. Transactions with high-confidence rule-based flags are labeled "Fraud" early. Chargebacks (Gold Labels) are added as they arrive.

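Handling Imbalance: We use LightGBM's scale_pos_weight parameter rather than SMOTE. This avoids synthesizing artificial minority samples while still penalizing misclassified fraud more heavily. Because the reweighting inflates predicted probabilities, scores must be recalibrated before they are read as fraud probabilities.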

Data Splitting: Time-series split. Train on months 1–5, validate on month 6. This mimics the production environment where we predict the future based on the past.

Retraining: Weekly automated retraining to capture new fraud patterns.
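The imbalance weighting can be illustrated with a little arithmetic. The sketch below uses the common negative-to-positive ratio heuristic for the weight, and a textbook odds correction to undo its effect on probabilities. The correction is an assumption that holds only approximately for a real GBDT; a calibrator fitted on held-out data is usually the safer choice. The function names are hypothetical.

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Common heuristic: ratio of negative to positive training examples."""
    return n_negative / n_positive


def undo_weighting(p_weighted: float, weight: float) -> float:
    """Map a score trained with positive-class weight `weight` back toward
    the unweighted scale, using odds_true = odds_weighted / weight."""
    return p_weighted / (p_weighted + weight * (1.0 - p_weighted))
```

For example, with 999 legitimate transactions per fraud case the weight is 999, and a weighted score of 0.5 maps back to 0.001, which matches the base rate of one fraud in a thousand.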

Serving Pipeline

Serving Pattern: Synchronous Request-Response. The payment gateway waits for our score.

Latency Optimization:

Pre-fetch features from Redis in parallel.

Model is converted to ONNX or Treelite format for optimized C++ execution.

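Reliability: If the model service exceeds its deadline or fails, we fall back to a "Static Rule-base" (e.g., block if amount > $10k and new device) to ensure we are never the cause of a total payment outage. The deadline must sit below the 50ms end-to-end budget so that the fallback response still arrives in time.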
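A minimal sketch of the fallback path, assuming a hypothetical 40ms internal deadline and hypothetical transaction fields (`amount`, `is_new_device`). A production service would more likely rely on RPC deadlines than on a thread pool, so treat this only as an illustration of the control flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)


def static_rules(txn: dict) -> str:
    """Static fallback from the text: block large amounts on a new device."""
    if txn["amount"] > 10_000 and txn["is_new_device"]:
        return "DECLINE"
    return "APPROVE"


def score_with_fallback(txn: dict, model_fn, deadline_s: float = 0.040) -> dict:
    """Return the model score, or a rule-based decision on timeout or failure."""
    future = _POOL.submit(model_fn, txn)
    try:
        return {"source": "model", "score": future.result(timeout=deadline_s)}
    except Exception:
        # Covers both the timeout and any exception raised by the model call.
        # cancel() only helps if the call has not started; a running call
        # keeps its worker thread until it returns.
        future.cancel()
        return {"source": "rules", "decision": static_rules(txn)}


def slow_model(txn: dict) -> float:
    """Stand-in for a stalled model call, used to exercise the fallback."""
    time.sleep(0.2)
    return 0.9
```

Calling `score_with_fallback(txn, slow_model)` returns the rule-based decision, because the simulated model takes longer than the deadline.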

Evaluation Pipeline

Offline: Focus on PR-AUC. In fraud, ROC-AUC is misleading because the overwhelming number of true negatives keeps the False Positive Rate low even when most alerts are wrong. We specifically look at Recall at 1% FPR (What % of fraud do we catch if we only insult 1 in 100 good users?).

Online: Shadow Mode. Run the new model in production and log its scores without acting on them, then compare its fraud predictions with those of the legacy model.
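Recall at a fixed FPR can be computed with a single pass over the scores sorted in descending order. The sketch below is a plain-Python illustration with a hypothetical function name. It assumes binary 0/1 labels and at least one example of each class.

```python
def recall_at_fpr(labels, scores, max_fpr=0.01):
    """Highest recall achievable while keeping FPR <= max_fpr."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(pairs):
        # Consume every item sharing this score, since a threshold
        # cannot separate tied scores.
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break
        best_recall = tp / n_pos
    return best_recall
```

The same function can score the shadow model and the legacy model on identical traffic, which gives a like-for-like comparison at the operating point the business cares about.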

Monitoring Pipeline

Prediction Drift: Monitor the average score per hour. If it jumps from 0.01 to 0.05, it signals either a mass attack or a feature pipeline bug.

Feature Drift: Monitor the distribution of "Transaction Amount" or "IP Country" using Population Stability Index (PSI).

Label Delay Monitoring: Track "Partial Labels." Use manual review samples to get a "Lead Indicator" of model performance before the 90-day chargeback window closes.
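The PSI used for drift monitoring can be sketched as below. The bin edges are supplied by the caller and must be sorted in ascending order. The `eps` floor for empty bins is an implementation assumption, and the function name is hypothetical.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""

    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Number of edges at or below v gives the index of v's bin.
            idx = sum(v >= e for e in edges)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    e_frac = fractions(expected)
    a_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))
```

Here `expected` would be a reference window (for example, scores at training time) and `actual` the most recent window. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, though the alert threshold should be tuned per feature.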

Wrap Up

Final Evaluation

Trade-offs: We prioritize Latency and Explainability (GBDT) over marginal accuracy gains from complex Ensembles/RNNs for the MVP.

Edge Cases:

Cold Start: New users get a "Population Mean" risk score until they have 3 transactions.

Adversarial Attacks: Use a "Rate Limiter" on the scoring service to prevent fraudsters from probing the model scores.

Distinguishing Insight: A Principal Engineer knows that Model Calibration is vital. A score of 0.8 must mean an 80% probability of fraud (about 80 of every 100 transactions scored 0.8 turn out to be fraudulent) so that the financial risk can be calculated in dollars (Expected Loss = Score * Amount).
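The expected-loss formula and a basic calibration check can be sketched as follows. The binning scheme (equal-width bins over [0, 1]) and the function names are assumptions made for illustration.

```python
def expected_loss(p_fraud: float, amount: float) -> float:
    """Expected Loss = Score * Amount, valid only if the score is calibrated."""
    return p_fraud * amount


def calibration_table(labels, scores, n_bins=10):
    """Per-bin rows of (mean predicted score, observed fraud rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        # Clamp so that a score of exactly 1.0 falls in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    rows = []
    for items in bins:
        if items:  # skip empty bins
            mean_score = sum(s for s, _ in items) / len(items)
            fraud_rate = sum(y for _, y in items) / len(items)
            rows.append((mean_score, fraud_rate, len(items)))
    return rows
```

In a well-calibrated model the first two columns of each row are close to each other. A large gap in the high-score bins means the dollar figures derived from the score cannot be trusted.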