The Question
Design Scalable Payment Processing System
Design a globally available, highly reliable payment processing system. The system must handle high transaction volumes (10,000 TPS) while ensuring strict ACID compliance for financial records. Address the challenges of exactly-once processing, integration with multiple third-party payment gateways (e.g., Stripe, Adyen), handling asynchronous webhooks, and a robust reconciliation mechanism to detect discrepancies between internal state and external gateway reality. Define your strategy for idempotency, state management, and PCI-DSS compliance scope reduction.
PostgreSQL
Redis
Kafka
Apache Flink
Kubernetes
Docker
gRPC
OAuth2
TLS 1.3
AES-256
SAGA Pattern
CDC
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected peak Transactions Per Second (TPS)?
Assumption: 10,000 TPS peak, supporting 100M+ Daily Active Users.
Consistency Requirements: Is eventual consistency acceptable for the ledger or transaction status?
Assumption: Strict consistency (ACID) is required for payment status and ledger to prevent double-spending or lost records.
Payment Methods: Are we supporting only Credit Cards, or also Wallets (Apple Pay), Bank Transfers, and Crypto?
Assumption: MVP focuses on Credit Cards and Digital Wallets via 3rd-party gateways (Stripe, Adyen).
Global Presence: Does the system need to handle cross-border payments and multi-currency?
Assumption: Single region for MVP (e.g., US-East-1) with multi-currency support in the database.
PCI Compliance: Are we storing raw PAN (Primary Account Number) data?
Assumption: No. We use a Tokenization Provider to minimize PCI-DSS scope.
Thinking Process
Core Bottleneck: The primary challenge is ensuring Exactly-Once Processing in a distributed environment where network partitions and timeouts are common.
Idempotency Strategy: How do we prevent double-charging a customer when a "Retry" is triggered by the client or a timeout occurs at the gateway?
State Machine Management: How do we transition a payment from CREATED to PENDING, AUTHORIZED, CAPTURED, or FAILED reliably?
Asynchronous Reconciliation: How do we ensure our internal records match the external gateway's reality at the end of the day?
Bonus Points
Deterministic State Machine: Implementing a state machine where transitions are guarded by database constraints to prevent race conditions.
Distributed Tracing with Idempotency: Passing the X-Idempotency-Key across all microservices and even to the external Payment Service Provider (PSP) to ensure end-to-end uniqueness.
Two-Phase Commit (2PC) vs. Sagas: Using an Orchestrated Saga pattern for long-running payment flows to maintain data integrity without locking the database for seconds.
Shadow Mode Reconciliation: Running a continuous, real-time reconciliation stream that flags discrepancies within seconds rather than waiting for a T+1 batch job.
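A guarded state machine of this kind can be sketched as a conditional UPDATE that only succeeds when the row still holds the status and version the caller read. This is a minimal illustration, not the definitive schema: sqlite3 stands in for PostgreSQL so the example runs anywhere, the state names come from the list above, and the ALLOWED edges, open_db, and transition are assumptions made for the example.

```python
import sqlite3

# State names come from the design; the exact edges are an assumption.
ALLOWED = {
    "CREATED": {"PENDING", "FAILED"},
    "PENDING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "FAILED"},
    "CAPTURED": set(),
    "FAILED": set(),
}


def open_db():
    """Create an in-memory table holding just the columns the guard needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    return conn


def transition(conn, payment_id, expected_status, expected_version, new_status):
    """Apply one transition; return True only if this caller won the race."""
    if new_status not in ALLOWED.get(expected_status, set()):
        return False
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE payments SET status = ?, version = version + 1 "
            "WHERE id = ? AND status = ? AND version = ?",
            (new_status, payment_id, expected_status, expected_version),
        )
    # Zero rows updated means another writer already moved the payment on.
    return cur.rowcount == 1
```

Two workers that both read (CREATED, version 0) cannot both succeed: the second UPDATE matches no row, so the loser re-reads the payment instead of overwriting it.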
Design Breakdown
Functional Requirements
Core Use Cases:
Execute Payment: Process a transaction from a customer to a merchant.
Refund: Reverse a previously successful transaction.
Payment Status: Allow users/merchants to query the current state of a payment.
Webhook Handling: Receive and process asynchronous updates from Payment Gateways.
Scope Control:
In-Scope: Pay-in (Checkout) flow, Tokenization integration, Idempotency, and Ledgering.
Out-of-Scope: Pay-out (Disbursements to bank accounts), Fraud Detection Engine (assume an external API call), and detailed Tax Calculation.
Non-Functional Requirements
Scale: Support 10k TPS with horizontal scaling of stateless services.
Latency: P99 for payment initiation should be < 500ms (excluding external gateway latency).
Availability & Reliability: 99.999% availability (High Availability) to avoid revenue loss.
Consistency: Strong consistency for transaction records and idempotency keys.
Fault Tolerance: Automatic retries with exponential backoff for transient gateway failures.
Security & Privacy: PCI-DSS compliance, TLS 1.3 for all transit, and encryption at rest for PII.
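The fault-tolerance requirement can be illustrated with a small retry helper. This is a sketch under stated assumptions: TransientGatewayError and call_with_backoff are hypothetical names, the 2s, 4s, 8s schedule matches the one this design uses for PSP retries, and the sleep function is injectable so the example can be exercised without waiting.

```python
import time


class TransientGatewayError(Exception):
    """Hypothetical error type standing in for a 5xx response from a PSP."""


def call_with_backoff(operation, delays=(2, 4, 8), sleep=time.sleep):
    """Run operation, waiting 2s, 4s, 8s between attempts on transient errors."""
    for delay in delays:
        try:
            return operation()
        except TransientGatewayError:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return operation()
```

Retrying a charge is only safe when every attempt carries the same idempotency key, so the gateway can recognize a repeat of an earlier attempt.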
Estimation
Traffic Estimation:
Average QPS: 1k TPS. Peak: 10k TPS.
Webhook QPS: Roughly equal to transaction QPS (1:1 or 1:2 ratio).
Storage Estimation:
Each transaction record: ~1 KB.
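10k TPS * 86,400 s = ~864M transactions/day (an upper bound that assumes peak load all day; the 1k TPS average gives ~86.4M/day).
864M * 1 KB = ~864 GB/day.
1 Year = ~315 TB at sustained peak, or ~31.5 TB at the 1k TPS average. (Requires aggressive sharding/archiving strategy.)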
Bandwidth Estimation:
10k TPS * 1KB = 10 MB/s (Inbound/Outbound). Well within standard 10Gbps networking.
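The arithmetic above can be checked with a few lines of Python. The constants are the figures from this section (decimal units, 1 KB = 1,000 bytes); the average-load line is added for comparison because the storage figures are computed at peak TPS.

```python
SECONDS_PER_DAY = 86_400
PEAK_TPS = 10_000
AVG_TPS = 1_000
RECORD_BYTES = 1_000  # ~1 KB per transaction record


def daily_transactions(tps):
    """Transactions per day at a constant rate."""
    return tps * SECONDS_PER_DAY


def storage_bytes(tps, days):
    """Raw record storage for the given rate and number of days."""
    return daily_transactions(tps) * RECORD_BYTES * days


peak_per_day = daily_transactions(PEAK_TPS)             # 864,000,000 transactions
peak_gb_per_day = storage_bytes(PEAK_TPS, 1) / 1e9      # 864 GB
peak_tb_per_year = storage_bytes(PEAK_TPS, 365) / 1e12  # ~315 TB
avg_tb_per_year = storage_bytes(AVG_TPS, 365) / 1e12    # ~31.5 TB
bandwidth_mb_per_s = PEAK_TPS * RECORD_BYTES / 1e6      # 10 MB/s
```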
Blueprint
Concise Summary: A microservices-based architecture utilizing a sharded RDBMS for strong consistency, a Redis-based idempotency layer, and an asynchronous reconciliation engine via Kafka.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Payment Service: The core orchestrator managing the payment lifecycle and state machine.
Idempotency Store (Redis): Ensures no duplicate requests are processed within a 24-hour window.
Transaction DB (PostgreSQL): Stores the source of truth for all payment states using ACID transactions.
PSP Integration Service: Translates internal requests to gateway-specific protocols (Stripe/Adyen).
Reconciliation Worker: An async consumer that matches internal records with external gateway reports.
Simplicity Audit: This architecture avoids complex distributed transactions (like 2PC) by using a centralized RDBMS for state and asynchronous events for downstream consistency.
Architecture Decision Rationale:
Why this architecture?: RDBMS is chosen for the "Source of Truth" because financial data cannot tolerate the eventual consistency anomalies of NoSQL (e.g., DynamoDB/Cassandra) without complex application-level logic.
Functional Requirement Satisfaction: Covers the full lifecycle from initiation to reconciliation.
Non-functional Requirement Satisfaction: Sharded RDBMS scales horizontally; Kafka ensures decoupling and fault tolerance during high load.
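Horizontal scaling of the sharded RDBMS depends on a stable mapping from merchant to shard. One simple way to get that mapping, shown here only as an assumption-laden sketch, is to hash the merchant ID and take it modulo the shard count; shard_for is a hypothetical helper, and plain modulo routing makes later resharding expensive.

```python
import hashlib


def shard_for(merchant_id, num_shards):
    """Map a merchant ID to a shard index that is stable across processes."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

A cryptographic hash is used instead of Python's built-in hash() because the built-in is randomized per process for strings, which would route the same merchant to different shards on different hosts.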
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Global DNS (Route53) with latency-based routing.
Security & Perimeter:
API Gateway: Provides OAuth2/JWT validation.
Rate Limiting: Tiered limits (e.g., 100 req/sec per Merchant ID) to prevent DDoS and API abuse.
WAF: Protects against SQL Injection and Cross-Site Scripting.
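The per-merchant limit can be pictured as one token bucket per Merchant ID. The sketch below is an in-process illustration only; a real gateway would typically keep this state in a shared store. TokenBucket and MerchantRateLimiter are hypothetical names, and the clock is injectable so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request spends one token."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class MerchantRateLimiter:
    """Keeps one bucket per merchant, e.g. 100 requests/sec each."""

    def __init__(self, rate_per_sec=100, capacity=100, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, merchant_id):
        bucket = self.buckets.get(merchant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_sec, self.capacity, self.clock)
            self.buckets[merchant_id] = bucket
        return bucket.allow()
```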
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS) across 3 Availability Zones. Auto-scaling based on CPU/Request count.
API Schema Design:
POST /v1/payments:
Request:
{ "amount": 100, "currency": "USD", "payment_method": "tok_123", "idempotency_key": "uuid-987" }
Protocol: REST/JSON.
Idempotency: Mandatory header X-Idempotency-Key.
GET /v1/payments/{id}: Returns current state.
Resilience & Reliability:
Circuit Breakers: If Stripe is down, the PSP Integration Service trips the breaker to fail fast and notify the user immediately.
Retries: Exponential backoff (2s, 4s, 8s) for 5xx errors from PSPs.
Storage
Access Pattern: 70% Read (Status checks, Admin dashboard), 30% Write (New transactions, Webhook updates).
Database Table Design:
Table: payments
id: UUID (Primary Key)
merchant_id: UUID (Indexed)
amount: Decimal(19,4)
status: ENUM (Created, Pending, Success, Failed)
idempotency_key: String (Unique constraint with Merchant ID)
version: Integer (For optimistic locking)
Technical Selection: PostgreSQL.
Rationale: ACID compliance, strong consistency, and excellent support for JSONB for storing dynamic gateway metadata.
Distribution Logic: Sharded by merchant_id to ensure all transactions for a single merchant live on the same shard, facilitating faster reporting and aggregate consistency.
Cache
Purpose & Justification: Prevent double processing of the same request payload within a short TTL.
Key-Value Schema:
Key: idempotency:{merchant_id}:{key}
Value: { "status": "processing" | "completed", "response_body": "..." }
Technical Selection: Redis.
Rationale: Sub-millisecond latency and built-in TTL support.
Failure Handling: If Redis is down, the system falls back to a unique constraint check on the Transaction DB (Safe but slower).
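The idempotency check can be sketched as "claim the key atomically, then store the response". To stay runnable without a server, the example uses FakeRedis, a hypothetical in-memory stand-in whose set and get mirror the nx/ex keyword arguments of redis-py's set; handle_payment and the "request already in progress" reply are likewise assumptions for illustration.

```python
import json
import time

TTL_SECONDS = 24 * 60 * 60  # the 24-hour idempotency window


class FakeRedis:
    """In-memory stand-in for the two Redis calls used here (SET NX EX, GET)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def _evict(self, key):
        entry = self._data.get(key)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[key]

    def set(self, key, value, nx=False, ex=None):
        self._evict(key)
        if nx and key in self._data:
            return None  # key already claimed by an earlier request
        expires = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires)
        return True

    def get(self, key):
        self._evict(key)
        entry = self._data.get(key)
        return entry[0] if entry else None


def handle_payment(store, merchant_id, idem_key, charge):
    """Run charge() at most once per (merchant, key) within the TTL."""
    key = f"idempotency:{merchant_id}:{idem_key}"
    claimed = store.set(key, json.dumps({"status": "processing"}), nx=True, ex=TTL_SECONDS)
    if not claimed:
        raw = store.get(key)
        record = json.loads(raw) if raw else {"status": "processing"}
        if record["status"] == "completed":
            return record["response_body"]  # replay the stored response
        return {"error": "request already in progress"}
    response = charge()
    store.set(key, json.dumps({"status": "completed", "response_body": response}), ex=TTL_SECONDS)
    return response
```

If charge() raises, this sketch leaves the key in "processing" until the TTL expires; a fuller version would decide whether to release the key or resolve the payment first.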
Messaging
Purpose & Decoupling: Asynchronously process webhooks and update the ledger/notifications without blocking the main payment thread.
Event / Topic Schema:
payment.status.updated
Payload: { "payment_id": "...", "old_status": "...", "new_status": "..." }
Technical Selection: Kafka.
Rationale: High throughput, persistence for 7 days (allows for replay during recovery), and consumer groups for scaling reconciliation.
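One way to publish payment.status.updated reliably is a transactional outbox: the status change and the event row are written in the same database transaction, and a separate relay forwards unpublished rows to the broker. The sketch below uses sqlite3 in place of PostgreSQL and a plain callback in place of a Kafka producer; the outbox table layout and the function names are assumptions for illustration.

```python
import json
import sqlite3


def open_outbox_db():
    """Create in-memory payments and outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def update_status(conn, payment_id, new_status):
    """Change the status and record the event in one transaction."""
    with conn:
        old = conn.execute(
            "SELECT status FROM payments WHERE id = ?", (payment_id,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE payments SET status = ? WHERE id = ?", (new_status, payment_id)
        )
        event = {"payment_id": payment_id, "old_status": old, "new_status": new_status}
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment.status.updated", json.dumps(event)),
        )


def relay_once(conn, publish):
    """Forward unpublished events in order; return how many were pending."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because the relay publishes before marking a row as sent, a crash between the two steps causes a duplicate publish rather than a lost event, so consumers need to tolerate repeats.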
Data Processing
Processing Model: Hybrid. Webhooks are processed in real-time (Streaming); Daily reconciliation is a Batch job.
Processing DAG:
Kafka -> Flink -> Ledger Update -> Anomaly Detection Sink.
Technical Selection: Apache Flink.
Rationale: Handles out-of-order events (late webhooks) using watermarks, critical for financial accuracy.
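The core of the daily reconciliation job is a comparison between the internal records and the gateway's settlement report. The function below is a plain-Python sketch of that comparison rather than a Flink job; the input shape ({payment_id: (status, amount)}) and the discrepancy labels are assumptions made for the example.

```python
from decimal import Decimal


def reconcile(internal, gateway):
    """Compare two {payment_id: (status, amount)} maps and list discrepancies."""
    issues = []
    for payment_id in sorted(internal.keys() | gateway.keys()):
        ours = internal.get(payment_id)
        theirs = gateway.get(payment_id)
        if theirs is None:
            issues.append((payment_id, "missing_at_gateway"))
        elif ours is None:
            issues.append((payment_id, "missing_internally"))
        elif ours[0] != theirs[0]:
            issues.append((payment_id, "status_mismatch"))
        elif ours[1] != theirs[1]:
            issues.append((payment_id, "amount_mismatch"))
    return issues


internal_records = {
    "p1": ("Success", Decimal("100.00")),
    "p2": ("Pending", Decimal("5.00")),
    "p3": ("Success", Decimal("7.50")),
}
gateway_report = {
    "p1": ("Success", Decimal("100.00")),
    "p2": ("Success", Decimal("5.00")),
    "p4": ("Success", Decimal("1.00")),
}
discrepancies = reconcile(internal_records, gateway_report)
```

Amounts are compared as Decimal values rather than floats so that equal monetary amounts always compare equal.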
Infrastructure (Optional)
Observability:
Metrics: Prometheus tracking payment_success_rate and gateway_latency.
Distributed Tracing: Jaeger/OpenTelemetry to track a payment request across Orchestrator -> PSP Integration -> External Gateway.
Wrap Up
Advanced Topics
Trade-offs: We chose CP (Consistency/Partition Tolerance) over AP. If the DB is unavailable, we stop taking payments rather than risk inconsistent data or double charges.
Reliability: Uses the Outbox Pattern to ensure that if a DB update succeeds, a corresponding event is eventually published to Kafka, even if the service crashes mid-way.
Bottleneck Analysis: The primary bottleneck is the write-lock on the RDBMS. Vertical scaling followed by Citus-style sharding is the mitigation path for 10x growth.
Security: All PII and Credit Card Tokens are encrypted with AES-256-GCM. PII is scrubbed from logs.
Distinguishing Insights: A key "Senior" move is handling the "Ambiguous Timeout". If the gateway doesn't respond, the system marks the payment as PENDING and spawns a background worker to query the gateway status (GET /charge/{id}) before allowing any retry.
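The ambiguous-timeout handling can be sketched as two steps: park the payment as PENDING on timeout, then let a worker ask the gateway what actually happened. Everything about the gateway object here is hypothetical: create_charge and get_charge stand in for a PSP client, GatewayTimeout for its timeout error, and the "succeeded" / "failed" / None outcomes for whatever the real status lookup returns.

```python
class GatewayTimeout(Exception):
    """Hypothetical error raised when the PSP does not answer in time."""


def charge(payment, gateway):
    """Attempt a charge; on timeout, park the payment as PENDING instead of retrying."""
    try:
        outcome = gateway.create_charge(payment["id"], payment["amount"])
    except GatewayTimeout:
        # The charge may or may not have gone through, so do not retry yet.
        payment["status"] = "PENDING"
        return payment
    payment["status"] = "CAPTURED" if outcome == "succeeded" else "FAILED"
    return payment


def resolve_pending(payment, gateway):
    """Background-worker step. Returns True only when a retry is safe."""
    if payment["status"] != "PENDING":
        return False
    outcome = gateway.get_charge(payment["id"])  # stands in for GET /charge/{id}
    if outcome == "succeeded":
        payment["status"] = "CAPTURED"
    elif outcome == "failed":
        payment["status"] = "FAILED"
    # None means the gateway has no record of the charge, so a retry
    # cannot double-charge. Any other outcome leaves the payment PENDING.
    return outcome is None
```

The worker never issues a new charge on its own; it only settles the payment or reports that a retry is safe, which keeps the decision to charge again in one place.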