The Question
Scalable Payment Processing System Design
Design a globally scalable, PCI-compliant payment gateway infrastructure capable of handling thousands of transactions per second. The system must guarantee idempotency, ensure financial integrity through double-entry bookkeeping, and manage asynchronous communications with multiple third-party Payment Service Providers (PSPs).
PostgreSQL
Redis
Kafka
PCI DSS
ACID
mTLS
JWT
AWS RDS
EKS
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the expected Peak QPS for payment initiations? (Assumption: 1,000 QPS for MVP, scaling to 10k+ later).
Payment Methods: Are we supporting just Credit/Debit cards, or also Wallets, Bank Transfers, and Crypto? (Assumption: MVP focuses on Credit/Debit cards via external Gateways).
Geographic Scope: Is this a single-region deployment or do we need to handle cross-border payments and local data residency? (Assumption: Single-region deployment with global gateway connectivity).
Compliance: Is the system required to be fully PCI-DSS compliant, or are we offloading card data handling to a third-party vault? (Assumption: We use a dedicated Tokenization Vault to minimize PCI scope).
Ledger Requirement: Do we need a full double-entry bookkeeping system for every transaction? (Assumption: Yes, for financial integrity and auditability).
Thinking Process
How to handle external gateway failures? Implement a state machine to track payment lifecycles and use a background worker for asynchronous reconciliation.
How to prevent double-charging users? Use mandatory Idempotency Keys at the API Gateway and Service layers.
How to scale the ledger without losing consistency? Use a relational database with strict ACID properties and a double-entry accounting model.
How to minimize PCI-DSS audit scope? Separate the "Card Vault" (handling raw PII) from the "Payment Service" (handling tokens).
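The lifecycle tracking from the first point can be sketched as a small transition table. This is a minimal sketch rather than a definitive implementation: the states mirror the PENDING, SUCCESS, and FAILED statuses used in the API schema, while the names `PaymentStatus`, `ALLOWED_TRANSITIONS`, and `transition` are assumptions introduced for illustration.

```python
from enum import Enum


class PaymentStatus(Enum):
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


# Terminal states have no outgoing transitions, so FAILED can never become SUCCESS.
ALLOWED_TRANSITIONS = {
    PaymentStatus.PENDING: {PaymentStatus.SUCCESS, PaymentStatus.FAILED},
    PaymentStatus.SUCCESS: set(),
    PaymentStatus.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not permitted by the transition table."""


def transition(current: PaymentStatus, target: PaymentStatus) -> PaymentStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target == current:
        # A replayed webhook or reconciliation result is treated as a no-op.
        return current
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

A reconciliation worker would call `transition` with the status reported by the gateway and persist the result only if no exception is raised.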
Bonus Points
Double-Entry Bookkeeping: Designing the ledger such that every movement of money is recorded as a debit in one account and a matching credit in another, so that total debits minus total credits is always zero.
Idempotency Strategy: Implementation of a deterministic "Idempotency Layer" that stores the hash of request payloads to prevent semantic changes on retries.
Poison Pill Handling: Specialized Dead Letter Queue (DLQ) strategies for payments that fail due to logic errors vs. transient gateway timeouts.
Clock Drift Mitigation: For high-frequency ledger updates, using Hybrid Logical Clocks (HLC) or TrueTime to maintain causal ordering across distributed nodes.
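To make the clock drift point concrete, here is a minimal sketch of a Hybrid Logical Clock. It follows the standard HLC update rules (a logical wall-time component plus a counter); the class name, the millisecond resolution, and the injectable `physical_clock` parameter are assumptions made for illustration and testing.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (logical wall time in ms, counter)


def wall_clock_ms() -> int:
    return int(time.time() * 1000)


class HybridLogicalClock:
    def __init__(self, physical_clock: Callable[[], int] = wall_clock_ms):
        self.physical_clock = physical_clock
        self.l = 0  # highest wall-clock time seen so far
        self.c = 0  # counter that orders events sharing the same l

    def now(self) -> Timestamp:
        """Timestamp a local or send event."""
        pt = self.physical_clock()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            # Physical clock stalled or went backwards: advance the counter.
            self.c += 1
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        pt = self.physical_clock()
        remote_l, remote_c = remote
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l and new_l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

Timestamps compare as ordinary tuples, so an event that causally follows another always receives a larger `(l, c)` pair even when a node's physical clock lags behind.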
Design Breakdown
Functional Requirements
Core Use Cases:
Accept and process one-time card payments.
Tokenize sensitive card information via a secure Vault.
Support Webhook notifications for asynchronous status updates from gateways.
Provide a searchable history of transactions for merchants.
Scope Control:
In-scope: Pay-in (Collection), Tokenization, Gateway Integration, Basic Ledger.
Out-of-scope: Subscription billing (recurring), Payouts to bank accounts (Disbursements), Fraud detection engine (MVP will use gateway-level fraud checks).
Non-Functional Requirements
Scale: Support up to 1,000 transactions per second (TPS).
Latency: API response time < 500ms (excluding external gateway calls).
Availability & Reliability: 99.99% uptime; payments must not be lost once "Accepted."
Consistency: Strong consistency for the Ledger; Eventual consistency for the Merchant Dashboard.
Fault Tolerance: Automatic retries for transient gateway failures with exponential backoff.
Security & Privacy: PCI-DSS compliance (Level 1), TLS 1.3 for all data in transit, encryption at rest for sensitive tokens.
Estimation
Traffic Estimation:
1,000 Peak TPS.
Daily: ~20 million transactions (a rounded planning figure; 1,000 TPS sustained over an 8-hour heavy-load window gives an upper bound of 28.8 million).
Storage Estimation:
Payment record: ~1 KB.
1 year of data: 20M × 365 × 1 KB ≈ 7.3 TB.
Bandwidth Estimation:
Incoming: 1,000 × 1 KB = 1 MB/s.
Outgoing (Webhooks): 1,000 × 1 KB = 1 MB/s.
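The estimates above can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 MB = 1,000 KB, 1 TB = 10^9 KB); the variable names are illustrative only.

```python
PEAK_TPS = 1_000
DAILY_TRANSACTIONS = 20_000_000  # planning figure from the traffic estimate
RECORD_SIZE_KB = 1

# Upper bound if the peak rate were sustained for a full 8-hour window.
peak_window_transactions = PEAK_TPS * 8 * 3600  # 28_800_000

# 20M records/day * 365 days * 1 KB, converted from KB to TB.
yearly_storage_tb = DAILY_TRANSACTIONS * 365 * RECORD_SIZE_KB / 10**9  # 7.3

# 1,000 requests/s * 1 KB, converted from KB/s to MB/s.
bandwidth_mb_per_s = PEAK_TPS * RECORD_SIZE_KB / 1_000  # 1.0
```

The same bandwidth figure applies in both directions because the estimate assumes one 1 KB webhook per 1 KB payment request.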
Blueprint
Concise Summary: A microservices-based architecture centered around a Payment Orchestrator that interacts with a secure Vault and multiple external Gateways, backed by a relational Ledger.
Major Components:
API Gateway: Handles authentication, rate limiting, and idempotency checks.
Payment Service: The core orchestrator managing the state machine of a payment request.
Vault Service: A high-security microservice that exchanges raw card data for non-sensitive tokens.
Ledger Service: An ACID-compliant service recording the movement of funds using double-entry principles.
Webhook Service: Handles asynchronous callbacks from external Payment Service Providers (PSPs).
Simplicity Audit: This design avoids complex service meshes and custom consensus algorithms, relying instead on proven RDBMS transactions for financial correctness.
Architecture Decision Rationale:
Why this architecture?: Separation of concerns between PII (Vault), Business Logic (Orchestrator), and Financial Integrity (Ledger) is the industry standard for security and auditability.
Functional Satisfaction: Meets all core flows from ingestion to notification.
Non-functional Satisfaction: Scalable through horizontal service scaling; reliable through state machine and retry logic.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: DNS-based routing (Route53) to the nearest regional API Gateway.
Security & Perimeter:
API Gateway: AWS API Gateway or Kong. Handles JWT validation for Merchant API Keys.
Idempotency: Gateway checks Idempotency-Key header against Redis. If key exists, return the cached response; if not, pass to Payment Service.
WAF: Standard protection against SQLi and XSS.
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS) across 3 Availability Zones. Auto-scaling triggered by CPU and Request Count.
API Schema Design:
POST /v1/payments
Protocol: REST/JSON
Request: { amount, currency, payment_method_token, idempotency_key }
Response: { payment_id, status: PENDING|SUCCESS|FAILED }
Idempotency: Mandatory header.
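The request shape above could be validated at the service boundary roughly as follows. This is a sketch under stated assumptions: the source does not say how `amount` is encoded, so integer minor units (for example cents) and a three-letter currency code are assumed here, and `PaymentRequest`, `ValidationError`, and `parse_payment_request` are hypothetical names.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("amount", "currency", "payment_method_token", "idempotency_key")


class ValidationError(ValueError):
    """Raised when the request body does not match the expected schema."""


@dataclass(frozen=True)
class PaymentRequest:
    amount: int  # assumption: minor units, which avoids float rounding
    currency: str  # assumption: three-letter alphabetic code such as "USD"
    payment_method_token: str
    idempotency_key: str


def parse_payment_request(body: dict) -> PaymentRequest:
    missing = [name for name in REQUIRED_FIELDS if name not in body]
    if missing:
        raise ValidationError("missing fields: " + ", ".join(missing))

    amount = body["amount"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValidationError("amount must be a positive integer")

    currency = body["currency"]
    if not (isinstance(currency, str) and len(currency) == 3 and currency.isalpha()):
        raise ValidationError("currency must be a 3-letter code")

    for name in ("payment_method_token", "idempotency_key"):
        if not isinstance(body[name], str) or not body[name]:
            raise ValidationError(name + " must be a non-empty string")

    return PaymentRequest(
        amount=amount,
        currency=currency.upper(),
        payment_method_token=body["payment_method_token"],
        idempotency_key=body["idempotency_key"],
    )
```

Rejecting malformed requests before any state is written keeps invalid payments out of both the idempotency cache and the ledger.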
Resilience & Reliability:
Circuit Breakers: Applied to PSP Gateway calls to prevent cascading failures if a provider is down.
Retries: Exponential backoff with jitter for HTTP 5xx errors from PSPs.
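The retry policy can be sketched as below, using the "full jitter" variant of exponential backoff. The base delay, cap, attempt count, and the `TransientPSPError` exception are assumptions chosen for illustration; the source only specifies backoff with jitter on HTTP 5xx responses.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientPSPError(Exception):
    """Hypothetical error standing in for an HTTP 5xx response from a PSP."""


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures; max_attempts must be >= 1."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientPSPError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

Retrying a charge is only safe if the PSP call itself is idempotent, so the retried operation would need to carry the same idempotency key on every attempt.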
Storage
Access Pattern: Write-heavy (Payment creation); Read-heavy (Merchant dashboard).
Database Table Design:
Payments Table: id (PK), merchant_id, amount, status, external_txn_id, created_at.
Ledger Entries: id, account_id, debit_amount, credit_amount, txn_type (PAYMENT_CAPTURE).
Technical Selection: PostgreSQL.
Rationale: Strict ACID compliance is mandatory for financial records. Supports complex joins for reporting.
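A double-entry posting inside a single ACID transaction might look like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server. The `payment_id` column, the integer minor-unit amounts, and the account names in the usage note are assumptions added for illustration; the other columns follow the Ledger Entries table above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ledger_entries (
    id            INTEGER PRIMARY KEY,
    payment_id    TEXT    NOT NULL,  -- assumption: links both legs of one movement
    account_id    TEXT    NOT NULL,
    debit_amount  INTEGER NOT NULL DEFAULT 0 CHECK (debit_amount >= 0),
    credit_amount INTEGER NOT NULL DEFAULT 0 CHECK (credit_amount >= 0),
    txn_type      TEXT    NOT NULL
);
"""


def post_capture(conn, payment_id, debit_account, credit_account, amount):
    """Record one movement of money as a debit and a matching credit, atomically."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute(
            "INSERT INTO ledger_entries"
            " (payment_id, account_id, debit_amount, txn_type)"
            " VALUES (?, ?, ?, 'PAYMENT_CAPTURE')",
            (payment_id, debit_account, amount),
        )
        conn.execute(
            "INSERT INTO ledger_entries"
            " (payment_id, account_id, credit_amount, txn_type)"
            " VALUES (?, ?, ?, 'PAYMENT_CAPTURE')",
            (payment_id, credit_account, amount),
        )


def ledger_imbalance(conn):
    """Total debits minus total credits; zero for a consistent ledger."""
    (difference,) = conn.execute(
        "SELECT COALESCE(SUM(debit_amount), 0) - COALESCE(SUM(credit_amount), 0)"
        " FROM ledger_entries"
    ).fetchone()
    return difference
```

For example, after `conn = sqlite3.connect(":memory:")` and `conn.executescript(SCHEMA)`, calling `post_capture(conn, "pay_1", "customer_funds", "merchant_balance", 1500)` writes two rows and `ledger_imbalance(conn)` stays at zero. If the second insert fails, the first is rolled back with it, so the ledger never holds a one-sided entry.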
Distribution Logic: Sharding by merchant_id to handle scale, though 7.3 TB/year fits in a large RDS instance or Aurora cluster with ease.
Cache
Purpose & Justification: Idempotency enforcement and session management.
Key-Value Schema: Key: idempotency:{merchant_id}:{key}, Value: SerializedResponse. TTL: 24 hours.
Technical Selection: Redis (Cluster Mode).
Failure Handling: If Redis is down, the system fails closed: it rejects new payment requests until the cache is restored, so that a retried request cannot be processed twice and double-charge a user.
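The cache behaviour described in this section can be sketched as follows, with an in-memory dictionary standing in for Redis. The key format and 24-hour TTL come from the schema above; storing a SHA-256 hash of the payload next to the response is one way to detect a key being reused with a different request. `InMemoryStore`, `handle`, and the exception names are hypothetical.

```python
import hashlib
import json
import time

TTL_SECONDS = 24 * 60 * 60


class IdempotencyConflict(Exception):
    """The same key was reused with a different request payload."""


class CacheUnavailable(Exception):
    """A real store would raise this when the cache cannot be reached."""


def payload_hash(payload: dict) -> str:
    # Canonical JSON, so key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Stand-in for Redis: maps key -> (expires_at, payload_hash, response)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1], entry[2]

    def put(self, key, digest, response, ttl):
        self._data[key] = (self._clock() + ttl, digest, response)


def handle(store, merchant_id, idempotency_key, payload, process):
    key = f"idempotency:{merchant_id}:{idempotency_key}"
    digest = payload_hash(payload)
    # If the store raises CacheUnavailable, it propagates before process()
    # runs, so the request is rejected rather than risking a second charge.
    cached = store.get(key)
    if cached is not None:
        stored_digest, response = cached
        if stored_digest != digest:
            raise IdempotencyConflict(key)
        return response
    response = process(payload)
    store.put(key, digest, response, TTL_SECONDS)
    return response
```

This sketch checks and writes in two separate steps, which is not safe under concurrency. With a real Redis, the key would need to be reserved atomically before processing, for example with `SET` and its `NX` and `EX` options, so that two simultaneous requests with the same key cannot both reach the processor.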
Messaging
Purpose & Decoupling: Decouples the primary payment flow from secondary actions like notifications and analytics.
Event Schema: PaymentCreated, PaymentSucceeded, PaymentFailed.
Technical Selection: Kafka.
Failure Handling: Dead-letter queues for events that fail to trigger notifications.
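One way to separate logic errors from transient failures before dead-lettering is sketched below, using in-memory queues in place of Kafka topics. With Kafka, the dead-letter queue would typically be a separate topic that the consumer publishes to. The `MAX_ATTEMPTS` limit and all class and function names are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3


class TransientError(Exception):
    """For example, a timeout while delivering a notification."""


class PermanentError(Exception):
    """A poison pill, for example a malformed payload that can never succeed."""


@dataclass
class Event:
    name: str  # PaymentCreated, PaymentSucceeded, or PaymentFailed
    payload: dict
    attempts: int = 0


def consume(event, handler, retry_queue, dead_letters):
    """Run handler; route failures to the retry queue or the dead-letter list."""
    try:
        handler(event)
    except PermanentError as exc:
        # Retrying cannot help, so park the event immediately for inspection.
        dead_letters.append((event, f"permanent: {exc}"))
    except TransientError as exc:
        event.attempts += 1
        if event.attempts >= MAX_ATTEMPTS:
            dead_letters.append((event, f"retries exhausted: {exc}"))
        else:
            retry_queue.append(event)
```

Keeping the failure reason alongside the event lets an operator replay events that exhausted their retries while leaving genuine poison pills aside for a code fix.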
Infrastructure (Optional)
Observability: Prometheus for metrics (Latency, Error Rates), ELK Stack for logs.
Platform Security: mTLS between all internal services. Vault Service has restricted VPC access.
Wrap Up
Advanced Topics
Trade-offs: We chose Consistency over Availability (CP in CAP) for the Ledger. If the DB is down, we cannot process payments, as missing a record of money is worse than temporary downtime.
Reliability: A State Machine in the Payment Service ensures that a transaction cannot move from FAILED back to SUCCESS and handles partial failures gracefully.
Bottleneck Analysis: The External PSP is the primary bottleneck. We use an Adapter Pattern to allow easy switching or load-balancing between Stripe, Adyen, and Braintree.
Security: Card data never touches the Payment Service. The user's browser sends card data directly to the Vault, which returns a token. This limits PCI-DSS scope to only the Vault Service.
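The Adapter Pattern mentioned in the bottleneck analysis can be sketched as a common interface with one implementation per provider. Everything named here is hypothetical: `PSPAdapter`, `FakeAdapter`, and `charge_with_failover` do not correspond to any provider's SDK, and a real adapter would wrap that provider's own API behind the same `charge` method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChargeResult:
    provider: str
    external_txn_id: str
    status: str


class PSPUnavailable(Exception):
    """The provider could not be reached, so no charge was attempted."""


class PSPAdapter(ABC):
    name: str

    @abstractmethod
    def charge(self, token: str, amount: int, currency: str) -> ChargeResult:
        """Translate a charge request into the provider's own API call."""


class FakeAdapter(PSPAdapter):
    """Test double standing in for a real provider integration."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def charge(self, token, amount, currency):
        if not self.healthy:
            raise PSPUnavailable(self.name)
        return ChargeResult(self.name, f"{self.name}-txn-1", "SUCCESS")


def charge_with_failover(adapters, token, amount, currency):
    """Try each adapter in order and return the first successful result."""
    last_error = None
    for adapter in adapters:
        try:
            return adapter.charge(token, amount, currency)
        except PSPUnavailable as exc:
            last_error = exc
    raise PSPUnavailable("all providers unavailable") from last_error
```

Failing over is only safe when the first provider definitely did not process the charge, for example on a refused connection or an open circuit breaker. An ambiguous timeout should go through reconciliation rather than being sent to a second provider.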