The Question
Design: Scalable Payment Gateway Architecture
Design a global-scale payment processing system similar to Stripe. The system must handle millions of transactions per day, ensure absolute idempotency to prevent double-charging, and maintain a high degree of availability. Key challenges include managing PCI compliance, integrating with multiple third-party payment service providers (PSPs), and maintaining a consistent financial ledger. Discuss how you would handle network failures during the critical charge path and your strategy for reconciliation and merchant notifications.
PostgreSQL
Redis
Kafka
OAuth2
TLS 1.3
REST
PCI DSS
Saga Pattern
CDC
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the target scale for the MVP? (e.g., 100M transactions per day, 5,000 peak QPS?)
Payment Methods: Are we supporting only credit cards initially, or do we need immediate support for wallets (Apple Pay) and bank transfers?
Global Presence: Does the system need to handle multi-currency and multi-region regulatory requirements (e.g., PSD2 in Europe, India's 2FA)?
Payouts vs. Pay-ins: Is the focus primarily on accepting payments (Pay-ins) or also on the complex logic of distributing funds to merchants (Payouts)?
Compliance: Does the scope include building a PCI-compliant vault, or are we leveraging a third-party tokenization service (like Spreedly) for the MVP?
Assumptions for MVP:
Scale: 100M transactions/day, ~1,200 avg QPS, 5,000 peak QPS.
Methods: Credit/Debit cards only.
Geography: Global API, but regional processing for low latency.
Compliance: We will use a dedicated "Tokenization Vault" so that raw card data never touches the main system, keeping most of it out of PCI DSS scope.
Focus: Pay-ins (collecting money) and Webhook notifications.
Thinking Process
Core Bottleneck: The "Double Charge" problem. We must ensure financial atomicity and strict idempotency across distributed network calls.
Key Questions:
How do we guarantee we never charge a user twice if the network fails during a request? (Idempotency).
How do we maintain a record of truth when external Bank APIs are eventually consistent or unreliable? (State Machine + Reconciliation).
How do we isolate sensitive PCI data while maintaining high developer velocity? (Tokenization Vault).
How do we handle downstream failures without blocking the user? (Asynchronous Webhooks & Retry Queues).
Bonus Points
Double-Entry Bookkeeping: Implementing a ledger that records every movement of money as a credit and a debit to ensure the sum is always zero (Financial Integrity).
PSP Failover: Implementing a "Circuit Breaker with Fallback" that automatically routes traffic to a secondary Payment Service Provider (PSP) if the primary (e.g., Adyen) experiences high latency or elevated 5xx errors.
Idempotency Fingerprinting: Using a combination of Request Body Hashing + Idempotency Keys to prevent "Key Reuse" attacks where different payloads are sent with the same key.
Shadow Mode Testing: Ability to route a percentage of production traffic to a new PSP adapter to compare results without affecting real money.
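The idempotency fingerprinting idea above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `IdempotencyStore`, `IdempotencyConflict`, and the in-memory dictionary are hypothetical names standing in for whatever store the real system uses, and SHA-256 over canonical JSON is one assumed way to hash the request body.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request payload."""


def fingerprint(body):
    # Canonical JSON (sorted keys, fixed separators) so that logically
    # identical payloads always hash to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """In-memory stand-in (hypothetical) for the real idempotency store."""

    def __init__(self):
        self._seen = {}  # (merchant_id, key) -> (fingerprint, response)

    def check(self, merchant_id, key, body):
        """Return the stored response for a replay, or None for a new request."""
        entry = self._seen.get((merchant_id, key))
        if entry is None:
            return None
        stored_fingerprint, response = entry
        if stored_fingerprint != fingerprint(body):
            # Same key, different payload: reject instead of replaying.
            raise IdempotencyConflict(f"key {key!r} reused with a different body")
        return response

    def save(self, merchant_id, key, body, response):
        self._seen[(merchant_id, key)] = (fingerprint(body), response)
```

Scoping the lookup by merchant as well as key is an assumption of this sketch; it prevents two merchants who happen to pick the same key from colliding.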
Design Breakdown
Functional Requirements
Core Use Cases:
Accept payments via API with an Idempotency Key.
Securely tokenize payment information.
Track payment status (Pending, Succeeded, Failed, Refunded).
Notify merchants of status changes via Webhooks.
Scope Control:
In-Scope: Pay-ins, Webhooks, Basic Ledgering, Tokenization.
Out-of-Scope: Currency conversion (FX), Dispute management (Chargebacks), complex Merchant Onboarding (KYC/KYB).
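The payment statuses listed under Core Use Cases lend themselves to an explicit state machine. The sketch below is illustrative only: the transition table is an assumption (for example, that only a succeeded payment can be refunded), and `PaymentStatus` and `transition` are hypothetical names.

```python
from enum import Enum


class PaymentStatus(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    REFUNDED = "REFUNDED"


# Assumed transition table: terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    PaymentStatus.PENDING: {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED},
    PaymentStatus.SUCCEEDED: {PaymentStatus.REFUNDED},
    PaymentStatus.FAILED: set(),
    PaymentStatus.REFUNDED: set(),
}


def transition(current, target):
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting illegal moves at this layer means a late or duplicated PSP callback cannot, for instance, flip a FAILED payment back to SUCCEEDED.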
Non-Functional Requirements
Scale: Handle 100M+ transactions daily.
Latency: P99 < 500ms for the "Charge" API (including external PSP call).
Availability: 99.999% (High availability is critical for financial services).
Consistency: Strong consistency for the Ledger and Transaction Status.
Fault Tolerance: Automatic retries for downstream PSP timeouts.
Security: PCI DSS compliance, TLS 1.3, and strict IAM.
Estimation
Traffic: 100M transactions/day / 86,400s ≈ 1,157 TPS (avg). Peak at 5x ≈ 5,800 TPS.
Storage: Each transaction record ≈ 2KB. 100M * 2KB = 200GB/day, or about 73TB/year.
Bandwidth: 5,800 TPS * 2KB ≈ 11.6 MB/s, well within standard 10Gbps networking.
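The back-of-the-envelope arithmetic can be reproduced in a few lines. The inputs are the ones stated in the estimation (100M transactions per day, 2KB per record, a 5x peak factor); decimal units (1KB = 1,000 bytes) are an assumption of this sketch.

```python
SECONDS_PER_DAY = 86_400
TX_PER_DAY = 100_000_000
RECORD_BYTES = 2_000      # 2KB per transaction record (decimal units assumed)
PEAK_FACTOR = 5

avg_tps = TX_PER_DAY / SECONDS_PER_DAY
peak_tps = avg_tps * PEAK_FACTOR
storage_gb_per_day = TX_PER_DAY * RECORD_BYTES / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1000
peak_mb_per_s = peak_tps * RECORD_BYTES / 1e6

print(f"avg TPS:       {avg_tps:,.0f}")
print(f"peak TPS:      {peak_tps:,.0f}")
print(f"storage/day:   {storage_gb_per_day:,.0f} GB")
print(f"storage/year:  {storage_tb_per_year:,.0f} TB")
print(f"peak payload:  {peak_mb_per_s:.1f} MB/s")
```

The peak works out to roughly 5,790 TPS and about 11.6 MB/s of payload, which is a small fraction of a 10Gbps link.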
Blueprint
The architecture centers on a state-machine-driven Payment Service that interacts with external PSPs. It uses a Tokenization Vault to minimize PCI scope and a double-entry Ledger Service to ensure financial correctness.
API Gateway: Handles authentication, rate limiting, and request routing.
Tokenization Vault: Encrypts and stores raw Credit Card data, returning a "Token" used by all other services.
Payment Service: The orchestrator that manages the lifecycle of a payment.
PSP Adapter: A strategy-pattern based component that talks to external providers (Stripe, Adyen, Braintree).
Ledger Service: An immutable record of all financial movements.
Messaging (Kafka): Decouples the core payment flow from asynchronous notifications (Webhooks).
Simplicity Audit: We avoid complex microservices for every sub-domain. Instead, we use a modular monolith or small set of services focused on the "Transaction" and "Ledger" as the two source-of-truth pillars.
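To make the Ledger pillar concrete, here is a minimal sketch of an append-only, double-entry ledger. The account names, the sign convention (positive for debit, negative for credit), and the `Ledger` and `Entry` classes are all assumptions for illustration; a real ledger would live in the database rather than in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    amount: int  # minor units (cents); positive = debit, negative = credit


class Ledger:
    """Append-only in-memory ledger (hypothetical stand-in for the Ledger Service)."""

    def __init__(self):
        self._entries = []  # list of (transaction_id, Entry); never mutated in place

    def post(self, transaction_id, entries):
        # Double-entry invariant: debits and credits of one transaction cancel out.
        if sum(e.amount for e in entries) != 0:
            raise ValueError(f"unbalanced transaction {transaction_id}")
        self._entries.extend((transaction_id, e) for e in entries)

    def balance(self, account):
        return sum(e.amount for _, e in self._entries if e.account == account)

    def total(self):
        # Should always be zero if every posted transaction was balanced.
        return sum(e.amount for _, e in self._entries)
```

A 10.00 USD charge with an assumed 0.30 USD fee could then be posted as a 1000 debit to a PSP receivable account, a 970 credit to the merchant's payable account, and a 30 credit to fee revenue.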
Architecture Decision Rationale:
Why RDBMS?: Payment data requires ACID properties. PostgreSQL is a mature, widely used choice for transactional financial workloads.
Why Idempotency?: Network failures are inevitable. Idempotency keys allow clients to retry safely.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: DNS-based global load balancing (GSLB) to route requests to the nearest region.
Security & Perimeter:
API Gateway: Implements OAuth2/API Key validation.
Rate Limiting: Tiered limits (e.g., 100 requests/sec per merchant).
WAF: Protects against SQL injection and common web attacks.
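Per-merchant rate limiting is commonly implemented as a token bucket. The sketch below assumes that algorithm (the text does not name one) and uses an injectable clock so it can be tested deterministically; `TokenBucket` is a hypothetical name, and a production gateway would keep this state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._now = now
        self._tokens = float(capacity)
        self._last = now()

    def allow(self, cost=1.0):
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# One bucket per merchant, e.g. 100 requests/sec with a burst of 100.
buckets = {}


def allow_request(merchant_id):
    bucket = buckets.setdefault(merchant_id, TokenBucket(rate=100, capacity=100))
    return bucket.allow()
```

Tiered limits would then amount to choosing `rate` and `capacity` per merchant plan.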
Service
Topology: Stateless services deployed in multiple Availability Zones (AZs).
Payment API Schema:
Endpoint: POST /v1/charges
Protocol: REST/JSON.
Request:
{ "amount": 1000, "currency": "USD", "token": "tok_123", "idempotency_key": "uuid-456" }
Idempotency: Mandated for all POST requests.
Resilience:
Retries: Exponential backoff with jitter for PSP 5xx errors.
Circuit Breaker: Trips if PSP latency > 2s to prevent resource exhaustion.
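The two resilience mechanisms can be sketched together. This is an illustration under assumptions: the "full jitter" variant of exponential backoff, a consecutive-failure threshold of 5, and a 30-second cool-down are choices made here, not values from the design; only the 2s latency limit comes from the text. `backoff_delay` and `CircuitBreaker` are hypothetical names.

```python
import random
import time


def backoff_delay(attempt, base=0.2, cap=5.0, rng=random.random):
    """Full-jitter delay in seconds for a zero-based retry attempt."""
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    def __init__(self, failure_threshold=5, latency_limit=2.0,
                 reset_after=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.latency_limit = latency_limit
        self.reset_after = reset_after
        self._now = now
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Half-open: let a trial call through once the cool-down has elapsed.
        return self._now() - self._opened_at >= self.reset_after

    def record(self, ok, latency):
        # A call slower than the latency limit counts as a failure.
        if ok and latency <= self.latency_limit:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._now()
```

Retrying a charge is only safe if the call to the PSP is itself idempotent, so the retry loop would need to send the same idempotency key on every attempt.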
Storage
Access Pattern: Write-heavy on the payment path (roughly 70% writes), with about 30% reads from dashboards and reporting.
Database Table Design (Payment DB):
id (UUID, PK)
merchant_id (UUID, Index)
amount (BigInt, store in cents)
status (Enum: PENDING, SUCCESS, FAILED)
idempotency_key (String, Unique Index)
token_id (String)
Technical Selection: PostgreSQL with Partitioning by created_at or merchant_id.
Distribution: Single-leader replication (Semi-sync) for data durability.
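The table design can be exercised with a small runnable sketch. It uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL, so the column types are approximations (TEXT for UUID, INTEGER for BigInt) and partitioning and replication are not modelled. `insert_payment` is a hypothetical helper showing how the unique index on `idempotency_key` acts as the durable guard against duplicates.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id              TEXT PRIMARY KEY,            -- UUID
        merchant_id     TEXT NOT NULL,               -- UUID
        amount          INTEGER NOT NULL CHECK (amount > 0),  -- cents
        status          TEXT NOT NULL DEFAULT 'PENDING',
        idempotency_key TEXT NOT NULL UNIQUE,
        token_id        TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_payments_merchant ON payments (merchant_id)")


def insert_payment(conn, merchant_id, amount, idempotency_key, token_id):
    """Insert a PENDING payment; return False if the key was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (id, merchant_id, amount, idempotency_key, token_id) "
                "VALUES (?, ?, ?, ?, ?)",
                (str(uuid.uuid4()), merchant_id, amount, idempotency_key, token_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

In PostgreSQL, a unique constraint on a partitioned table has to include the partition key columns, so the partitioning choice and the idempotency index need to be designed together.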
Cache
Purpose: Strict idempotency check to prevent duplicate charges.
Key-Value Schema:
idempotency_key:{key} -> {response_body}.
TTL: 24 hours. After 24h, the DB serves as the fallback for idempotency checks.
Technical Selection: Redis (Cluster mode) for sub-millisecond lookups.
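The cache-side idempotency check can be sketched as a "claim, then fill" flow. To stay self-contained, the example uses an in-memory `TTLStore` (hypothetical) whose `set_if_absent` mimics the semantics of Redis `SET key value NX EX ttl`; the in-progress marker and the 409 response for concurrent duplicates are assumptions of this sketch.

```python
import time


class TTLStore:
    """In-memory stand-in for Redis with per-key expiry."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._now() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._now() + ttl)

    def set_if_absent(self, key, value, ttl):
        # Mimics SET ... NX EX: only the first caller claims the key.
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl)
        return True


IN_PROGRESS = "__in_progress__"
TTL_SECONDS = 24 * 3600


def handle_charge(store, idempotency_key, charge_fn):
    cache_key = f"idempotency_key:{idempotency_key}"
    if store.set_if_absent(cache_key, IN_PROGRESS, TTL_SECONDS):
        response = charge_fn()  # first request: actually perform the charge
        store.set(cache_key, response, TTL_SECONDS)
        return response
    cached = store.get(cache_key)
    if cached is None or cached == IN_PROGRESS:
        # Another request holds the key (or it just expired): ask the client to retry.
        return {"status": 409, "error": "request in progress"}
    return cached  # replay the stored response
```

This sketch leaves the marker in place if `charge_fn` raises; a real flow would need an explicit policy for such stuck keys, since releasing the key after an ambiguous PSP timeout could allow a double charge.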
Messaging
Purpose: Asynchronous delivery of Webhooks and Analytics events.
Event Schema:
payment.succeeded, payment.failed. Includes charge_id and timestamp.
Failure Handling: Failed webhooks are retried (up to 12 attempts over 3 days) and then moved to a dead-letter queue (DLQ).
Technical Selection: Kafka for high throughput and durability.
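One way to arrive at a "12 attempts over roughly 3 days" webhook policy is a doubling delay. The 2-minute base delay and the factor of 2 below are assumptions chosen so that the schedule spans just under 3 days; the text does not specify the actual intervals, and `retry_delays` and `next_step` are hypothetical names.

```python
def retry_delays(max_attempts=12, base_seconds=120, factor=2):
    """Delays in seconds before attempts 2..max_attempts (attempt 1 is immediate)."""
    return [base_seconds * factor ** i for i in range(max_attempts - 1)]


def next_step(failed_attempts, delays):
    """After `failed_attempts` failures (>= 1), return the delay before the
    next attempt, or None when the event should go to the dead-letter queue."""
    if failed_attempts > len(delays):
        return None
    return delays[failed_attempts - 1]


delays = retry_delays()
total_days = sum(delays) / 86_400
print(f"{len(delays) + 1} attempts over {total_days:.2f} days")
```

With these assumed values the last wait is about 34 hours, and the twelfth failure sends the event to the DLQ for manual inspection or replay.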
Data Processing
Processing Model: Stream processing for real-time reconciliation.
Processing DAG: Consumes events from Kafka -> compares with Ledger -> flags discrepancies to an Admin Dashboard.
Technical Selection: Flink or simple Kafka Streams. Justification: necessary for ensuring "Ledger vs PSP" reconciliation.
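The core comparison in the reconciliation step can be sketched independently of the streaming framework. The example assumes both sides can be reduced to a mapping from charge ID to amount in minor units; `reconcile` and the discrepancy labels are hypothetical.

```python
def reconcile(ledger, psp_report):
    """Compare two {charge_id: amount_in_cents} mappings.

    Returns a list of (charge_id, kind, ledger_amount, psp_amount) tuples,
    sorted by charge_id, for every record that does not match.
    """
    discrepancies = []
    for charge_id in sorted(ledger.keys() | psp_report.keys()):
        ours = ledger.get(charge_id)
        theirs = psp_report.get(charge_id)
        if ours is None:
            discrepancies.append((charge_id, "missing_in_ledger", ours, theirs))
        elif theirs is None:
            discrepancies.append((charge_id, "missing_at_psp", ours, theirs))
        elif ours != theirs:
            discrepancies.append((charge_id, "amount_mismatch", ours, theirs))
    return discrepancies
```

In a streaming job the same comparison would run per key over a time window, with the flagged tuples forwarded to the Admin Dashboard.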
Wrap Up
Advanced Topics
Consistency vs. Availability (PACELC): In a partition, we choose Consistency (PC/EC). We cannot risk "phantom" money or double charging. If the DB is down, we return 503.
Reliability: We use a Transactional Outbox Pattern. The Payment Service writes to the DB and an "Outbox" table in the same transaction. A separate process reads the outbox and pushes to Kafka to ensure atomicity between DB and Messaging.
Bottleneck Analysis: The external PSP is the slowest link. We use asynchronous processing where possible, but the initial authorization must be synchronous to give the user immediate feedback.
Security: The Tokenization Vault is physically/logically isolated. It only exposes two endpoints: /tokenize and /detokenize (the latter restricted to the PSP Adapter).
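The Transactional Outbox Pattern described under Reliability can be sketched with two tables and a relay. As an assumption for runnability, `sqlite3` stands in for PostgreSQL and a plain `publish` callback stands in for the Kafka producer; the table layout and the function names `record_result` and `relay_once` are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq        INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        published  INTEGER NOT NULL DEFAULT 0
    );
""")


def record_result(conn, charge_id, status, event_type):
    # Both rows are written in ONE transaction: either the payment and its
    # event are both persisted, or neither is.
    with conn:
        conn.execute("INSERT INTO payments (id, status) VALUES (?, ?)",
                     (charge_id, status))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     (event_type, json.dumps({"charge_id": charge_id})))


def relay_once(conn, publish):
    """Publish unpublished events in order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq").fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because a crash can occur after publishing but before the row is marked, this relay gives at-least-once delivery, so consumers such as the webhook sender should deduplicate on the charge ID and event type.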