The Question
Design: Scalable Payment Orchestration & Ledger System
Design a globally distributed payment system similar to Stripe. The system must support high-volume transaction processing, ensure strict idempotency to prevent double-charging, and maintain an immutable ledger for financial auditing. Focus on the end-to-end lifecycle of a payment, including interaction with external Payment Service Providers (PSPs), handling asynchronous webhooks, and ensuring system-wide consistency and PCI-DSS compliance constraints.
PostgreSQL
Redis
Kafka
PCI-DSS
Saga Pattern
TLS 1.3
OpenTelemetry
Stripe
Adyen
Docker
Kubernetes
Questions & Insights
Clarifying Questions
Scale and Throughput: What is the target peak Transactions Per Second (TPS)? For an MVP, are we targeting 100 TPS or 10,000 TPS?
Payment Methods: Should we support multiple Payment Service Providers (PSPs) (e.g., Adyen, Braintree) or a single gateway for the MVP?
Consistency Requirements: Is strict serializability required for the internal ledger, or is eventual consistency acceptable for reporting?
Geographic Scope: Is this a single-region deployment or do we need to handle cross-border payments and currency conversions immediately?
Compliance: Is the system expected to be fully PCI-DSS compliant, or will we use a vaulting service/third-party elements to minimize scope?
Assumptions for MVP:
Scale: 1,000 Peak TPS.
Scope: Support Card payments via external PSPs (e.g., Stripe, Adyen).
Consistency: Strong ACID compliance for the internal Ledger and Transaction state.
Idempotency: Mandatory 24-hour idempotency window for all API calls to prevent double charging.
Thinking Process
Core Bottleneck: Preventing double-charging in a distributed system under network instability.
Step 1: Implement a robust Idempotency Layer at the API Gateway/Service entry point using a distributed cache.
Step 2: Design a State Machine for payment processing (Initiated -> Pending -> Succeeded/Failed) to handle asynchronous callbacks from external PSPs.
Step 3: Establish a Double-Entry Ledger as the single source of truth for all money movements to ensure auditability and data integrity.
Step 4: Utilize an Asynchronous Event Pipeline for non-critical path actions like email receipts, webhooks, and analytics.
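The state machine from Step 2 can be sketched as a transition table. This is a minimal illustration, not a definitive implementation: the names `PaymentStatus`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and allowing Initiated to move directly to Failed (for example on a validation error) is an assumption not stated above.

```python
from enum import Enum


class PaymentStatus(Enum):
    INITIATED = "initiated"
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions. Terminal states have no outgoing edges.
# INITIATED -> FAILED is an assumption (e.g. request rejected before reaching the PSP).
ALLOWED_TRANSITIONS = {
    PaymentStatus.INITIATED: {PaymentStatus.PENDING, PaymentStatus.FAILED},
    PaymentStatus.PENDING: {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED},
    PaymentStatus.SUCCEEDED: set(),
    PaymentStatus.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a callback asks for a transition the table does not allow."""


def transition(current: PaymentStatus, target: PaymentStatus) -> PaymentStatus:
    """Return the new status, treating a repeated callback as a no-op."""
    if current == target:
        return current  # duplicate webhook delivery: nothing to do
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Treating a repeated status as a no-op lets the same PSP callback be delivered more than once without corrupting state, while a late "failed" callback for an already succeeded payment is rejected rather than silently applied.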
Bonus Points
Deterministic Idempotency Keys: Implementing client-generated UUIDs paired with request fingerprinting to prevent key collision or malicious reuse.
Saga Pattern for Distributed Transactions: Using an orchestration-based Saga to manage the lifecycle between the Payment Service, Ledger, and external PSPs without blocking.
PSP Smart Routing: For high availability, implementing a routing engine that can failover between PSPs if one gateway returns a 5xx or latency spikes.
Zero-Downtime Database Migrations: Using an "Expand and Contract" pattern for ledger schema changes so that migrations do not eat into the 99.999% availability target.
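As an illustration of the idempotency-key bonus point, the sketch below pairs a client-supplied key with a fingerprint of the request body. It is a simplified, single-process stand-in for the distributed cache: `IdempotencyStore`, `begin`, `complete`, and the returned status strings are hypothetical names, the 24-hour expiry is omitted, and a real deployment would need an atomic set-if-absent (for example Redis `SET` with the `NX` option) instead of a Python dict.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Hash a canonical JSON form of the request body (key order does not matter)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """In-memory stand-in for the distributed cache (hypothetical interface)."""

    def __init__(self):
        self._entries = {}

    def begin(self, merchant_id: str, key: str, payload: dict):
        """Return (outcome, cached_response).

        outcome is one of: "new", "in_progress", "replay", "mismatch".
        """
        cache_key = f"idempotency_key:{merchant_id}:{key}"
        fp = fingerprint(payload)
        entry = self._entries.get(cache_key)
        if entry is None:
            # First time this key is seen: claim it before doing any work.
            self._entries[cache_key] = {
                "fingerprint": fp,
                "status": "processing",
                "response": None,
            }
            return ("new", None)
        if entry["fingerprint"] != fp:
            # Same key, different body: reject instead of replaying.
            return ("mismatch", None)
        if entry["status"] == "processing":
            return ("in_progress", None)
        return ("replay", entry["response"])

    def complete(self, merchant_id: str, key: str, response: dict) -> None:
        """Store the final response so later retries can be answered from cache."""
        entry = self._entries[f"idempotency_key:{merchant_id}:{key}"]
        entry["status"] = "completed"
        entry["response"] = response
```

A caller would map "in_progress" to a wait or a conflict response, "replay" to the cached response, and "mismatch" to a client error, since a reused key with a different body is either a bug or an attempt at misuse.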
Design Breakdown
Functional Requirements
Core Use Cases:
Accept payments via Credit/Debit cards.
Track payment status and history.
Process payouts/refunds.
Maintain a ledger of all transactions.
Scope Control:
In-Scope: API-driven payment processing, Idempotency handling, Ledgering, Webhooks.
Out-of-Scope: Physical POS terminals, Fraud detection engine (will use PSP's basic fraud check), Tax calculation services.
Non-Functional Requirements
Scale: Support up to 10M transactions per day.
Latency: API response time < 500ms for payment initiation.
Availability: 99.999% (High availability is critical for financial services).
Consistency: Strong consistency for Ledger and Transaction states (ACID).
Fault Tolerance: Automatic retries for transient PSP failures; DLQs for failed webhooks.
Security: PCI-DSS compliance (Tokenization), TLS 1.3 encryption, and API Key authentication.
Estimation
Traffic Estimation:
10M transactions/day ≈ 116 average TPS.
Peak (5x avg) ≈ 580 TPS, rounded up to 600 TPS.
Storage Estimation:
1 Payment Record + Ledger Entries ≈ 2KB.
10M records/day * 2KB = 20GB/day.
3 Years Retention ≈ 22TB.
Bandwidth Estimation:
Incoming: 600 TPS * 5KB request ≈ 3MB/s.
Outgoing: 600 TPS * 5KB response ≈ 3MB/s.
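The arithmetic behind these estimates can be reproduced in a few lines. The variable names are illustrative, and decimal units (1 GB = 1,000,000 KB) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = 10_000_000
peak_multiplier = 5
record_size_kb = 2        # one payment record plus its ledger entries
payload_size_kb = 5       # request or response size
retention_days = 3 * 365
rounded_peak_tps = 600    # the rounded peak figure used in the estimates

average_tps = transactions_per_day / SECONDS_PER_DAY                  # ~115.7
peak_tps = average_tps * peak_multiplier                              # ~579
daily_storage_gb = transactions_per_day * record_size_kb / 1_000_000  # 20.0
retention_storage_tb = daily_storage_gb * retention_days / 1_000      # ~21.9
peak_bandwidth_mb_s = rounded_peak_tps * payload_size_kb / 1_000      # 3.0 in each direction

print(f"average TPS: {average_tps:.1f}")
print(f"peak TPS: {peak_tps:.0f}")
print(f"storage per day: {daily_storage_gb:.0f} GB")
print(f"storage for 3 years: {retention_storage_tb:.1f} TB")
print(f"peak bandwidth: {peak_bandwidth_mb_s:.0f} MB/s each way")
```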
Blueprint
Concise Summary: A microservices-based architecture centered around a Payment Orchestrator that manages state transitions between a distributed Idempotency store, a Relational DB for transaction metadata, and a Double-entry Ledger.
Major Components:
Payment Service: Orchestrates the payment lifecycle and enforces idempotency.
Ledger Service: An immutable, append-only record of all financial movements.
PSP Gateway: An abstraction layer to communicate with external providers like Stripe/Adyen.
Webhook Manager: Handles asynchronous success/failure signals from external providers.
Simplicity Audit: This design avoids complex distributed locks by relying on database-level transactions (ACID) and a centralized idempotency store, which keeps operational overhead low for an MVP without weakening correctness.
Architecture Decision Rationale:
Why this architecture? Relational databases (Postgres) provide the ACID guarantees necessary for financial data.
Functional Satisfaction: Covers the entire lifecycle from payment initiation to final settlement.
Non-functional Satisfaction: High availability via stateless services and horizontal scaling; Consistency via RDBMS transactions.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: DNS-based global load balancing (Route 53) directing traffic to the nearest healthy region.
Security & Perimeter:
API Gateway: Handles AuthN/AuthZ via Merchant API Keys.
Rate Limiting: Tiered limits (e.g., 100 requests/sec per merchant) to prevent DoS.
SSL Termination: Enforced TLS 1.3 for all traffic.
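One way to picture the per-merchant rate limit is a token bucket per merchant. The text does not name an algorithm, so the token bucket is an assumption, as are the class names and the tier table. The sketch is single-process; a gateway running on several nodes would need shared state.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class MerchantRateLimiter:
    """Keeps one bucket per merchant, sized by the merchant's tier (hypothetical tiers)."""

    def __init__(self, tier_limits: Dict[str, float]):
        self.tier_limits = tier_limits
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, merchant_id: str, tier: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(merchant_id)
        if bucket is None:
            limit = self.tier_limits[tier]
            bucket = TokenBucket(rate=limit, capacity=limit, now=now)
            self.buckets[merchant_id] = bucket
        return bucket.allow(now)
```

With `MerchantRateLimiter({"standard": 100})`, a merchant can send 100 requests immediately and then roughly 100 per second afterwards, while other merchants are unaffected. The optional `now` argument exists only to make the behavior testable without sleeping.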
Service
Topology & Scaling: Stateless Payment and Ledger services deployed in Multi-AZ EKS clusters. Scaling based on CPU and Request Count.
API Schema Design:
POST /v1/payments:
Request: {amount, currency, source_token, idempotency_key}
Response: {payment_id, status: "pending"}
Idempotency: Returns the existing resource if the key is reused within 24h.
Resilience & Reliability:
Retries: Exponential backoff for 5xx errors from PSPs.
Circuit Breaker: Trips if a specific PSP has a failure rate > 50%.
Observability:
Metrics: Transaction Success Rate (TSR), P99 Latency per PSP.
Tracing: Distributed tracing via OpenTelemetry to track a payment across services.
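The retry and circuit-breaker behavior described under Resilience & Reliability can be sketched as follows. Everything here is illustrative: `TransientPSPError` stands in for a 5xx or timeout, the jittered delays, window size, and minimum call count are assumed values, and the breaker omits the half-open recovery phase that a production breaker would need.

```python
import random
import time
from collections import deque


class TransientPSPError(Exception):
    """Hypothetical stand-in for a 5xx response or timeout from a PSP."""


def backoff_delays(max_attempts, base=0.2, cap=5.0, rng=random.random):
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for retry in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** retry))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Call `operation`, retrying only on transient PSP errors."""
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return operation()
        except TransientPSPError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)


class CircuitBreaker:
    """Opens when more than `threshold` of the recent calls to one PSP failed."""

    def __init__(self, window=20, threshold=0.5, min_calls=10):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def is_open(self) -> bool:
        if len(self.outcomes) < self.min_calls:
            return False  # not enough data to judge the PSP
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.threshold
```

Retrying a charge is only safe if the PSP call itself carries an idempotency key, otherwise a retry after a timeout could charge the card twice.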
Storage
Access Pattern: Heavy write (new transactions) and heavy point-reads (status checks).
Database Table Design:
Payments Table:
id (PK), merchant_id, amount, status (enum), psp_reference, idempotency_key (unique per merchant_id).
Ledger Table:
id (PK), payment_id, account_id, type (debit/credit), amount, timestamp.
Technical Selection: PostgreSQL.
Rationale: Robust support for ACID, JSONB for flexible PSP response storage, and excellent tooling for replication.
Distribution Logic: Sharding by merchant_id to ensure related transactions stay on the same node while allowing horizontal growth.
Cache
Purpose & Justification: Idempotency enforcement. To prevent race conditions where two identical requests arrive simultaneously, we check/set the idempotency key in a fast cache.
Key-Value Schema:
key: idempotency_key:{merchant_id}:{uuid}, value: {status, response_body}.
Technical Selection: Redis with AOF (Append Only File) enabled for durability.
Failure Handling: If Redis is down, the Payment Service falls back to the database idempotency_key unique constraint (slower but safe).
Messaging
Purpose & Decoupling: Decouples payment execution from downstream side-effects (Notifications, Business Intelligence).
Event / Topic Schema:
payment.succeeded, payment.failed. Payload contains payment_id, merchant_id, and timestamp.
Technical Selection: Kafka.
Rationale: High throughput, long retention for replayability if downstream services fail.
Failure Handling: Dead-letter queues (DLQ) for messages that fail processing after 5 retries.
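The dead-letter rule above can be illustrated with an in-memory consumer loop. This does not use any Kafka client API. `Event`, `consume`, and `MAX_RETRIES` are hypothetical names, and "5 retries" is read as one initial attempt plus five retries.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_RETRIES = 5


@dataclass
class Event:
    topic: str        # e.g. "payment.succeeded" or "payment.failed"
    payment_id: str
    merchant_id: str
    timestamp: str
    attempts: int = 0  # failed processing attempts so far


def consume(events: List[Event], handler: Callable[[Event], None]) -> List[Event]:
    """Process each event; return the events routed to the dead-letter queue."""
    dead_letters: List[Event] = []
    for event in events:
        while True:
            try:
                handler(event)
                break  # processed successfully
            except Exception:
                event.attempts += 1
                if event.attempts > MAX_RETRIES:
                    # Initial attempt plus MAX_RETRIES retries all failed.
                    dead_letters.append(event)
                    break
    return dead_letters
```

A real consumer would back off between retries and publish dead letters to a separate topic so that one poisoned message does not block the rest of the partition.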
Wrap Up
Advanced Topics
Trade-offs: We chose Consistency over Availability (CP in CAP terms, PC/EC in the PACELC model). If the database is unavailable, we refuse to process payments rather than risk inconsistent ledger states or double-charging.
Idempotency Strategy: We implement "Execute-Once" semantics. The first request creates a "Processing" record in Redis; subsequent requests with the same key wait or receive a "Conflict" response until the first one completes.
Reliability & Failure Handling:
PSP Timeouts: If a PSP call times out, the system marks the payment as pending and uses a background reconciliation worker to query the PSP's status later.
Security & Privacy:
Tokenization: We never store raw Credit Card numbers. We store "Source Tokens" provided by PSP-hosted fields (e.g., Stripe Elements).
Distinguishing Insights:
Reconciliation Worker: A critical background process that compares the internal Ledger against the daily CSV exports from PSPs to find discrepancies (the "missing" 0.01% of transactions).
Double-Entry Integrity: Every "Pay-in" must result in a Credit to the Merchant account and an equal Debit to the System Clearing account, so that total debits always equal total credits.
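A minimal sketch of the double-entry rule, assuming amounts are stored as integers in minor units (for example cents) and that entries are posted in balanced groups. `Ledger`, `LedgerEntry`, `record_pay_in`, and the account identifier `system_clearing` are hypothetical names. The fields mirror the Ledger Table columns, minus id and timestamp.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    payment_id: str
    account_id: str
    type: str    # "debit" or "credit"
    amount: int  # minor units, always positive


class Ledger:
    """Append-only ledger that rejects unbalanced groups of entries."""

    def __init__(self):
        self._entries: List[LedgerEntry] = []

    def post(self, entries: List[LedgerEntry]) -> None:
        """Append the entries only if total debits equal total credits."""
        for entry in entries:
            if entry.type not in ("debit", "credit"):
                raise ValueError(f"unknown entry type: {entry.type}")
            if entry.amount <= 0:
                raise ValueError("amount must be positive")
        debits = sum(e.amount for e in entries if e.type == "debit")
        credits = sum(e.amount for e in entries if e.type == "credit")
        if debits == 0 or debits != credits:
            raise ValueError("entries do not balance")
        self._entries.extend(entries)

    def entries(self) -> Tuple[LedgerEntry, ...]:
        """Read-only view; there is deliberately no update or delete."""
        return tuple(self._entries)


def record_pay_in(ledger: Ledger, payment_id: str, merchant_account: str, amount: int) -> None:
    """Debit the system clearing account and credit the merchant by the same amount."""
    ledger.post([
        LedgerEntry(payment_id, "system_clearing", "debit", amount),
        LedgerEntry(payment_id, merchant_account, "credit", amount),
    ])
```

In the PostgreSQL design the same check would sit inside a single database transaction, so that a pay-in either writes both rows or neither.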