The Question
Design

Scalable Payment Orchestration & Ledger System

Design a globally distributed payment system similar to Stripe. The system must support high-volume transaction processing, ensure strict idempotency to prevent double-charging, and maintain an immutable ledger for financial auditing. Focus on the end-to-end lifecycle of a payment, including interaction with external Payment Service Providers (PSPs), handling asynchronous webhooks, and ensuring system-wide consistency and PCI-DSS compliance constraints.
PostgreSQL
Redis
Kafka
PCI-DSS
Saga Pattern
TLS 1.3
OpenTelemetry
Stripe
Adyen
Docker
Kubernetes
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the target peak Transactions Per Second (TPS)? For an MVP, are we targeting 100 TPS or 10,000 TPS?
Payment Methods: Should we support multiple Payment Service Providers (PSPs) (e.g., Adyen, Braintree) or a single gateway for the MVP?
Consistency Requirements: Is strict serializability required for the internal ledger, or is eventual consistency acceptable for reporting?
Geographic Scope: Is this a single-region deployment or do we need to handle cross-border payments and currency conversions immediately?
Compliance: Is the system expected to be fully PCI-DSS compliant, or will we use a vaulting service/third-party elements to minimize scope?
Assumptions for MVP:
Scale: 1,000 Peak TPS.
Scope: Support Card payments via external PSPs (e.g., Stripe, Adyen).
Consistency: Strong ACID compliance for the internal Ledger and Transaction state.
Idempotency: Mandatory 24-hour idempotency window for all API calls to prevent double charging.

Thinking Process

Core Bottleneck: Preventing double-charging in a distributed system under network instability.
Step 1: Implement a robust Idempotency Layer at the API Gateway/Service entry point using a distributed cache.
Step 2: Design a State Machine for payment processing (Initiated -> Pending -> Succeeded/Failed) to handle asynchronous callbacks from external PSPs.
Step 3: Establish a Double-Entry Ledger as the single source of truth for all money movements to ensure auditability and data integrity.
Step 4: Utilize an Asynchronous Event Pipeline for non-critical path actions like email receipts, webhooks, and analytics.
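The state machine in Step 2 can be sketched as a small transition table. This is a minimal illustration rather than a prescribed implementation: the names PaymentStatus, ALLOWED_TRANSITIONS, and transition are hypothetical, and treating a repeated terminal state as a no-op (to tolerate duplicate webhook deliveries) is an assumption not stated in the text.

```python
from enum import Enum


class PaymentStatus(Enum):
    INITIATED = "initiated"
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions: Initiated -> Pending -> Succeeded/Failed.
# Terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    PaymentStatus.INITIATED: {PaymentStatus.PENDING},
    PaymentStatus.PENDING: {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED},
    PaymentStatus.SUCCEEDED: set(),
    PaymentStatus.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a callback asks for a transition the table does not allow."""


def transition(current, target):
    """Return the new status; a repeat of the current state is a no-op (assumption)."""
    if current == target:
        return current  # e.g. the same PSP webhook delivered twice
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Rejecting anything outside the table means a late "failed" callback cannot overwrite a payment that has already succeeded.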

Bonus Points

Deterministic Idempotency Keys: Implementing client-generated UUIDs paired with request fingerprinting to prevent key collision or malicious reuse.
Saga Pattern for Distributed Transactions: Using an orchestration-based Saga to manage the lifecycle between the Payment Service, Ledger, and external PSPs without blocking.
PSP Smart Routing: For high availability, implementing a routing engine that can failover between PSPs if one gateway returns a 5xx or latency spikes.
Zero-Downtime Database Migrations: Using an "Expand and Contract" pattern for ledger schema changes to ensure 99.999% availability.
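Request fingerprinting, mentioned under Deterministic Idempotency Keys, could look roughly like the sketch below. It assumes the fingerprint is a SHA-256 hash of a canonical JSON form of the request body; the function names request_fingerprint and is_same_request are hypothetical.

```python
import hashlib
import json


def request_fingerprint(payload: dict) -> str:
    """Hash the request body in a canonical form (sorted keys, no extra whitespace)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_same_request(stored_fingerprint: str, payload: dict) -> bool:
    """True if a reused key carries the same request body; False signals misuse."""
    return stored_fingerprint == request_fingerprint(payload)
```

When a key is reused with a different fingerprint, the service can reject the call instead of replaying a response that belongs to another request.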
Design Breakdown

Functional Requirements

Core Use Cases:
Accept payments via Credit/Debit cards.
Track payment status and history.
Process payouts/refunds.
Maintain a ledger of all transactions.
Scope Control:
In-Scope: API-driven payment processing, Idempotency handling, Ledgering, Webhooks.
Out-of-Scope: Physical POS terminals, Fraud detection engine (will use PSP's basic fraud check), Tax calculation services.

Non-Functional Requirements

Scale: Support up to 10M transactions per day.
Latency: API response time < 500ms for payment initiation.
Availability: 99.999% (High availability is critical for financial services).
Consistency: Strong consistency for Ledger and Transaction states (ACID).
Fault Tolerance: Automatic retries for transient PSP failures; DLQs for failed webhooks.
Security: PCI-DSS compliance (Tokenization), TLS 1.3 encryption, and API Key authentication.

Estimation

Traffic Estimation:
10M transactions/day ≈ 115 Average TPS.
Peak (5x avg) ≈ 600 TPS.
Storage Estimation:
1 Payment Record + Ledger Entries ≈ 2KB.
10M records/day * 2KB = 20GB/day.
3 Years Retention ≈ 22TB.
Bandwidth Estimation:
Incoming: 600 TPS * 5KB request ≈ 3MB/s.
Outgoing: 600 TPS * 5KB response ≈ 3MB/s.
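The estimates above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1,000,000 KB) and 365-day years; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tx_per_day = 10_000_000
avg_tps = tx_per_day / SECONDS_PER_DAY   # about 115.7
peak_tps = 5 * avg_tps                   # about 579, rounded up to 600 in the estimate

record_kb = 2
storage_gb_per_day = tx_per_day * record_kb / 1_000_000  # 20 GB/day
storage_tb_3y = storage_gb_per_day * 365 * 3 / 1_000     # about 21.9 TB

rounded_peak_tps = 600
payload_kb = 5
bandwidth_mb_per_s = rounded_peak_tps * payload_kb / 1_000  # 3 MB/s in each direction
```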

Blueprint

Concise Summary: A microservices-based architecture centered around a Payment Orchestrator that manages state transitions between a distributed Idempotency store, a Relational DB for transaction metadata, and a Double-entry Ledger.
Major Components:
Payment Service: Orchestrates the payment lifecycle and enforces idempotency.
Ledger Service: An immutable, append-only record of all financial movements.
PSP Gateway: An abstraction layer to communicate with external providers like Stripe/Adyen.
Webhook Manager: Handles asynchronous success/failure signals from external providers.
Simplicity Audit: This design avoids complex distributed locks by utilizing database-level transactions (ACID) and a centralized idempotency store, providing the highest reliability with the lowest operational overhead for an MVP.
Architecture Decision Rationale:
Why this architecture?: Relational databases (Postgres) provide the ACID guarantees necessary for financial data.
Functional Satisfaction: Covers the entire lifecycle from payment initiation to final settlement.
Non-functional Satisfaction: High availability via stateless services and horizontal scaling; Consistency via RDBMS transactions.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: DNS-based global load balancing (Route 53) directing traffic to the nearest healthy region.
Security & Perimeter:
API Gateway: Handles AuthN/AuthZ via Merchant API Keys.
Rate Limiting: Tiered limits (e.g., 100 requests/sec per merchant) to prevent DoS.
SSL Termination: Enforced TLS 1.3 for all traffic.

Service

Topology & Scaling: Stateless Payment and Ledger services deployed in Multi-AZ EKS clusters. Scaling based on CPU and Request Count.
API Schema Design:
POST /v1/payments:
Request: {amount, currency, source_token, idempotency_key}
Response: {payment_id, status: "pending"}
Idempotency: Returns existing resource if key is reused within 24h.
Resilience & Reliability:
Retries: Exponential backoff for 5xx errors from PSPs.
Circuit Breaker: Trips if a specific PSP has a failure rate > 50%.
Observability:
Metrics: Transaction Success Rate (TSR), P99 Latency per PSP.
Tracing: Distributed tracing via OpenTelemetry to track a payment across services.
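The retry and circuit-breaker rules under Resilience & Reliability can be sketched as follows. This is a simplified illustration under stated assumptions: the sliding window size, the minimum call count, and the base and cap delays are hypothetical parameters, and the breaker omits the half-open recovery state and retry jitter that a production version would likely need.

```python
from collections import deque


def backoff_delays(base_seconds=0.2, factor=2.0, max_retries=4, cap_seconds=5.0):
    """Delays before each retry: base, base*factor, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * factor ** i) for i in range(max_retries)]


class CircuitBreaker:
    """Opens when the failure rate over the last `window` calls exceeds `threshold`."""

    def __init__(self, window=20, threshold=0.5, min_calls=10):
        self.results = deque(maxlen=window)  # True marks a failed PSP call
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, failed: bool) -> None:
        self.results.append(failed)

    def is_open(self) -> bool:
        if len(self.results) < self.min_calls:
            return False  # too few samples to judge the PSP
        return sum(self.results) / len(self.results) > self.threshold
```

One breaker instance per PSP keeps a failing provider from tripping traffic to a healthy one.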

Storage

Access Pattern: Heavy write (new transactions) and heavy point-reads (status checks).
Database Table Design:
Payments Table: id (PK), merchant_id, amount, status (enum), psp_reference, idempotency_key (unique per merchant_id).
Ledger Table: id (PK), payment_id, account_id, type (debit/credit), amount, timestamp.
Technical Selection: PostgreSQL.
Rationale: Robust support for ACID, JSONB for flexible PSP response storage, and excellent tooling for replication.
Distribution Logic: Sharding by merchant_id to ensure related transactions stay on the same node while allowing horizontal growth.
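A minimal sketch of the two tables, using Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a server. Several details are assumptions rather than part of the design above: amounts are stored as integers in minor units, the uniqueness constraint is scoped to (merchant_id, idempotency_key) so that it stays enforceable within one merchant shard, and create_payment is a hypothetical helper.

```python
import sqlite3

SCHEMA = """
CREATE TABLE payments (
    id              TEXT PRIMARY KEY,
    merchant_id     TEXT NOT NULL,
    amount          INTEGER NOT NULL,  -- minor units such as cents (assumption)
    status          TEXT NOT NULL
                    CHECK (status IN ('initiated', 'pending', 'succeeded', 'failed')),
    psp_reference   TEXT,
    idempotency_key TEXT NOT NULL,
    UNIQUE (merchant_id, idempotency_key)
);
CREATE TABLE ledger (
    id         INTEGER PRIMARY KEY,
    payment_id TEXT NOT NULL REFERENCES payments(id),
    account_id TEXT NOT NULL,
    type       TEXT NOT NULL CHECK (type IN ('debit', 'credit')),
    amount     INTEGER NOT NULL CHECK (amount > 0),
    timestamp  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_payment(conn, payment_id, merchant_id, amount, idempotency_key):
    """Insert a payment; return False if this merchant already used the key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (id, merchant_id, amount, status, idempotency_key) "
                "VALUES (?, ?, ?, 'initiated', ?)",
                (payment_id, merchant_id, amount, idempotency_key),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The unique constraint is what lets the database act as the last line of defense against a duplicate charge, independent of any cache in front of it.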

Cache

Purpose & Justification: Idempotency enforcement. To prevent race conditions where two identical requests arrive simultaneously, we check/set the idempotency key in a fast cache.
Key-Value Schema: key: idempotency_key:{merchant_id}:{uuid}, value: {status, response_body}.
Technical Selection: Redis with AOF (Append Only File) enabled for durability.
Failure Handling: If Redis is down, the Payment Service falls back to the database idempotency_key unique constraint (slower but safe).
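The check/set step can be illustrated with an in-memory stand-in for Redis. In a real deployment the set_if_absent call would correspond to the Redis command SET key value NX EX 86400, which is atomic on the server; the InMemoryKV class and handle_payment function below are hypothetical, not thread-safe, and do not handle a charge that raises midway.

```python
import time


class InMemoryKV:
    """Tiny stand-in for Redis supporting set-if-absent with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set_if_absent(self, key, value, ttl_seconds):
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key is held and has not expired
        self._data[key] = (value, now + ttl_seconds)
        return True

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]


IDEMPOTENCY_TTL = 24 * 60 * 60  # the 24-hour idempotency window


def handle_payment(store, merchant_id, key, charge):
    """Run `charge` at most once per (merchant, key) inside the TTL window."""
    cache_key = f"idempotency_key:{merchant_id}:{key}"
    if store.set_if_absent(cache_key, {"status": "processing"}, IDEMPOTENCY_TTL):
        response = charge()
        store.set(cache_key, {"status": "completed", "response_body": response}, IDEMPOTENCY_TTL)
        return response
    existing = store.get(cache_key)
    if existing["status"] == "processing":
        return {"error": "conflict"}  # the first request is still in flight
    return existing["response_body"]
```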

Messaging

Purpose & Decoupling: Decouples payment execution from downstream side-effects (Notifications, Business Intelligence).
Event / Topic Schema: payment.succeeded, payment.failed. Payload contains payment_id, merchant_id, and timestamp.
Technical Selection: Kafka.
Rationale: High throughput, long retention for replayability if downstream services fail.
Failure Handling: Dead-letter queues (DLQ) for messages that fail processing after 5 retries.
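The dead-letter rule can be sketched without a broker. The consume function below is a hypothetical in-memory loop, not Kafka client code: it retries a failing handler up to five times and then parks the event in a dead-letter list. It reads "after 5 retries" as one initial attempt plus five retries, which is an interpretation of the rule above.

```python
MAX_RETRIES = 5  # retries allowed before an event goes to the DLQ


def consume(events, handler, max_retries=MAX_RETRIES):
    """Process each event; park it in the DLQ once all retries have failed."""
    processed, dead_letters = [], []
    for event in events:
        attempts = 0
        while True:
            try:
                handler(event)
                processed.append(event)
                break
            except Exception as exc:
                attempts += 1
                if attempts > max_retries:  # initial attempt plus every retry failed
                    dead_letters.append(
                        {"event": event, "error": str(exc), "attempts": attempts}
                    )
                    break
    return processed, dead_letters
```

Keeping the error and attempt count alongside the event makes it easier to decide later whether to replay or discard it.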
Wrap Up

Advanced Topics

Trade-offs: We chose Consistency over Availability (CP in CAP terms; in the PACELC model, consistency is favored both during partitions and in normal operation). If the database is unavailable, we refuse to process payments rather than risk inconsistent ledger states or double-charging.
Idempotency Strategy: We implement "Execute-Once" semantics. The first request creates a "Processing" record in Redis; subsequent requests with the same key wait or receive a "Conflict" response until the first one completes.
Reliability & Failure Handling:
PSP Timeouts: If a PSP call times out, the system marks the payment as pending and uses a background reconciliation worker to query the PSP's status later.
Security & Privacy:
Tokenization: We never store raw Credit Card numbers. We store "Source Tokens" provided by PSP-hosted fields (e.g., Stripe Elements).
Distinguishing Insights:
Reconciliation Worker: A critical background process that compares the internal Ledger against the daily CSV exports from PSPs to find discrepancies (the "missing" 0.01% of transactions).
Double-Entry Integrity: Every "Pay-in" must result in a Credit to the Merchant account and a Debit to the System Clearing account.
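Both insights can be illustrated with a short sketch. It assumes integer amounts in minor units, an account named system_clearing, and PSP exports already parsed into a mapping of reference to amount; the function names are hypothetical.

```python
from collections import defaultdict


def pay_in_entries(payment_id, merchant_account, amount):
    """Two ledger rows for one pay-in: debit clearing, credit the merchant."""
    return [
        {"payment_id": payment_id, "account_id": "system_clearing", "type": "debit", "amount": amount},
        {"payment_id": payment_id, "account_id": merchant_account, "type": "credit", "amount": amount},
    ]


def is_balanced(entries):
    """Debits must equal credits for every payment."""
    totals = defaultdict(int)
    for entry in entries:
        sign = 1 if entry["type"] == "debit" else -1
        totals[entry["payment_id"]] += sign * entry["amount"]
    return all(total == 0 for total in totals.values())


def reconcile(ledger_amounts, psp_amounts):
    """Compare {psp_reference: amount} maps from the ledger and a PSP export."""
    shared = ledger_amounts.keys() & psp_amounts.keys()
    return {
        "missing_in_ledger": sorted(psp_amounts.keys() - ledger_amounts.keys()),
        "missing_at_psp": sorted(ledger_amounts.keys() - psp_amounts.keys()),
        "amount_mismatch": sorted(
            ref for ref in shared if ledger_amounts[ref] != psp_amounts[ref]
        ),
    }
```

Any non-empty list in the reconcile result is a discrepancy for an operator or an automated repair job to investigate.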