The Question
Design a Scalable Digital Wallet System
Design a high-performance digital wallet system that supports secure P2P money transfers, maintains strict financial consistency, and handles millions of transactions daily while ensuring a full audit trail and protection against double-spending.
PostgreSQL
Redis
Kafka
Double-Entry Bookkeeping
Outbox Pattern
Questions & Insights
Clarifying Questions
What is the scale of the system? (Assumption: 10M DAU, 100M total users, peak 10,000 Transactions Per Second (TPS)).
What are the core features for the MVP? (Assumption: P2P transfers, balance inquiries, and transaction history. KYC and external bank integrations are handled by existing third-party mock services).
What are the consistency requirements? (Assumption: Strict consistency is mandatory for financial balances; eventual consistency is acceptable for transaction history and notifications).
Is this a multi-currency system? (Assumption: No, MVP will support a single currency to avoid complex FX rate management and rounding issues).
How long should we retain transaction history? (Assumption: 7 years for regulatory compliance, but only the last 6 months need to be high-performance).
Thinking Process
The Core Bottleneck: Database contention on hot accounts (e.g., a popular merchant or a platform's system account).
Key Strategy:
How do we ensure money is never created or lost? Use Double-Entry Bookkeeping at the database level.
How do we handle high concurrency? Implement Idempotency via unique request keys and optimistic locking for balance updates.
How do we scale the ledger? Use a relational database sharded by account_id while maintaining a centralized audit log.
How do we ensure system reliability? Decouple the "Core Transaction" (atomic balance update) from "Side Effects" (notifications, history indexing) using an Outbox Pattern.
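The first two strategies can be sketched together. The following is a minimal illustration rather than the design's actual implementation: it uses Python's built-in sqlite3 in place of PostgreSQL, stores amounts as integer minor units (an assumption), and the names open_ledger_db, transfer, InsufficientFunds, and ConcurrentUpdate are hypothetical. Each transfer writes one negative and one positive ledger entry in a single database transaction, and each balance update is guarded by the version column.

```python
import sqlite3


class InsufficientFunds(Exception):
    pass


class ConcurrentUpdate(Exception):
    pass


def open_ledger_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified Accounts and Ledger tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (
            account_id TEXT PRIMARY KEY,
            balance    INTEGER NOT NULL CHECK (balance >= 0),  -- minor units (assumption)
            version    INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE ledger (
            entry_id       INTEGER PRIMARY KEY AUTOINCREMENT,
            account_id     TEXT NOT NULL,
            transaction_id TEXT NOT NULL,
            amount         INTEGER NOT NULL  -- negative = debit, positive = credit
        );
    """)
    return conn


def _post_entry(conn, account_id, delta, transaction_id):
    """Apply one side of a transfer using an optimistic version check."""
    row = conn.execute(
        "SELECT balance, version FROM accounts WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    if row is None:
        raise KeyError(account_id)
    balance, version = row
    if balance + delta < 0:
        raise InsufficientFunds(account_id)
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE account_id = ? AND version = ?",
        (balance + delta, account_id, version),
    )
    if cur.rowcount != 1:
        # Another writer bumped the version first; the caller may retry.
        raise ConcurrentUpdate(account_id)
    conn.execute(
        "INSERT INTO ledger (account_id, transaction_id, amount) VALUES (?, ?, ?)",
        (account_id, transaction_id, delta),
    )


def transfer(conn, transaction_id, sender_id, receiver_id, amount):
    """Move funds with two ledger entries that always sum to zero."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    with conn:  # commits on success, rolls back on any exception
        _post_entry(conn, sender_id, -amount, transaction_id)
        _post_entry(conn, receiver_id, amount, transaction_id)
```

Because both entries are written inside one transaction, a failure on the receiver side rolls back the sender's debit as well, so the sum of all ledger amounts stays at zero.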
Bonus Points
Double-Entry Bookkeeping Integrity: Implementing a "Total Sum Zero" constraint across the ledger to detect internal fraud or software bugs in real-time.
Hot Partition Mitigation: Using "Slotting" or "Sharding" for high-volume system accounts (e.g., the platform's commission account) to prevent row-level locking bottlenecks.
Distributed Tracing for Financial Flow: Injecting a trace_id that links a user's click to the ledger entry, the message queue, and the final notification for sub-second debugging.
Deterministic State Machine: Designing the transaction processor as a pure function to allow for "Replayability" during disaster recovery.
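The "Slotting" idea for hot accounts can be illustrated with a small in-memory sketch. It assumes a hot account is split into a fixed number of sub-balances and that each credit is routed to one slot by hashing the transaction ID; SlottedAccount and its methods are hypothetical names, and in a real deployment each slot would be a separate database row so that concurrent credits lock different rows.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class SlottedAccount:
    """A hot account whose balance is spread over several slots."""
    account_id: str
    num_slots: int = 8
    slots: list = field(default_factory=list)

    def __post_init__(self):
        if not self.slots:
            self.slots = [0] * self.num_slots

    def slot_for(self, transaction_id: str) -> int:
        # Deterministic routing: the same transaction always hits the same slot.
        return zlib.crc32(transaction_id.encode()) % self.num_slots

    def credit(self, transaction_id: str, amount: int) -> int:
        if amount <= 0:
            raise ValueError("amount must be positive")
        idx = self.slot_for(transaction_id)
        self.slots[idx] += amount
        return idx

    def total(self) -> int:
        # The logical balance is the sum over all slots.
        return sum(self.slots)
```

The trade-off is that reading the balance now requires summing the slots, which suits accounts that are written far more often than they are read, such as a commission account.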
Design Breakdown
Functional Requirements
Users can create a digital wallet.
Users can check their current balance.
Users can transfer funds to other users (P2P).
Users can view a history of their transactions.
System must ensure idempotency for all financial operations.
Non-Functional Requirements
Strong Consistency: Balances must always be accurate and atomic.
High Availability: 99.99% availability for read operations; 99.9% for write operations.
Scalability: Support up to 10k TPS.
Auditability: Every change to a balance must have an immutable audit trail.
Low Latency: Wallet-to-wallet transfers should complete under 200ms.
Estimation
Storage: 100M users × 1KB per user profile = 100GB. 10k TPS × 86,400s × 500 bytes per transaction = ~430GB/day if peak load were sustained all day, or roughly 158TB per year as an upper bound (requires aggressive cold storage/archiving).
Throughput: 10k TPS is manageable by a well-sharded RDBMS cluster (e.g., RDS Aurora with 10+ shards).
Cache: 10M DAU × 5 balance checks/day = 50M reads/day (~600 QPS on average). Easily handled by a single small Redis node.
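The arithmetic behind these estimates can be checked with a few lines. The inputs are the assumptions stated in this write-up (decimal units, peak TPS sustained for the whole day); the estimate function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate(users=100_000_000, profile_bytes=1_000, peak_tps=10_000,
             tx_bytes=500, dau=10_000_000, reads_per_user=5):
    """Back-of-the-envelope sizing using decimal GB/TB."""
    daily_tx = peak_tps * SECONDS_PER_DAY           # transactions per day at sustained peak
    daily_gb = daily_tx * tx_bytes / 1e9            # ledger growth per day
    return {
        "profile_storage_gb": users * profile_bytes / 1e9,
        "daily_transactions": daily_tx,
        "daily_ledger_gb": daily_gb,
        "yearly_ledger_tb": daily_gb * 365 / 1e3,
        "avg_balance_read_qps": dau * reads_per_user / SECONDS_PER_DAY,
    }
```

With the default inputs this gives 100 GB of profile data, 432 GB of ledger data per day, about 158 TB per year, and about 579 balance reads per second on average. Real traffic is rarely at peak all day, so the storage figures are upper bounds.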
Blueprint
Concise Summary: A microservices-based architecture centered around a strictly consistent Ledger Service using an RDBMS for ACID guarantees, complemented by an event-driven flow for non-critical path operations.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Wallet Service: Manages user wallet metadata and state.
Transaction Service: Coordinates the "transfer" workflow and ensures idempotency.
Ledger Service: The source of truth; executes atomic double-entry bookkeeping movements.
Notification Service: Asynchronously informs users of transaction outcomes.
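Simplicity Audit: This design avoids a standalone distributed transaction coordinator (e.g., a Saga orchestrator) for the MVP by consolidating the ledger into a single logical (though sharded) database; the only cross-shard coordination is the 2-phase commit confined to the Ledger Service.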
Architecture Decision Rationale:
Why this architecture?: RDBMS is the industry standard for financial integrity. The separation of Transaction (orchestration) and Ledger (execution) allows the ledger to remain slim and high-performance.
Functional Requirement Satisfaction: All P2P and balance needs are met with high integrity.
Non-functional Requirement Satisfaction: Scalability is achieved via database sharding; availability is achieved via multi-AZ deployment.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Stateless microservices deployed in Kubernetes (K8s) across 3 Availability Zones. Auto-scaling based on CPU (threshold 60%) and Request Count.
API Schema Design:
POST /v1/transfers
Protocol: REST
Request: { sender_id, receiver_id, amount, idempotency_key }
Response: { transaction_id, status, timestamp }
Idempotency: Mandatory idempotency_key stored in Redis for 24h.
Rate Limit: 10 req/sec per user.
Resilience & Reliability:
Exponential backoff for internal service calls.
Circuit breakers on the Ledger Service to prevent cascading failures if the DB is under load.
Observability: Prometheus for RED metrics; Jaeger for tracing transaction flow.
Security: JWT-based AuthN; mTLS for service-to-service communication; AES-256 encryption at rest for sensitive PII.
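The idempotency rule for POST /v1/transfers can be sketched as follows. This is a simplified, single-process illustration: an in-memory dictionary with a TTL stands in for Redis, and IdempotencyStore, handle_transfer, and the execute callback are hypothetical names. A production version would need to reserve the key atomically (for example with Redis SET using the NX and EX options) so that two concurrent requests with the same key cannot both execute.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # idempotency_key -> (expires_at, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as never seen
            return None
        return response

    def put(self, key, response):
        self._entries[key] = (self._clock() + self._ttl, response)


def handle_transfer(store, request, execute):
    """Return the stored response for a repeated key; otherwise execute once."""
    key = request["idempotency_key"]
    cached = store.get(key)
    if cached is not None:
        return cached
    response = execute(request)
    store.put(key, response)
    return response
```

A client that retries after a timeout therefore receives the original transaction_id instead of triggering a second transfer, as long as the retry arrives within the 24h window.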
Storage
Access Pattern:
Writes: High-frequency append-only ledger entries.
Reads: High-frequency point lookups for balance.
Database Table Design:
Accounts: account_id (PK), user_id, balance, version (for optimistic locking).
Ledger: entry_id (PK), account_id, transaction_id, amount (positive/negative), type (debit/credit), created_at.
Technical Selection: PostgreSQL (RDS Aurora). Rationale: Strong ACID compliance, mature ecosystem, and excellent support for partitioning.
Distribution Logic: Sharded by account_id using consistent hashing. Cross-shard transfers are handled by a 2-phase commit within the Ledger Service (limited to the MVP context).
Reliability & Recovery: Daily snapshots + Continuous WAL (Write Ahead Log) archiving to S3 for Point-in-Time Recovery (PITR).
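Routing an account_id to a shard by consistent hashing might look like the sketch below. It assumes a hash ring with virtual nodes built on SHA-256; HashRing, shard_for, is_cross_shard, and the shard names are hypothetical, and the number of virtual nodes is an arbitrary choice.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring that maps account IDs to shard names."""

    def __init__(self, shards, vnodes=100):
        ring = []
        for shard in shards:
            for i in range(vnodes):
                # Each shard owns many points on the ring to even out load.
                ring.append((self._hash(f"{shard}#{i}"), shard))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [shard for _, shard in ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def shard_for(self, account_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(account_id))
        return self._owners[idx % len(self._owners)]


def is_cross_shard(ring: HashRing, sender_id: str, receiver_id: str) -> bool:
    """True when a transfer touches two shards and needs cross-shard coordination."""
    return ring.shard_for(sender_id) != ring.shard_for(receiver_id)
```

The property that matters for resharding is that adding a shard only moves the keys that now belong to the new shard; accounts on the other shards stay where they are.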
Cache
Purpose & Justification: Reduces read pressure on the Accounts table for balance inquiries.
Key-Value Schema:
Key: bal:{account_id}
Value: decimal_amount
TTL: 30 seconds (Write-through update from Ledger Service).
Technical Selection: Redis (Cluster Mode).
Failure Handling: If Redis is down, fall back to the Accounts table in Postgres (Direct-to-DB).
Messaging
Purpose & Decoupling: Decouples the Ledger (critical path) from History and Notifications (non-critical).
Event / Topic Schema: transaction.completed topic. Payload: { tx_id, from_account, to_account, amount, timestamp }.
Throughput & Partitioning: Kafka with 32 partitions, keyed by transaction_id to ensure ordering for specific transactions.
Failure Handling: Dead-letter queues (DLQ) for failed notification attempts.
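The Outbox Pattern that feeds the transaction.completed topic can be sketched with sqlite3 standing in for PostgreSQL. The table layout, the function names record_transfer and relay_outbox, and the publish callback (a placeholder for a Kafka producer) are all assumptions made for illustration. The key point is that the ledger rows and the outbox row are committed in the same database transaction, and a separate relay publishes pending rows afterwards.

```python
import json
import sqlite3


def open_outbox_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ledger (
            entry_id       INTEGER PRIMARY KEY AUTOINCREMENT,
            account_id     TEXT NOT NULL,
            transaction_id TEXT NOT NULL,
            amount         INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            topic     TEXT NOT NULL,
            msg_key   TEXT NOT NULL,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def record_transfer(conn, tx_id, from_account, to_account, amount, timestamp):
    """Write both ledger entries and the event row atomically."""
    payload = json.dumps({
        "tx_id": tx_id, "from_account": from_account,
        "to_account": to_account, "amount": amount, "timestamp": timestamp,
    })
    with conn:  # one transaction: either all three rows exist or none do
        conn.execute("INSERT INTO ledger (account_id, transaction_id, amount) VALUES (?, ?, ?)",
                     (from_account, tx_id, -amount))
        conn.execute("INSERT INTO ledger (account_id, transaction_id, amount) VALUES (?, ?, ?)",
                     (to_account, tx_id, amount))
        conn.execute("INSERT INTO outbox (topic, msg_key, payload) VALUES (?, ?, ?)",
                     ("transaction.completed", tx_id, payload))


def relay_outbox(conn, publish) -> int:
    """Publish pending events in order; a row is marked only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, msg_key, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, msg_key, payload in rows:
        publish(topic, msg_key, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the relay crashes between publishing and marking a row, the event is sent again on the next run, so delivery is at-least-once and consumers should deduplicate on tx_id.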
Wrap Up
Advanced Topics
Monitoring: Critical alert on "Sum of Ledger Entries != Sum of Account Balances."
Trade-offs: We chose an RDBMS over a NoSQL database to prioritize Consistency (C) over Availability (A) in the CAP theorem, which is essential for money.
Bottlenecks: The primary bottleneck is the row-level lock on a single account during a transfer.
Alternatives: For extreme scale, we could use an "LMAX Disruptor" pattern (in-memory lock-free execution), but it increases operational complexity significantly.
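The monitoring alert listed under Advanced Topics could be backed by a reconciliation job along these lines. This sketch assumes that every balance starts at zero and that all funding is itself recorded as ledger entries (for example against a system account), so each account's balance should equal the sum of its entries and each transaction's entries should sum to zero. The reconcile function and its input shapes are hypothetical.

```python
from collections import defaultdict


def reconcile(balances, ledger_entries):
    """Compare stored balances with the ledger and flag anything inconsistent.

    balances: dict of account_id -> stored balance
    ledger_entries: iterable of dicts with account_id, transaction_id, amount
    """
    per_account = defaultdict(int)
    per_transaction = defaultdict(int)
    for entry in ledger_entries:
        per_account[entry["account_id"]] += entry["amount"]
        per_transaction[entry["transaction_id"]] += entry["amount"]

    accounts = set(balances) | set(per_account)
    mismatched = sorted(
        a for a in accounts if balances.get(a, 0) != per_account.get(a, 0)
    )
    # In double-entry bookkeeping the entries of one transaction cancel out.
    unbalanced = sorted(t for t, total in per_transaction.items() if total != 0)
    return {"mismatched_accounts": mismatched, "unbalanced_transactions": unbalanced}
```

An empty result for both lists means the ledger and the balances agree; any non-empty list is the condition that should raise the critical alert.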