The Question
Design

Mortgage Agent Lead and Document Management System

Design a secure, highly-available mortgage agent platform that facilitates lead management, document collection, and integration with third-party financial services. The system must handle high-volume document uploads (PDFs/Images), ensure strict PII data protection (GLBA compliance), and manage asynchronous workflows like credit report fetching and loan status updates. Focus on document lifecycle management, external API integration reliability, and a robust security architecture for financial data.
PostgreSQL
S3
Redis
SQS
AWS Lambda
AWS KMS
OIDC
AES-GCM
CDC
gRPC
Questions & Insights

Clarifying Questions

What is the target scale for the MVP?
Assumption: 10,000 active mortgage agents, each managing ~100 active loan applications simultaneously.
What are the primary document requirements?
Assumption: High volume of PDF/Image uploads (bank statements, tax returns). Each application averages 50 documents (~100MB total).
Does the app need to integrate with external credit bureaus or Loan Origination Systems (LOS)?
Assumption: Integration with at least one major credit bureau (e.g., Equifax) and a generic LOS API is required for the MVP.
What is the regulatory/compliance landscape?
Assumption: The system must adhere to GLBA (Gramm-Leach-Bliley Act) and SOC2 standards, requiring strict PII (Personally Identifiable Information) encryption.
Is real-time communication required?
Assumption: Asynchronous notifications (Email/SMS) are sufficient for the MVP; no real-time WebSocket chat is needed yet.

Thinking Process

Security-First Architecture: Mortgage data is highly sensitive. The design must center on data isolation and on encryption both at rest and in transit.
Document Management Bottleneck: Large file uploads and processing (OCR/Virus scanning) are the primary heavy-lifting tasks.
Integration Management: Third-party financial APIs are brittle, so calls to them need circuit breakers, and inbound webhooks must be handled idempotently.
Step-by-Step Evolution:
How do we securely capture and store lead/borrower data?
How do we manage the asynchronous lifecycle of document verification?
How do we integrate with external credit and banking systems without blocking the UI?

Bonus Points

Envelope Encryption: Using a Master Key in AWS KMS to encrypt Data Encryption Keys (DEKs) for each loan folder to ensure high-security isolation.
Zero-Knowledge Document Triage: Implementing pre-signed URLs for direct S3 uploads to minimize server-side exposure to PII.
Transactional Outbox Pattern: Ensuring consistency between the Loan Database and the Notification Service to prevent missing updates on loan status changes.
Multi-Region Disaster Recovery: Active-Passive setup with cross-region S3 replication and an Amazon Aurora Global Database, targeting an RTO under 15 minutes.
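
The Transactional Outbox item can be illustrated with a short sketch. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table names (loans, outbox), the function names, and the publish callback are all hypothetical choices made for this example rather than part of the design. The point it tries to show is that the status update and the event row are written in one transaction, and a separate relay publishes the event afterwards.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal schema: one business table and one outbox table.
    conn.executescript(
        """
        CREATE TABLE loans (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def change_status(conn: sqlite3.Connection, loan_id: str, new_status: str) -> None:
    # The status update and the event row commit or roll back together.
    with conn:
        row = conn.execute(
            "SELECT status FROM loans WHERE id = ?", (loan_id,)
        ).fetchone()
        if row is None:
            raise KeyError(loan_id)
        conn.execute(
            "UPDATE loans SET status = ? WHERE id = ?", (new_status, loan_id)
        )
        payload = json.dumps(
            {"loanId": loan_id, "oldStatus": row[0], "newStatus": new_status}
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("loan.status.changed", payload),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    # Publish pending events in order, marking each row only after publish succeeds.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If publish raises, the row stays unpublished and is retried on the next relay pass, so delivery is at-least-once and the Notification Service should tolerate duplicates.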
Design Breakdown

Functional Requirements

Core Use Cases:
Agent can create and manage Leads.
Borrower can upload financial documents via a secure link.
Agent can request credit reports for leads.
System tracks Loan Application status (Draft, Submitted, Underwriting, Closed).
Scope Control:
In-scope: Lead management, document storage, external credit check integration, status tracking.
Out-of-scope: Full automated underwriting engine, real-time video conferencing, property valuation (Appraisal) management.
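
The status tracking use case can be sketched as a small state machine. The four statuses come from the requirements above; the transition rules (for example, that any non-closed loan may be closed, and that nothing leaves Closed) are assumptions made for illustration, as are the names LoanStatus, ALLOWED_TRANSITIONS, and transition.

```python
from enum import Enum


class LoanStatus(Enum):
    DRAFT = "Draft"
    SUBMITTED = "Submitted"
    UNDERWRITING = "Underwriting"
    CLOSED = "Closed"


# Assumed rules: loans move forward one step at a time, and any open loan
# can be closed (e.g. withdrawn). Closed is terminal.
ALLOWED_TRANSITIONS = {
    LoanStatus.DRAFT: {LoanStatus.SUBMITTED, LoanStatus.CLOSED},
    LoanStatus.SUBMITTED: {LoanStatus.UNDERWRITING, LoanStatus.CLOSED},
    LoanStatus.UNDERWRITING: {LoanStatus.CLOSED},
    LoanStatus.CLOSED: set(),
}


def transition(current: LoanStatus, target: LoanStatus) -> LoanStatus:
    # Reject any move that the table above does not explicitly allow.
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the rules in one table makes it easier to review them with compliance stakeholders and to reject out-of-order updates arriving from external systems.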

Non-Functional Requirements

Scale: Support 50M+ total documents (1M active loans at ~50 documents each, per the clarifying assumptions) and 1M+ lead/loan records.
Latency: API response < 200ms for dashboard operations; document processing < 5s.
Availability & Reliability: 99.9% uptime; no data loss for financial documents.
Consistency: Strong consistency for loan status and financial figures (Relational DB required).
Fault Tolerance: Handle 3rd party API downtimes via retries and DLQs.
Security & Privacy: AES-256 encryption, MFA for agents, and RBAC (Role-Based Access Control).

Estimation

Traffic Estimation:
10k DAU (Agents) * 50 requests/day = 500k daily requests.
Average QPS = ~6; Peak QPS = ~100.
Storage Estimation:
1M active loans * 100MB docs/loan = 100TB S3 storage.
Metadata: 1M loans * 10KB/record = 10GB (Relational DB).
Bandwidth Estimation:
Uploads: 1M docs/month * 2MB/doc = 2TB/month (~6 Mbps avg over a 30-day month).
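
The arithmetic behind these estimates can be checked with a few lines of Python. The inputs are the figures stated in this section; decimal units (1 TB = 10^12 bytes) and a 30-day month are assumptions of this sketch, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY  # assumes a 30-day month

# Traffic: 10k agents, 50 requests each per day.
daily_requests = 10_000 * 50
avg_qps = daily_requests / SECONDS_PER_DAY

# Storage: 1M active loans at 100 MB of documents each, in decimal TB.
active_loans = 1_000_000
storage_tb = active_loans * 100 / 1_000_000

# Metadata: 10 KB per loan record, in decimal GB.
metadata_gb = active_loans * 10 / 1_000_000

# Bandwidth: 1M documents per month at 2 MB each.
upload_tb_per_month = 1_000_000 * 2 / 1_000_000
upload_mbps = upload_tb_per_month * 1e12 * 8 / SECONDS_PER_MONTH / 1e6

print(f"daily requests: {daily_requests}")
print(f"average QPS:    {avg_qps:.1f}")
print(f"S3 storage:     {storage_tb:.0f} TB")
print(f"metadata:       {metadata_gb:.0f} GB")
print(f"upload rate:    {upload_mbps:.1f} Mbps")
```

Under these assumptions the average comes out near 5.8 QPS and the sustained upload rate near 6.2 Mbps, so the stated peak of ~100 QPS corresponds to a burst factor of roughly 17x over the average.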

Blueprint

Concise Summary: A microservice-based architecture utilizing a Relational Database for ACID compliance on loan data and Object Storage for heavy document management.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request routing.
Loan/Lead Service: Manages the core business logic and state transitions of mortgages.
Document Service: Handles secure uploads via pre-signed URLs and manages metadata.
Integration Service: Orchestrates calls to Credit Bureaus and external Bank APIs.
Notification Service: Sends asynchronous updates to agents and borrowers via SQS.
Simplicity Audit: This architecture avoids complex service meshes or real-time streaming (Flink/Kafka) in favor of simple SQS queues and a managed RDBMS to ensure reliability and ease of audit for the MVP.
Architecture Decision Rationale:
Relational DB: Essential for financial integrity (ACID) and complex queries over lead status.
SQS for Integrations: 3rd party APIs are unreliable; queues allow for retries without impacting user experience.
Pre-signed URLs: Reduce server load and PII exposure because file bytes travel directly from the client to S3 instead of passing through the application servers.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
CloudFront used for serving the static frontend (React/Mobile) and caching public marketing assets.
Latency-based DNS routing via Route53.
Security & Perimeter:
API Gateway handles OIDC (OpenID Connect) validation.
WAF (Web Application Firewall) enabled to block SQL injection and common OWASP Top 10 threats.
Rate limiting set at 50 requests per second per IP to prevent scraping of lead data.

Service

Topology & Scaling:
Stateless services running on AWS EKS (Kubernetes) for auto-scaling based on CPU/Memory.
Multi-AZ deployment to ensure high availability.
API Schema Design:
POST /v1/loans: Create a new loan application. (REST)
GET /v1/loans/{id}/docs/upload-url: Request a pre-signed S3 URL for secure upload.
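POST /v1/credit-report/{leadId}: Trigger an async credit check. (POST rather than GET, because the call has side effects and should be covered by the idempotency rule below.)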
Idempotency: All POST requests require an X-Idempotency-Key to prevent duplicate loan creations.
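
The idempotency rule can be sketched as follows. This is a minimal in-memory illustration, not the service's actual implementation: the LoanApi class, its dictionaries, and the response shapes are hypothetical, and a real service would more likely persist the key under a unique constraint in the same transaction that creates the loan.

```python
import uuid
from typing import Dict, Tuple

Response = Tuple[int, dict]


class LoanApi:
    def __init__(self) -> None:
        self._loans: Dict[str, dict] = {}
        # Maps X-Idempotency-Key values to the response first returned for them.
        self._responses_by_key: Dict[str, Response] = {}

    def create_loan(self, idempotency_key: str, body: dict) -> Response:
        if not idempotency_key:
            return 400, {"error": "X-Idempotency-Key header is required"}
        cached = self._responses_by_key.get(idempotency_key)
        if cached is not None:
            # Retry of a request we already handled: replay the stored response.
            return cached
        loan_id = str(uuid.uuid4())
        self._loans[loan_id] = {"id": loan_id, "status": "Draft", **body}
        response: Response = (201, {"id": loan_id, "status": "Draft"})
        self._responses_by_key[idempotency_key] = response
        return response

    def loan_count(self) -> int:
        return len(self._loans)
```

A client that times out and retries POST /v1/loans with the same key then receives the original loan id instead of creating a second application.
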
Resilience & Reliability:
Exponential backoff (initial 1s, max 30s) for all integration worker retries.
Circuit Breakers (Resilience4j) on the Integration Worker to stop hammering Credit Bureau APIs if they return 5xx errors.
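
The retry and circuit-breaker behavior described here can be sketched in Python. This is a simplified illustration of the two techniques, not the Resilience4j API (which is a Java library); the class and function names, the failure threshold, and the reset interval are assumptions. Only the 1s initial delay and 30s cap come from the text, and jitter, which is commonly added to backoff, is omitted for brevity.

```python
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    initial: float = 1.0, maximum: float = 30.0, attempts: int = 7
) -> Iterator[float]:
    # Doubles the delay on each attempt and caps it at `maximum`.
    delay = initial
    for _ in range(attempts):
        yield min(delay, maximum)
        delay *= 2


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Reset interval elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

With the defaults, list(backoff_delays()) yields 1, 2, 4, 8, 16, 30, 30 seconds. An Integration Worker would wrap each Credit Bureau call in breaker.call(...) and leave the message on the queue when CircuitOpenError is raised.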

Cache

Purpose & Justification: Reduces load on the Loan DB for frequently accessed agent session data and rate-limiting counters.
Key-Value Schema:
session:{userId} -> session_token (TTL 24h).
ratelimit:{ip} -> count (TTL 1m).
Technical Selection: Redis. Provides sub-millisecond latency for session management.
Failure Handling: If Redis fails, the system falls back to the DB for session verification (slight latency hit).
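
The ratelimit:{ip} counter can be sketched as a fixed-window limiter. To keep the example self-contained it uses an in-memory dictionary as a stand-in for Redis; in a real deployment the increment would typically map to Redis INCR and the window expiry to EXPIRE. The FixedWindowLimiter name and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        # key -> (count, expires_at); stands in for a Redis key with a TTL.
        self._counters: Dict[str, Tuple[int, float]] = {}

    def allow(self, ip: str) -> bool:
        key = f"ratelimit:{ip}"
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Window expired (or key never existed): start a fresh window.
            count, expires_at = 0, now + self._window
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self._limit
```

A fixed window is simple to reason about but allows short bursts at window boundaries; a sliding window or token bucket would smooth that out at the cost of more state.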

Messaging

Purpose & Decoupling: Decouples the core Loan Service from slow, unreliable external integrations (Credit/LOS) and notification delivery.
Event / Topic Schema:
loan.status.changed: Payload contains loanId, oldStatus, newStatus.
doc.uploaded: Payload contains s3Key, loanId.
Throughput & Partitioning: SQS Standard queues are sufficient for the expected volume (~10-50 messages/sec). Standard queues deliver at least once and do not guarantee ordering, so consumers must be idempotent.
Failure Handling: Dead-letter queues (DLQ) are configured for all queues. Messages that land in a DLQ trigger an alert to Ops for manual intervention.
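
The consumer side of this failure handling can be sketched with an in-memory simulation. It does not call SQS; a deque stands in for the queue, a list for the DLQ, and MAX_RECEIVES mirrors the role of the maxReceiveCount setting in an SQS redrive policy (the value 3 is an assumption). The message shape and the drain function are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Set

MAX_RECEIVES = 3  # assumed value; plays the role of maxReceiveCount


def drain(
    queue: Deque[dict],
    handler: Callable[[dict], None],
    processed_ids: Set[str],
    dlq: List[dict],
) -> None:
    while queue:
        msg = queue.popleft()
        if msg["id"] in processed_ids:
            # Duplicate delivery of a message we already handled: drop it.
            continue
        msg["receives"] = msg.get("receives", 0) + 1
        try:
            handler(msg["body"])
            processed_ids.add(msg["id"])
        except Exception:
            if msg["receives"] >= MAX_RECEIVES:
                # Retries exhausted: park the message for manual inspection.
                dlq.append(msg)
            else:
                # Put it back so it is redelivered later.
                queue.append(msg)
```

In production the processed-id set would need to be durable (for example, a table keyed by message id), since an in-process set is lost on restart.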

Data Processing

Processing Model: Simple worker-based processing for document validation.
Processing DAG:
S3 Upload -> Trigger Lambda -> Virus Scan (ClamAV) -> Update Doc Metadata Table.
Technical Selection: AWS Lambda. Cost-effective for event-driven document scanning without maintaining persistent servers.
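
The processing DAG can be sketched as a Lambda-style handler. The event layout (Records, s3.bucket.name, s3.object.key, with URL-encoded keys) follows the S3 event notification format; everything else is a hypothetical stand-in. The fetch, scan, and update_metadata callables are injected so the sketch runs without AWS or ClamAV, and the CLEAN/INFECTED labels are illustrative.

```python
from typing import Callable, Dict, List
from urllib.parse import unquote_plus


def make_handler(
    fetch: Callable[[str, str], bytes],
    scan: Callable[[bytes], bool],
    update_metadata: Callable[[str, str], None],
) -> Callable[..., List[Dict[str, str]]]:
    # fetch(bucket, key) returns object bytes, scan(data) returns True if the
    # file is clean, update_metadata(key, verdict) records the result.
    def handler(event: dict, context: object = None) -> List[Dict[str, str]]:
        results = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 events (spaces become '+').
            key = unquote_plus(record["s3"]["object"]["key"])
            verdict = "CLEAN" if scan(fetch(bucket, key)) else "INFECTED"
            update_metadata(key, verdict)
            results.append({"s3Key": key, "scanStatus": verdict})
        return results

    return handler
```

Injecting the three dependencies keeps the handler testable offline; in a deployed function they would wrap an S3 client, the virus scanner, and a write to the document metadata table.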

Infrastructure (Optional)

Observability:
Prometheus/Grafana for monitoring QPS and Error rates.
ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging (crucial for financial auditing).
Platform Security:
Encryption at rest for RDS and S3 using AWS KMS.
All PII fields in PostgreSQL are encrypted at the application layer using AES-GCM before storage.
Wrap Up

Advanced Topics

Trade-offs: We chose PostgreSQL over NoSQL. While NoSQL scales horizontally better, the relational nature of mortgage data (Agents -> Loans -> Documents -> Status Logs) and the need for strict ACID transactions make SQL the superior choice for an MVP.
Reliability & Failure Handling: The "Integration Worker" is the most vulnerable point. We use a Saga Pattern (Choreography-based) to manage the multi-step process of lead creation -> credit check -> LOS submission.
Security & Privacy: We implement "Field Level Encryption" for SSNs and financial figures. Even with DB access, an attacker cannot read PII without access to the KMS key, which is held in a separate AWS account with its own access policy.
Distinguishing Insights: For a Staff-level design, we emphasize Auditability. Every change to a loan status is recorded in an audit_log table (Immutable Append-Only) to satisfy compliance requirements for financial lending.
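
The append-only audit_log idea can be sketched with Python's built-in sqlite3 standing in for PostgreSQL. The column names, trigger names, and helper functions are assumptions made for this example. In PostgreSQL the same intent would more likely be expressed by revoking UPDATE and DELETE privileges on the table, possibly combined with triggers.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    loan_id TEXT NOT NULL,
    old_status TEXT,
    new_status TEXT NOT NULL,
    actor TEXT NOT NULL,
    recorded_at TEXT NOT NULL
);
CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def open_audit_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_status_change(
    conn: sqlite3.Connection,
    loan_id: str,
    old_status: Optional[str],
    new_status: str,
    actor: str,
) -> None:
    # Inserts are the only write the triggers above let through.
    with conn:
        conn.execute(
            "INSERT INTO audit_log "
            "(loan_id, old_status, new_status, actor, recorded_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (loan_id, old_status, new_status, actor,
             datetime.now(timezone.utc).isoformat()),
        )
```

Any UPDATE or DELETE against audit_log then fails with a database error, while inserts continue to succeed, which is the property an auditor would want to verify.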