The Question
Design

Secure Mortgage Application Management System

Design a highly secure, scalable platform for mortgage agents to manage leads, track application lifecycles, and handle sensitive document collection. The system must prioritize data integrity for financial records and comply with PII protection standards while ensuring efficient background processing for document verification.
PostgreSQL
S3
SQS
KMS
RBAC
OAuth2
OCR
Questions & Insights

Clarifying Questions

What is the target scale for the MVP? (e.g., number of agents, active mortgage applications per month, and document volume).
What are the core functional boundaries? (e.g., Is this a full CRM, or specifically for document collection, lead tracking, and lender communication?).
What are the regulatory and compliance requirements? (e.g., SOC2, GDPR, or specific financial data residency laws for PII/SSN handling).
Does the app need to integrate with external credit bureaus or lender APIs in the MVP?
What is the expected read/write ratio? (Usually read-heavy for dashboards, write-heavy during document upload phases).
Assumptions for MVP:
Scale: 10,000 agents, 50,000 active applications, ~10 documents per application.
Scope: Lead management, secure document upload/storage, application status tracking, and basic internal messaging.
Security: High priority for PII (Personally Identifiable Information). Data must be encrypted at rest and in transit.
Integrations: External integrations are handled via webhooks or manual status updates for the MVP.

Thinking Process

Core Bottleneck: Secure handling and processing of high-stakes legal documents (PII) while maintaining a responsive UI for agents.
Progressive Strategy:
How do we secure PII? Establish a robust AuthN/AuthZ layer and an encrypted storage strategy for documents and sensitive DB fields.
How do we manage complex state? Use a relational database with a state machine to track mortgage application progress (Lead -> Application -> Underwriting -> Closed).
How do we handle heavy document processing? Use an asynchronous worker pattern to perform virus scanning and OCR without blocking the main API.
How do we ensure auditability? Implement a sidecar or middleware-based audit log to track every access to sensitive data.
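The state-machine idea from the strategy above can be sketched as a small transition table. This is a minimal illustration, assuming only the four stages named in the text and strictly forward movement; the names `Stage`, `ALLOWED`, and `transition` are hypothetical, and a real workflow would likely add rejection or withdrawal paths.

```python
from enum import Enum


class Stage(Enum):
    LEAD = "lead"
    APPLICATION = "application"
    UNDERWRITING = "underwriting"
    CLOSED = "closed"


# Assumed transition table: each stage may only move to the next one.
ALLOWED = {
    Stage.LEAD: {Stage.APPLICATION},
    Stage.APPLICATION: {Stage.UNDERWRITING},
    Stage.UNDERWRITING: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested stage change is not in the table."""


def transition(current: Stage, target: Stage) -> Stage:
    # Reject anything the table does not explicitly allow.
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

Keeping the table as data means the same rules can be enforced in the API and mirrored in a database constraint.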

Bonus Points

Bitemporal Data Modeling: Storing not just when a record was updated, but the "valid time" (when the change occurred in the real world) versus "system time" for regulatory audit trails.
Field-Level Encryption (FLE): Encrypting specific columns (like SSN) using a per-user or per-application data key before it ever hits the database.
Presigned URLs: Utilizing S3 Presigned URLs for direct-to-cloud uploads to minimize server load and improve security by bypassing the application layer for large binaries.
Zero-Trust Internal Mesh: Using mTLS between the API and workers to ensure that even internal traffic is authenticated and encrypted.
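As a rough illustration of the bitemporal idea, the sketch below keeps an append-only list of status facts and answers the question "what status was in effect at a given real-world time, as the system knew it at a given system time". It is a simplified in-memory model, not a storage design; `StatusFact`, `status_as_of`, and the field names are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class StatusFact:
    application_id: str
    status: str
    valid_from: datetime   # valid time: when the change took effect in the real world
    recorded_at: datetime  # system time: when the system recorded the change


def status_as_of(
    facts: Iterable[StatusFact],
    application_id: str,
    valid_at: datetime,
    known_at: datetime,
) -> Optional[str]:
    """Return the status in effect at `valid_at`, using only facts
    that had been recorded by `known_at`."""
    candidates = [
        f for f in facts
        if f.application_id == application_id
        and f.valid_from <= valid_at
        and f.recorded_at <= known_at
    ]
    if not candidates:
        return None
    # Latest valid time wins; ties go to the most recently recorded fact.
    best = max(candidates, key=lambda f: (f.valid_from, f.recorded_at))
    return best.status


facts = [
    StatusFact("app-1", "underwriting", datetime(2024, 3, 1), datetime(2024, 3, 1)),
    # Back-dated entry: effective on the 10th, but only recorded on the 15th.
    StatusFact("app-1", "closed", datetime(2024, 3, 10), datetime(2024, 3, 15)),
]
print(status_as_of(facts, "app-1", datetime(2024, 3, 12), datetime(2024, 3, 12)))
print(status_as_of(facts, "app-1", datetime(2024, 3, 12), datetime(2024, 3, 16)))
```

The two queries ask about the same real-world moment but differ in system time, which is the distinction an auditor needs when reconstructing what the platform knew on a given day.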
Design Breakdown

Functional Requirements

Lead Management: Agents can create, update, and track leads.
Application Workflow: Move applications through stages (Pre-approval, Submission, etc.).
Secure Document Vault: Upload, view, and organize PDFs/Images (e.g., paystubs, tax returns).
Notifications: Notify agents of status changes or missing documents.
Audit Logging: Track who viewed or edited any part of an application.

Non-Functional Requirements

Security: Encryption at rest/transit; PII protection; Role-Based Access Control (RBAC).
Durability: 99.999999999% durability for uploaded legal documents.
Availability: 99.9% uptime (Standard Business Hours critical).
Consistency: Strong consistency for application status and financial data.

Estimation

Users: 10k agents.
Storage: 50k applications × 10 docs × 5 MB/doc = ~2.5 TB.
QPS: Peak 100 requests/sec (Low volume, high value).
Bandwidth: 100 agents uploading 5 MB docs simultaneously = 500 MB in flight, or roughly 500 MB/s if every upload is to finish within one second (needs direct-to-S3 upload).
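The storage and burst figures above can be checked with a few lines of arithmetic. The inputs come from the stated assumptions; decimal units (1 TB = 1,000,000 MB) are assumed here.

```python
applications = 50_000
docs_per_application = 10
doc_size_mb = 5

# Total document storage for the active application set.
total_mb = applications * docs_per_application * doc_size_mb
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

# Payload generated when 100 agents each upload one document at once.
concurrent_uploads = 100
burst_mb = concurrent_uploads * doc_size_mb

print(f"storage: {total_tb} TB")
print(f"burst payload: {burst_mb} MB")
```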

Blueprint

Concise Summary: A secure, monolithic-first API backed by a relational database for state management and an object store for documents, using an asynchronous worker for background processing.
Major Components:
API Gateway: Handles rate limiting, SSL termination, and initial authentication.
Mortgage Service: Core business logic for lead management, application state transitions, and metadata storage.
Relational Database: Stores structured application data, user profiles, and audit logs.
Object Storage: Securely stores document binaries (PDFs/Images).
Message Queue: Decouples document processing tasks (scanning/OCR) from the main request flow.
Document Worker: Performs background tasks like malware scanning, OCR, and thumbnail generation.
Simplicity Audit: This architecture avoids microservices overhead and caching layers that are unnecessary for 100 QPS, focusing instead on data integrity and security.
Architecture Decision Rationale:
Why this architecture fits the problem: Mortgage data is highly relational and requires ACID transactions for status updates.
Functional Requirement Satisfaction: Handles lead/application lifecycle via Postgres and document management via S3.
Non-functional Requirement Satisfaction: Uses S3 for durability, RDS Multi-AZ for availability, and SQS for resilient background processing.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Stateless API nodes deployed across Multi-AZ (Availability Zones).
Scaling Signal: CPU and Request Latency (Target < 200ms).
Load Balancing: L7 (ALB) to handle path-based routing and session stickiness if needed.
API Schema Design:
POST /v1/applications: Create new mortgage file.
GET /v1/applications/{id}/docs/upload-url: Returns S3 Presigned URL (Idempotent).
PATCH /v1/applications/{id}/status: Transition application state (Requires RBAC check).
Protocols: REST/JSON over HTTPS.
Resilience & Reliability:
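Circuit breakers around the Document Worker's dependencies (virus scanner, OCR), so that a failing dependency pauses consumption and leaves the backlog in the queue instead of exhausting retries.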
Exponential backoff for S3/DB connections.
Observability:
RED metrics (Rate, Error, Duration) via Prometheus.
Audit logs specifically for PII access (Structured logs sent to CloudWatch/ELK).
Security:
OAuth2 with JWT for session management.
RBAC: Agent, Manager, Client roles.
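The RBAC check required by PATCH /v1/applications/{id}/status could look roughly like the following. The three roles come from the list above; the permission names, the role-to-permission mapping, and the functions `require` and `patch_status` are assumptions for illustration, not a prescribed policy.

```python
from enum import Enum


class Role(Enum):
    AGENT = "agent"
    MANAGER = "manager"
    CLIENT = "client"


# Hypothetical policy: clients may read and upload, but never change status.
PERMISSIONS = {
    Role.AGENT: {"application:read", "application:transition", "document:upload"},
    Role.MANAGER: {"application:read", "application:transition", "document:upload"},
    Role.CLIENT: {"application:read", "document:upload"},
}


class Forbidden(Exception):
    """Raised when the caller's role lacks the required permission."""


def require(role: Role, permission: str) -> None:
    if permission not in PERMISSIONS.get(role, set()):
        raise Forbidden(f"{role.value} lacks {permission}")


def patch_status(role: Role, application_id: str, new_status: str) -> dict:
    # The authorization check runs before any state is touched.
    require(role, "application:transition")
    return {"id": application_id, "status": new_status}
```

In practice the role would be read from a verified JWT claim rather than passed in directly, and ownership checks (is this the agent's own application?) would sit alongside the role check.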

Messaging

Purpose & Decoupling: Offloads virus scanning and OCR (Optical Character Recognition) from the API.
Event / Topic Schema:
document-uploaded: Contains document_id and s3_key.
Throughput & Partitioning: Standard SQS queue; throughput is low, so partitioning is not required.
Failure Handling: Dead-letter queue (DLQ) for documents that fail scanning/processing.
Technical Selection: AWS SQS. Low operational overhead and high reliability.
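One possible shape for the document-uploaded message body is sketched below. The fields `document_id` and `s3_key` come from the schema above; the `event_type` field and the class and method names are assumptions added for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DocumentUploaded:
    document_id: str
    s3_key: str
    event_type: str = "document-uploaded"  # assumed discriminator field

    def to_json(self) -> str:
        # Serialize to the string that would be sent as the queue message body.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, body: str) -> "DocumentUploaded":
        data = json.loads(body)
        if data.get("event_type") != "document-uploaded":
            raise ValueError(f"unexpected event type: {data.get('event_type')!r}")
        return cls(document_id=data["document_id"], s3_key=data["s3_key"])


body = DocumentUploaded(document_id="doc-123", s3_key="applications/app-1/doc-123.pdf").to_json()
event = DocumentUploaded.from_json(body)
```

The message carries only references to the document, so no PII travels through the queue itself.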

Data Processing

Processing Model: Event-driven asynchronous processing.
Processing DAG: S3 Upload -> SQS Trigger -> Virus Scan -> OCR/Thumbnail -> Update DB Status.
Correctness Guarantees: At-least-once delivery (idempotent DB updates handle duplicates).
Technical Selection: Custom Go or Python worker nodes. Simple to implement and scale based on queue depth.
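Because delivery is at-least-once, the worker's final database update has to tolerate duplicates. A minimal sketch of one way to do that uses a conditional UPDATE, shown here against an in-memory SQLite database so that it runs standalone. The table layout, the status values (`pending`, `verified`, `quarantined`), and the function names are assumptions; the production system would run the equivalent statement against PostgreSQL.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (document_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def handle_message(conn: sqlite3.Connection, document_id: str, scan_clean: bool) -> bool:
    """Apply the processing result once; a redelivered message is a no-op.

    Returns True if this call changed the row, False if it was a duplicate.
    """
    new_status = "verified" if scan_clean else "quarantined"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            # Only rows still in 'pending' are updated, so replays match nothing.
            "UPDATE documents SET status = ? WHERE document_id = ? AND status = 'pending'",
            (new_status, document_id),
        )
    return cur.rowcount == 1


conn = make_db()
conn.execute("INSERT INTO documents VALUES ('doc-1', 'pending')")
first = handle_message(conn, "doc-1", scan_clean=True)
second = handle_message(conn, "doc-1", scan_clean=True)  # simulated redelivery
```

The guard on the current status is what makes the update idempotent: the second delivery finds no pending row and leaves the stored result untouched.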
Wrap Up

Advanced Topics

Trade-offs: We chose strong consistency over availability (CP in CAP terms) for application states. It is better for an agent to wait 5 seconds than to have a mortgage file in an inconsistent state.
Reliability & Failure Handling: Document processing is decoupled. If the Worker fails, the API remains functional, though document previews may be delayed.
Security & Privacy: All PII in the DB is encrypted with keys managed by AWS KMS. The system uses least-privilege IAM roles.
Optimization: To handle "Hot Spots" (e.g., an agent with 1000 leads), we use database indexing on agent_id.
Distinguishing Insights:
Document Versioning: S3 versioning is enabled to prevent accidental overwrites of legal docs.
Webhook Support: Future-proofing the design to allow lenders to push status updates back to the app via a secure webhook endpoint.