The Question
ML Design

Scalable Enterprise Document Classification System

Design a high-scale document classification system capable of processing 10 million diverse documents (PDFs, images, emails) per day for an automated business workflow. The system must categorize documents into 50+ classes with a P99 latency under 500ms. Your design should address the full ML lifecycle including OCR integration, handling long-form text, class imbalance in training data, and a strategy for ensuring online model reliability and monitoring for concept drift.
Topics: DistilBERT, Transformers, XGBoost, OCR, Tesseract, FastAPI, Kafka, Spark, Quantization, SHAP
Questions & Insights

Clarifying Questions

Business Goal: What is the primary objective of the classification (e.g., routing support tickets, legal document discovery, or automated bookkeeping)?
Assumption: We are building an enterprise-grade classifier for an automated workflow system (e.g., sorting Invoices, Contracts, Resumes, and Emails) to reduce manual processing time.
Constraints & Scale: What is the document volume and latency requirement?
Assumption: 10 million documents per day (about 116 documents per second on average), ~200 requests per second (QPS) at peak, with a P99 latency budget of 500ms (including OCR if needed).
Data Characteristics: Are documents raw text, HTML, or scanned PDFs? How many classes?
Assumption: A mix of digital text and scanned PDFs. We have ~50 target categories.
Edge Cases: How do we handle multi-page documents or documents that fit into multiple categories?
Assumption: We will treat this as a multi-class problem (one label per doc) for the MVP. For long documents, we will focus on the first N pages or use a sliding window.
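The scale assumption above can be sanity-checked with a few lines of arithmetic. The sketch below converts the daily volume into an average rate and uses Little's law (in-flight requests = arrival rate x time in system) to estimate concurrency at peak. Treating the 500ms P99 budget as the time in system is a deliberately conservative assumption, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(docs_per_day: int) -> float:
    """Average arrival rate if traffic were spread evenly over 24 hours."""
    return docs_per_day / SECONDS_PER_DAY


def in_flight_requests(qps: float, latency_seconds: float) -> float:
    """Little's law: mean number of requests in the system at once."""
    return qps * latency_seconds


avg = average_qps(10_000_000)
peak_ratio = 200 / avg
concurrency = in_flight_requests(200, 0.5)

print(f"average rate: {avg:.1f} docs/s")
print(f"peak-to-average ratio at 200 QPS: {peak_ratio:.2f}x")
print(f"in-flight requests at peak (500ms each): {concurrency:.0f}")
```

With these inputs the peak is under twice the average, so capacity planning is driven more by per-document cost (OCR in particular) than by burstiness.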

Thinking Process

Identify the Bottleneck: In document classification, the bottleneck is often not the model itself, but the data ingestion and extraction (OCR). If a document is a scanned image, the cost of OCR dominates latency and compute.
Retrieval vs. Ranking: Unlike RecSys, this is a pure classification task. However, for a huge number of classes (e.g., 10,000 product categories), we might use a two-stage approach (Retrieval/Candidate Generation then Ranking). For 50 classes, a flat classifier is more efficient (YAGNI).
Complexity Trade-off: Deep learning (Transformers) provides high accuracy for semantic understanding but is expensive. A baseline of TF-IDF + XGBoost is often sufficient for keyword-heavy documents (like "Invoices"). I will start with a cheap TF-IDF baseline as a reference point and use a lightweight Transformer as the MVP model.
Scaling: We need an asynchronous processing pipeline for heavy documents and a synchronous API for small text snippets.

Elite Bonus Points

Layout-Aware Embeddings: Using models like LayoutLM that incorporate the 2D spatial coordinates of text tokens (bounding boxes) to distinguish a "Date" at the top of an invoice from one at the bottom of a contract.
Cold Start for New Classes: Implementing Few-Shot Learning or using LLM-based synthetic data generation to bootstrap classification for a new document type where historical labels don't exist.
OCR Quality-Aware Inference: Feeding OCR confidence scores into the model to handle "noisy" text from low-quality scans differently than high-fidelity digital text.
Cost-Optimized Inference: Implementing a Cascaded Inference strategy: run a cheap FastText model first; if the confidence is below 0.9, route to a heavy Transformer model.
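The cascaded inference idea can be sketched as a small router. This is a minimal illustration, not a production design: the two models are passed in as plain callables returning a (label, confidence) pair, and toy_cheap and toy_heavy are hypothetical stand-ins for a FastText model and a Transformer.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence)


def cascade(
    text: str,
    cheap_model: Callable[[str], Prediction],
    heavy_model: Callable[[str], Prediction],
    threshold: float = 0.9,
) -> Tuple[str, float, str]:
    """Return (label, confidence, which_model_answered)."""
    label, confidence = cheap_model(text)
    if confidence >= threshold:
        return label, confidence, "cheap"
    # The cheap model is unsure, so pay for the expensive one.
    label, confidence = heavy_model(text)
    return label, confidence, "heavy"


# Hypothetical stand-ins so the sketch runs without real models.
def toy_cheap(text: str) -> Prediction:
    return ("Invoice", 0.97) if "invoice" in text.lower() else ("Contract", 0.55)


def toy_heavy(text: str) -> Prediction:
    return ("Resume", 0.88)


print(cascade("Invoice #123 due in 30 days", toy_cheap, toy_heavy))
print(cascade("Curriculum vitae of a data engineer", toy_cheap, toy_heavy))
```

The saving depends on how often the cheap model clears the threshold, so the threshold is worth tuning on a held-out set rather than fixing at 0.9.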
Design Breakdown

Requirements

Product Goal: Automatically categorize incoming documents to trigger specific downstream business logic (e.g., paying an invoice).
Success Metrics:
Online Metrics: Accuracy (Top-1), Throughput, Reduction in manual routing.
Offline Metrics: Macro-F1 Score (to handle class imbalance), Precision/Recall per class.
Guardrail Metrics: P99 Latency, OCR Error Rate, Cost per classification.
System Constraints: Support 10M docs/day; handle PDF/PNG/JPG/TXT formats.
Data Availability: Historical labeled dataset of 1M documents; real-time stream of incoming files via S3/Kafka.
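To see why Macro-F1 is listed as the offline metric rather than plain accuracy, consider a toy example. The sketch below implements per-class F1 from scratch and evaluates a degenerate model that always predicts the majority class. The 98/2 class split is an invented illustration, not data from the system.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented example: the model always answers "Invoice".
y_true = ["Invoice"] * 98 + ["Legal Decree"] * 2
y_pred = ["Invoice"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"macro-F1: {macro_f1(y_true, y_pred):.2f}")
```

Accuracy looks excellent here while Macro-F1 drops to roughly one half, because the rare class contributes an F1 of zero.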

ML Problem Framing

ML Task Type: Multi-class Classification.
Prediction Target: $P(\text{class}_i \mid \text{document content})$.
Inputs:
Textual Features: Raw text extracted via OCR or PDF parsing.
Spatial Features: Bounding boxes of words (for scanned docs).
Metadata: File size, file extension, sender domain (if email).
Outputs: A probability distribution over 50 classes.
ML Challenges:
Class Imbalance: "Invoices" might be 100x more common than "Legal Decrees".
Long Context: Standard BERT accepts at most 512 tokens per input; legal docs can run 50+ pages.
Noisy Labels: Human annotators often disagree on document types.
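One common way to address the class imbalance challenge is to weight the training loss by inverse class frequency. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for class_weight="balanced". The 1000-to-10 split mirrors the 100x example above and is illustrative only.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Illustrative data: "Invoice" is 100x more common than "Legal Decree".
labels = ["Invoice"] * 1000 + ["Legal Decree"] * 10
weights = balanced_class_weights(labels)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The resulting weights differ by the same 100x factor as the class counts, so each class contributes equally to the total loss. These values could then be passed to a weighted cross-entropy loss.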

Design Summary & MVP

Concise Summary: An asynchronous pipeline that extracts text using OCR, generates embeddings via a DistilBERT model, and classifies documents using a Softmax head.
Model Architecture & Selection:
Baseline Model: TF-IDF + Logistic Regression (fast, interpretable).
Target Model: DistilBERT (Transformer).
Choice Rationale: Transformers capture semantic context better than N-grams (e.g., distinguishing "Statement of Work" from "Work Statement"). DistilBERT retains about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster, according to its authors.
Simplicity Audit: Hierarchical Attention Networks and LayoutLM are left out of the MVP unless the documents are predominantly visual. DistilBERT handles text-heavy enterprise docs efficiently.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Documents arrive via S3 upload events or Kafka streams (e.g., from an email ingestion worker).
Data Ingestion: Asynchronous processing is mandatory due to variable OCR time. We use a message queue (SQS/Kafka) to decouple ingestion from classification.
Data Storage: Raw files in S3. Extracted text and metadata in a NoSQL Store (DynamoDB/Cassandra) for fast retrieval during inference.
Data Processing:
OCR: For images/scans, use an OCR engine that can be scaled horizontally (e.g., Tesseract workers consuming from the queue). For digital PDFs, use pdfplumber to extract the embedded text directly (cheaper/faster).
Cleaning: Remove PII, normalize whitespace, and handle encoding issues.
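The cleaning step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: PII removal is reduced to a naive email regex (real redaction would need a dedicated PII detector), and the function name clean_text and the [EMAIL] placeholder are hypothetical.

```python
import re
import unicodedata
from typing import Union

# Naive pattern for illustration; it will miss many real-world PII forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: Union[bytes, str]) -> str:
    """Decode, normalize Unicode, redact emails, and collapse whitespace."""
    if isinstance(raw, bytes):
        # Replace undecodable bytes instead of failing the whole document.
        raw = raw.decode("utf-8", errors="replace")
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 -> "fi".
    text = unicodedata.normalize("NFKC", raw)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return WHITESPACE_RE.sub(" ", text).strip()


sample = b"Invoice\r\n\r\n  to: jane.doe@example.com\t Total"
print(clean_text(sample))
```

Running the same function in both the training and serving paths is one way to avoid preprocessing differences between the two.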

Feature Pipeline

Feature Engineering:
Textual: Sub-word tokenization (WordPiece) to handle out-of-vocabulary terms.
Structural: Page count, presence of tables, image-to-text ratio.
Online Feature Pipeline: Real-time tokenization and metadata lookup.
Offline Feature Pipeline: Batch job (Spark) to generate embeddings for the entire historical corpus to speed up training.
Training/Serving Skew: Use a shared Tokenizer Library and a shared Preprocessing Script in a Docker container to ensure training data matches production inputs exactly.

Model Architecture

Problem Formulation: Supervised Multi-class Classification.
Architecture Design:
Backbone: DistilBERT (6-layer Transformer).
Pooling Layer: Use the [CLS] token embedding as the document representation.
Output Layer: Fully connected layer with 50 neurons and Softmax activation.
Handling Long Documents:
MVP Approach: Truncate to the first 512 tokens. In most business docs, the intent is in the first page (header/title).
Advanced Approach: Sliding window with max-pooling over embeddings of different sections.
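Optimization: Post-training quantization to reduce inference latency: INT8 (e.g., dynamic quantization of the linear layers) on CPUs, or FP16 on GPUs, where half precision is hardware-accelerated.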
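The sliding-window approach for long documents can be sketched independently of any particular model. In the illustration below, encode is a placeholder for a function that maps one window of token ids to a fixed-size embedding (for example, the [CLS] vector). The stride of 256 and the toy encoder are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int], window: int = 512, stride: int = 256) -> List[List[int]]:
    """Split token ids into overlapping windows that cover the whole sequence."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows, start = [], 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break
        start += stride
    return windows


def max_pool(embeddings: List[List[float]]) -> List[float]:
    """Element-wise maximum across window embeddings."""
    return [max(values) for values in zip(*embeddings)]


def embed_document(
    token_ids: Sequence[int],
    encode: Callable[[List[int]], List[float]],
    window: int = 512,
    stride: int = 256,
) -> List[float]:
    """Encode each window, then pool the results into one document vector."""
    return max_pool([encode(w) for w in sliding_windows(token_ids, window, stride)])


# Toy encoder for demonstration only: a 2-dimensional "embedding".
def toy_encode(ids: List[int]) -> List[float]:
    return [float(min(ids)), float(max(ids))]


print(sliding_windows(list(range(10)), window=4, stride=2))
print(embed_document(list(range(10)), toy_encode, window=4, stride=2))
```

With a real BERT-style encoder the window would need to leave room for the special tokens, so the usable content length is slightly below 512. Inference cost also grows linearly with the number of windows, which matters against the 500ms budget.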

Training Pipeline

Dataset Construction: Use Stratified Sampling to ensure minority classes are represented. Apply Data Augmentation (synonym replacement, back-translation) for rare document types.
Data Splitting: 80/10/10 split. Ensure no "customer leakage" (documents from the same sender should not be in both train and test).
Retraining: Triggered monthly or when Data Drift is detected (e.g., a new invoice format from a major supplier).
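The "no customer leakage" rule in the data split can be enforced by assigning whole senders, rather than individual documents, to a split. The sketch below hashes the sender identifier into a stable bucket; the function name and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib


def assign_split(sender: str, train: float = 0.8, val: float = 0.1) -> str:
    """Map a sender to train/val/test deterministically via a hash bucket in [0, 1)."""
    digest = hashlib.sha256(sender.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


# Every document from the same sender lands in the same split.
documents = [
    ("billing@supplier-a.example", "Invoice 001"),
    ("billing@supplier-a.example", "Invoice 002"),
    ("hr@company-b.example", "Resume of J. Doe"),
]
for sender, title in documents:
    print(f"{title}: {assign_split(sender)}")
```

Because the split is per sender, the 80/10/10 proportions hold only approximately at the document level, and this scheme does not stratify by class. Rare classes may still need a manual check that they appear in every split.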

Serving Pipeline

Pattern: Request-Response via a FastAPI/gRPC service.
Latency Optimization:
Batching: Group incoming requests (size 8-16) to leverage GPU/CPU vectorization.
Caching: MD5 hash the extracted text; if we see the exact same text again, return the cached result.
Reliability: If the Transformer service times out, fallback to a Heuristic Keyword Model (e.g., if "Invoice" appears 3 times, classify as Invoice).
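The caching and fallback behaviour described above can be combined in a thin wrapper around the model call. This is a minimal single-process sketch: the model is any callable that may raise TimeoutError, the cache is an in-memory dict rather than a shared store, and the class and function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict


def keyword_fallback(text: str, min_hits: int = 3, default: str = "Miscellaneous") -> str:
    """Heuristic from the text: 'invoice' appearing 3+ times means Invoice."""
    return "Invoice" if text.lower().count("invoice") >= min_hits else default


class CachedClassifier:
    def __init__(self, model: Callable[[str], str]) -> None:
        self._model = model
        self._cache: Dict[str, str] = {}

    def classify(self, text: str) -> str:
        # MD5 serves only as a cache key here, not as a security measure.
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]
        try:
            label = self._model(text)
        except TimeoutError:
            # Degraded answer; not cached so the model gets another chance later.
            return keyword_fallback(text)
        self._cache[key] = label
        return label


def always_times_out(text: str) -> str:
    raise TimeoutError("model service did not respond")


classifier = CachedClassifier(always_times_out)
print(classifier.classify("Invoice 42: invoice total, see invoice terms"))
print(classifier.classify("Dear hiring manager"))
```

Leaving fallback results out of the cache is a design choice in this sketch: it keeps a temporary outage from pinning low-quality labels to documents that the model could classify properly later.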

Evaluation Pipeline

Offline: Use Confusion Matrix to identify which classes are being confused (e.g., "Contract" vs. "Amendment").
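Online: A small percentage of "High Confidence" predictions is sampled for human verification to estimate "Production Accuracy." Because this sample excludes low-confidence cases, the estimate tends to overstate accuracy across all traffic.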
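The confusion-matrix analysis in the offline step amounts to counting (true, predicted) pairs and ranking the off-diagonal ones. The sketch below does this with a Counter; the labels and counts are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(y_true: List[str], y_pred: List[str]) -> Counter:
    """Sparse confusion matrix: (true_label, predicted_label) -> count."""
    return Counter(zip(y_true, y_pred))


def top_confusions(y_true: List[str], y_pred: List[str], k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """The k most frequent misclassifications (off-diagonal cells)."""
    errors = Counter(
        {pair: n for pair, n in confusion_counts(y_true, y_pred).items() if pair[0] != pair[1]}
    )
    return errors.most_common(k)


# Invented evaluation set: contracts and amendments get mixed up.
y_true = ["Contract"] * 5 + ["Amendment"] * 3 + ["Invoice"] * 2
y_pred = (
    ["Contract"] * 3 + ["Amendment"] * 2   # 2 contracts predicted as amendments
    + ["Amendment"] * 2 + ["Contract"]     # 1 amendment predicted as contract
    + ["Invoice"] * 2                      # invoices all correct
)

for (truth, pred), count in top_confusions(y_true, y_pred):
    print(f"{truth} -> {pred}: {count}")
```

With 50 classes the full matrix has 2,500 cells, so a ranked list of the worst pairs is usually easier to act on than the matrix itself.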

Monitoring Pipeline

System: Monitor OCR failure rates and inference latency.
Model: Track Prediction Drift. If the model suddenly starts classifying 80% of docs as "Unknown," trigger an alert.
Data: Monitor the distribution of document lengths and languages.
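Prediction drift can be quantified by comparing the distribution of predicted classes in a recent window against a baseline. The sketch below uses the Population Stability Index (PSI), one common choice. The 0.25 alert threshold is a widely used rule of thumb rather than a value from this design, and the class counts are invented.

```python
import math
from collections import Counter
from typing import List, Sequence


def class_distribution(labels: Sequence[str], classes: Sequence[str], eps: float = 1e-6) -> List[float]:
    """Smoothed share of each class, so unseen classes do not produce log(0)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    raw = [counts[c] / len(labels) + eps for c in classes]
    total = sum(raw)
    return [p / total for p in raw]


def psi(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over classes."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


CLASSES = ["Invoice", "Contract", "Unknown"]
baseline = class_distribution(["Invoice"] * 70 + ["Contract"] * 20 + ["Unknown"] * 10, CLASSES)
current = class_distribution(["Invoice"] * 10 + ["Contract"] * 10 + ["Unknown"] * 80, CLASSES)

ALERT_THRESHOLD = 0.25  # rule-of-thumb value, tune per deployment
score = psi(baseline, current)
print(f"PSI = {score:.2f}, alert = {score > ALERT_THRESHOLD}")
```

A spike in the "Unknown" share, as in this toy window, pushes the score far above the threshold. The same function can be applied to binned document lengths or detected languages for the data-level checks.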
Wrap Up

Final Evaluation

Explainability: Use SHAP or LIME to explain individual predictions (e.g., "Why was this marked as a Resume?"). This is critical for business trust.
Edge Cases:
Cold Start: Use a "Miscellaneous" category for low-confidence scores.
Multi-page: Process the first and last page separately and concatenate embeddings.
Trade-offs:
Accuracy vs. Latency: We chose DistilBERT over BERT-Large to save 300ms of CPU time per request.
Complexity vs. Maintainability: Avoided multi-modal LayoutLM for MVP to keep the OCR-to-Feature pipeline simple.