The Question

Scalable Enterprise Document Classification System

Design a high-scale document classification system capable of processing 10 million diverse documents (PDFs, images, emails) per day for an automated business workflow. The system must categorize documents into 50+ classes with a P99 latency under 500ms. Your design should address the full ML lifecycle including OCR integration, handling long-form text, class imbalance in training data, and a strategy for ensuring online model reliability and monitoring for concept drift.

DistilBERT

Transformers

XGBoost

OCR

Tesseract

FastAPI

Kafka

Spark

Quantization

SHAP

Questions & Insights

Clarifying Questions

Business Goal: What is the primary objective of the classification (e.g., routing support tickets, legal document discovery, or automated bookkeeping)?

Assumption: We are building an enterprise-grade classifier for an automated workflow system (e.g., sorting Invoices, Contracts, Resumes, and Emails) to reduce manual processing time.

Constraints & Scale: What is the document volume and latency requirement?

Assumption: 10 million documents per day (roughly 116 per second on average), ~200 requests per second (QPS) at peak, with a P99 latency budget of 500ms (including OCR if needed).

Data Characteristics: Are documents raw text, HTML, or scanned PDFs? How many classes?

Assumption: A mix of digital text and scanned PDFs. We have ~50 target categories.

Edge Cases: How do we handle multi-page documents or documents that fit into multiple categories?

Assumption: We will treat this as a multi-class problem (one label per doc) for the MVP. For long documents, we will focus on the first N pages or use a sliding window.
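The throughput assumption above (10 million documents per day, ~200 QPS at peak) can be sanity-checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and not part of the original design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(docs_per_day: int) -> float:
    """Average request rate if traffic were spread evenly over the day."""
    return docs_per_day / SECONDS_PER_DAY


def peak_to_average(peak_qps: float, docs_per_day: int) -> float:
    """How much burst headroom the stated peak implies over the average."""
    return peak_qps / average_qps(docs_per_day)


avg = average_qps(10_000_000)
ratio = peak_to_average(200, 10_000_000)
print(f"average QPS: {avg:.1f}")          # about 115.7
print(f"peak / average: {ratio:.2f}")     # about 1.73
```

A peak of ~200 QPS therefore assumes bursts of less than 2x the daily average. If traffic is concentrated in business hours, the real peak may be higher and the queue has to absorb the difference.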

Thinking Process

Identify the Bottleneck: In document classification, the bottleneck is often not the model itself, but the data ingestion and extraction (OCR). If a document is a scanned image, the cost of OCR dominates latency and compute.

Retrieval vs. Ranking: Unlike RecSys, this is a pure classification task. However, for a huge number of classes (e.g., 10,000 product categories), we might use a two-stage approach (Retrieval/Candidate Generation then Ranking). For 50 classes, a flat classifier is more efficient (YAGNI).

Complexity Trade-off: Deep learning (Transformers) provides high accuracy for semantic understanding but is expensive. A baseline of TF-IDF + XGBoost is often sufficient for keyword-heavy documents (like "Invoices"). I will start with a two-tier approach: a cheap TF-IDF baseline for comparison and a lightweight Transformer as the MVP model.

Scaling: We need an asynchronous processing pipeline for heavy documents and a synchronous API for small text snippets.

Elite Bonus Points

Layout-Aware Embeddings: Using models like LayoutLM that incorporate 2D spatial coordinates of text tokens (bounding boxes) to distinguish between a "Date" at the top of an invoice vs. the bottom of a contract.

Cold Start for New Classes: Implementing Few-Shot Learning or using LLM-based synthetic data generation to bootstrap classification for a new document type where historical labels don't exist.

OCR Quality-Aware Inference: Feeding OCR confidence scores into the model to handle "noisy" text from low-quality scans differently than high-fidelity digital text.

Cost-Optimized Inference: Implementing a Cascaded Inference strategy: run a cheap FastText model first; if the confidence is below 0.9, route to a heavy Transformer model.
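The cascaded inference idea above can be sketched as follows. The two models are stand-in callables that return a label-to-probability mapping; `cheap_model`, `heavy_model`, and the stub behaviour are assumptions for illustration, and only the 0.9 threshold comes from the text.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]           # label -> probability
Model = Callable[[str], Scores]     # hypothetical model interface


def top_label(scores: Scores) -> Tuple[str, float]:
    """Return the highest-probability label and its probability."""
    label = max(scores, key=scores.get)
    return label, scores[label]


def cascaded_predict(
    text: str, cheap_model: Model, heavy_model: Model, threshold: float = 0.9
) -> Tuple[str, float, str]:
    """Use the cheap model when it is confident, otherwise escalate."""
    label, confidence = top_label(cheap_model(text))
    if confidence >= threshold:
        return label, confidence, "cheap"
    label, confidence = top_label(heavy_model(text))
    return label, confidence, "heavy"


# Stub models standing in for FastText and a Transformer.
def cheap_stub(text: str) -> Scores:
    if "invoice" in text.lower():
        return {"Invoice": 0.97, "Contract": 0.03}
    return {"Invoice": 0.55, "Contract": 0.45}


def heavy_stub(text: str) -> Scores:
    return {"Invoice": 0.08, "Contract": 0.92}


print(cascaded_predict("Invoice #1042, amount due", cheap_stub, heavy_stub))
print(cascaded_predict("This agreement is entered into by", cheap_stub, heavy_stub))
```

The cost saving depends on how often the cheap model clears the threshold, and the threshold is only meaningful if the cheap model's probabilities are reasonably calibrated.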

Design Breakdown

Requirements

Product Goal: Automatically categorize incoming documents to trigger specific downstream business logic (e.g., paying an invoice).

Success Metrics:

Online Metrics: Accuracy (Top-1), Throughput, Reduction in manual routing.

Offline Metrics: Macro-F1 Score (to handle class imbalance), Precision/Recall per class.

Guardrail Metrics: P99 Latency, OCR Error Rate, Cost per classification.

System Constraints: Support 10M docs/day; handle PDF/PNG/JPG/TXT formats.

Data Availability: Historical labeled dataset of 1M documents; real-time stream of incoming files via S3/Kafka.
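To illustrate why Macro-F1 is listed as the offline metric for imbalanced classes, here is a small pure-Python sketch. The toy labels are invented for the example; a degenerate model that always predicts the majority class scores well on accuracy but poorly on Macro-F1.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute F1 for every class seen in either the truth or the predictions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: 8 invoices, 2 legal decrees; the model predicts "Invoice" for everything.
y_true = ["Invoice"] * 8 + ["Legal Decree"] * 2
y_pred = ["Invoice"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")                    # 0.80
print(f"macro-F1: {macro_f1(y_true, y_pred):.2f}")    # 0.44
```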

ML Problem Framing

ML Task Type: Multi-class Classification.

Prediction Target:

P(\text{class}_i \mid \text{document content})

Inputs:

Textual Features: Raw text extracted via OCR or PDF parsing.

Spatial Features: Bounding boxes of words (for scanned docs).

Metadata: File size, file extension, sender domain (if email).

Outputs: A probability distribution over 50 classes.

ML Challenges:

Class Imbalance: "Invoices" might be 100x more common than "Legal Decrees".

Long Context: Standard BERT accepts at most 512 tokens per input; legal documents can run 50+ pages.

Noisy Labels: Human annotators often disagree on document types.
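One common way to address the class-imbalance challenge above is to weight the loss by inverse class frequency. The sketch below uses the "balanced" heuristic w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. The counts are invented, and this is one option rather than the method the design prescribes.

```python
from collections import Counter
from typing import Dict, Iterable


def balanced_class_weights(labels: Iterable[str]) -> Dict[str, float]:
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy corpus mirroring the 100x imbalance mentioned above.
labels = ["Invoice"] * 1000 + ["Legal Decree"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a misclassified "Legal Decree" contributes 100x more to the loss than a misclassified "Invoice", which offsets the 100x difference in frequency.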

Design Summary & MVP

Concise Summary: An asynchronous pipeline that extracts text using OCR, generates embeddings via a DistilBERT model, and classifies documents using a Softmax head.

Model Architecture & Selection:

Baseline Model: TF-IDF + Logistic Regression (fast, interpretable).

Target Model: DistilBERT (Transformer).

Choice Rationale: Transformers capture semantic context better than N-grams (e.g., distinguishing "Statement of Work" from "Work Statement"). DistilBERT is reported to retain about 97% of BERT's language-understanding performance while being 40% smaller and about 60% faster.

Simplicity Audit: We avoid Hierarchical Attention Networks and LayoutLM for the MVP unless the data turns out to be predominantly visual. DistilBERT handles text-heavy enterprise docs efficiently.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Documents arrive via S3 upload events or Kafka streams (e.g., from an email ingestion worker).

Data Ingestion: Asynchronous processing is mandatory due to variable OCR time. We use a message queue (SQS/Kafka) to decouple ingestion from classification.

Data Storage: Raw files in S3. Extracted text and metadata in a NoSQL Store (DynamoDB/Cassandra) for fast retrieval during inference.

Data Processing:

OCR: For images/scans, use a horizontally scalable OCR engine (e.g., a pool of Tesseract workers fed from the queue). For digital PDFs, use pdfplumber to extract the embedded text layer directly (cheaper/faster).

Cleaning: Remove PII, normalize whitespace, and handle encoding issues.
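The cleaning step above might look like the following minimal sketch. It handles encoding problems, Unicode normalization, email redaction, and whitespace. The `[EMAIL]` placeholder and the regex are assumptions for illustration; real PII removal needs far more than one pattern (names, account numbers, addresses).

```python
import re
import unicodedata
from typing import Union

# Deliberately simple email pattern; an assumption, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: Union[bytes, str]) -> str:
    """Decode, normalize, redact emails, and collapse whitespace."""
    if isinstance(raw, bytes):
        # Replace undecodable bytes instead of failing the whole document.
        raw = raw.decode("utf-8", errors="replace")
    # NFKC folds ligatures and non-breaking spaces that OCR and PDFs often emit.
    text = unicodedata.normalize("NFKC", raw)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_text("Invoice\u00a0 #42\n\n contact: bob@example.com "))
print(clean_text("\ufb01nal notice"))
```

Keeping this function in one shared module, imported by both the training job and the serving path, is a simple way to make both sides see identical text.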

Feature Pipeline

Feature Engineering:

Textual: Sub-word tokenization (WordPiece) to handle out-of-vocabulary terms.

Structural: Page count, presence of tables, image-to-text ratio.

Online Feature Pipeline: Real-time tokenization and metadata lookup.

Offline Feature Pipeline: Batch job (Spark) to generate embeddings for the entire historical corpus to speed up training.

Training/Serving Skew: Use a shared Tokenizer Library and a shared Preprocessing Script in a Docker container to ensure training data matches production inputs exactly.

Model Architecture

Problem Formulation: Supervised Multi-class Classification.

Architecture Design:

Backbone: DistilBERT (6-layer Transformer).

Pooling Layer: Use the [CLS] token embedding as the document representation.

Output Layer: Fully connected layer with 50 neurons and Softmax activation.

Handling Long Documents:

MVP Approach: Truncate to the first 512 tokens (a budget that includes the special [CLS] and [SEP] tokens). In most business docs, the intent is on the first page (header/title).

Advanced Approach: Sliding window with max-pooling over embeddings of different sections.

Optimization: Post-training quantization to reduce inference latency: INT8 (e.g., dynamic quantization of the linear layers) on CPUs, FP16 on GPUs.
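The sliding-window approach with max-pooling described above can be sketched in plain Python. The `encode` callable stands in for the model forward pass that returns one embedding per window. It is hypothetical, as are the window and stride values, and a real setup would reserve two positions per window for [CLS] and [SEP].

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sliding_windows(token_ids: Sequence[int], window: int = 512, stride: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows that cover every token."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows, start = [], 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window reached the end of the document
        start += stride
    return windows


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum across window embeddings."""
    return [max(dims) for dims in zip(*vectors)]


def embed_long_document(
    token_ids: Sequence[int],
    encode: Callable[[List[int]], Vector],
    window: int = 512,
    stride: int = 256,
) -> Vector:
    """Encode each window independently, then max-pool into one document vector."""
    return max_pool([encode(w) for w in sliding_windows(token_ids, window, stride)])


# Toy encoder: embedding = [window length, largest token id in the window].
def toy_encode(w: List[int]) -> Vector:
    return [float(len(w)), float(max(w))]


print(sliding_windows(list(range(10)), window=4, stride=2))
print(embed_long_document(list(range(10)), toy_encode, window=4, stride=2))
```

Inference cost grows linearly with the number of windows, so a 50-page document may need a cap on windows to stay within the latency budget.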

Training Pipeline

Dataset Construction: Use Stratified Sampling to ensure minority classes are represented. Apply Data Augmentation (synonym replacement, back-translation) for rare document types.

Data Splitting: 80/10/10 train/validation/test split. Ensure no "customer leakage" (documents from the same sender should not appear in both train and test).

Retraining: Triggered monthly or when Data Drift is detected (e.g., a new invoice format from a major supplier).
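One simple way to enforce the "no customer leakage" rule above is to assign the split from a hash of the sender rather than per document. This is a sketch under the assumption that each document carries a sender identifier; the function name and bucket scheme are illustrative.

```python
import hashlib


def assign_split(sender: str, train_pct: int = 80, val_pct: int = 10) -> str:
    """Deterministically map a sender to train/val/test so all of its docs stay together."""
    normalized = sender.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "val"
    return "test"


docs = [
    {"id": 1, "sender": "billing@acme.example"},
    {"id": 2, "sender": "Billing@Acme.example "},
    {"id": 3, "sender": "hr@globex.example"},
]
for doc in docs:
    print(doc["id"], assign_split(doc["sender"]))
```

The 80/10/10 ratio then holds approximately over senders, not over documents, so a few very large senders can skew the split sizes. It is worth checking per-class counts in each split afterwards.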

Serving Pipeline

Pattern: Request-Response via a FastAPI/gRPC service.

Latency Optimization:

Batching: Group incoming requests (size 8-16) to leverage GPU/CPU vectorization.

Caching: MD5 hash the extracted text; if we see the exact same text again, return the cached result.

Reliability: If the Transformer service times out, fall back to a Heuristic Keyword Model (e.g., if "Invoice" appears 3 times, classify as Invoice).
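The caching and fallback behaviour described above can be combined in one request handler. In this sketch the cache is an in-memory dict and `model` is any callable that may raise `TimeoutError`. Both are assumptions for illustration; a production service would more likely use a shared cache and an RPC client with a deadline.

```python
import hashlib
from typing import Callable, Dict, Tuple


def text_key(text: str) -> str:
    """Cache key: MD5 of the extracted text (used for deduplication, not security)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def keyword_fallback(text: str, min_hits: int = 3) -> str:
    """Heuristic model: enough mentions of 'invoice' means Invoice."""
    if text.lower().count("invoice") >= min_hits:
        return "Invoice"
    return "Miscellaneous"


def classify(text: str, model: Callable[[str], str], cache: Dict[str, str]) -> Tuple[str, str]:
    """Return (label, source) where source is 'cache', 'model', or 'fallback'."""
    key = text_key(text)
    if key in cache:
        return cache[key], "cache"
    try:
        label = model(text)
    except TimeoutError:
        # Do not cache fallback results, so the real model gets another chance later.
        return keyword_fallback(text), "fallback"
    cache[key] = label
    return label, "model"


def slow_model(text: str) -> str:
    raise TimeoutError("inference deadline exceeded")


def fast_model(text: str) -> str:
    return "Contract"


cache: Dict[str, str] = {}
print(classify("Invoice 17. Invoice date. Pay this invoice.", slow_model, cache))
print(classify("Master services agreement", fast_model, cache))
print(classify("Master services agreement", fast_model, cache))
```

Returning the source alongside the label makes it easy to monitor the fallback rate, which is a useful reliability signal in its own right.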

Evaluation Pipeline

Offline: Use Confusion Matrix to identify which classes are being confused (e.g., "Contract" vs. "Amendment").

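Online: A small random sample of predictions, stratified by confidence band, is sent for human verification to calculate "Production Accuracy." Sampling only "High Confidence" predictions would overestimate accuracy, because errors concentrate in the low-confidence tail.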
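For the offline confusion analysis described above, the full matrix over 50 classes is hard to read, so it helps to list the most frequent off-diagonal pairs. A minimal sketch with invented labels:

```python
from collections import Counter
from typing import List, Tuple


def most_confused_pairs(
    y_true: List[str], y_pred: List[str], top_k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Count (true, predicted) pairs for misclassified docs, most frequent first."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_k)


y_true = ["Contract", "Contract", "Contract", "Amendment", "Amendment", "Invoice"]
y_pred = ["Amendment", "Amendment", "Contract", "Amendment", "Contract", "Invoice"]

for (true_label, predicted), count in most_confused_pairs(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```

Pairs that dominate this list are candidates for clearer labeling guidelines, more training data, or a merged class.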

Monitoring Pipeline

System: Monitor OCR failure rates and inference latency.

Model: Track Prediction Drift. If the model suddenly starts classifying 80% of docs as "Unknown," trigger an alert.

Data: Monitor the distribution of document lengths and languages.
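Prediction drift, as described above, can be quantified by comparing the distribution of predicted classes in a recent window against a baseline. The sketch below uses the Population Stability Index (PSI). The 0.25 alert threshold is a common rule of thumb rather than something the design specifies, and the sample data is invented.

```python
import math
from collections import Counter
from typing import Dict, List


def class_distribution(labels: List[str], classes: List[str], eps: float = 1e-6) -> Dict[str, float]:
    """Share of predictions per class, floored at eps to avoid log(0)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: max(counts[c] / len(labels), eps) for c in classes}


def psi(expected: Dict[str, float], actual: Dict[str, float]) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over classes."""
    return sum((actual[c] - expected[c]) * math.log(actual[c] / expected[c]) for c in expected)


classes = ["Invoice", "Contract", "Unknown"]
baseline = ["Invoice"] * 70 + ["Contract"] * 25 + ["Unknown"] * 5
today = ["Invoice"] * 15 + ["Contract"] * 5 + ["Unknown"] * 80

score = psi(class_distribution(baseline, classes), class_distribution(today, classes))
ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff for a significant shift
print(f"PSI = {score:.2f}, alert = {score > ALERT_THRESHOLD}")
```

A shift in predicted classes can reflect a real change in incoming traffic rather than model failure, so an alert should trigger human review of a sample before retraining.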

Wrap Up

Final Evaluation

Observability: Use SHAP or LIME for model interpretability (e.g., "Why was this marked as a Resume?"). This is critical for business trust.

Edge Cases:

Cold Start: Use a "Miscellaneous" category for low-confidence scores.

Multi-page: Process the first and last page separately and concatenate embeddings.

Trade-offs:

Accuracy vs. Latency: We chose DistilBERT over BERT-Large to save an estimated 300ms of CPU time per request (a figure to confirm by benchmarking on the target hardware).

Complexity vs. Maintainability: Avoided multi-modal LayoutLM for MVP to keep the OCR-to-Feature pipeline simple.