The Question
ML Design: Large-Scale Privacy Redaction for Geospatial Imagery
Design a machine learning system to automatically detect and blur sensitive PII, specifically faces and license plates, in a global corpus of billions of high-resolution 360-degree street-level images. The system must prioritize near-perfect recall for privacy compliance while maintaining high precision to preserve the utility of the maps. Your design should detail a distributed batch-processing architecture, handle geometric distortions in spherical imagery, and explain how to manage global variations in license plate formats and hard negatives like statues or billboards. Address the trade-offs between processing throughput, inference cost, and model accuracy at petabyte scale.
YOLOv8
Apache Beam
Feature Pyramid Network
TensorRT
Quantization
Active Learning
Inpainting
Gaussian Blur
CNN
Questions & Insights
Clarifying Questions
Business Goal: Is the primary goal legal compliance (privacy) or user trust? (Target: 100% recall for privacy while minimizing over-blurring of landmarks/signs).
Constraints & Scale: What is the scale of imagery? (Assumption: 10+ PB of raw imagery, billions of faces/plates annually). What is the throughput requirement? (Assumption: Process imagery within 24-48 hours of ingestion).
Edge Cases: How do we handle "false" faces (statues, billboards) or non-sensitive plates (store signs)? (Assumption: These should not be blurred to preserve map utility).
Image Format: Are these raw 360-degree equirectangular images or pre-cut tiles? (Assumption: High-resolution equirectangular images that require tiling for GPU memory efficiency).
Assumptions:
Corpus: Billions of images globally.
Latency: Throughput-optimized batch processing (not real-time).
Accuracy: Recall is the "North Star" (must not miss a face).
Thinking Process
Identify the Core Bottleneck: The sheer volume of pixels. Processing high-res 360 images directly is computationally prohibitive. I need a "tiling and multi-scale" approach.
Retrieval vs. Ranking: That framing does not apply here; the analogous choice is Detection vs. Segmentation. Detection (bounding boxes) is faster and sufficient for blurring.
Scale Strategy: This is a classic "embarrassingly parallel" problem. I will use a distributed batch inference pipeline (Apache Beam/Dataflow) rather than a request-response API.
Quality Control: Automated blurring is prone to drift. I need a "Human-in-the-loop" (HITL) for high-uncertainty cases and a robust regression suite.
Elite Bonus Points
Geometric Distortion Awareness: Standard CNNs struggle with equirectangular distortion (objects near poles look stretched). I would implement "spherical tiling" or coordinate-aware convolutions.
Temporal/Spatial Consistency: If a face appears in three consecutive frames as the car moves, but the model only detects it in two, we can use "Motion Interpolation" or "Spatio-temporal tracking" to fill the gap and blur the missed frame.
Edge-Case Active Learning: Using a "Hard Negative Mining" strategy to specifically train the model on statues, mannequins, and billboards to reduce false positives that ruin the "vibe" of Street View.
Privacy-Safe Evaluation: Creating a "Golden Dataset" where faces are already blurred/synthetic so that human annotators never see the raw PII (Personally Identifiable Information) during the evaluation phase.
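The temporal/spatial consistency idea above can be sketched as a simple gap filler. This is a minimal illustration, assuming detections have already been associated into a per-object track (one entry per consecutive frame, `None` where the detector missed); the function name and the averaging rule are hypothetical, and a production system would more likely use a proper tracker.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in pixels


def fill_single_frame_gaps(track: List[Optional[Box]]) -> List[Optional[Box]]:
    """Fill one-frame gaps by averaging the boxes in the neighbouring frames.

    Assumption: `track` holds one object's box per consecutive frame,
    with None where the detector produced nothing.
    """
    filled = list(track)
    for i in range(1, len(track) - 1):
        prev_box, next_box = track[i - 1], track[i + 1]
        if track[i] is None and prev_box is not None and next_box is not None:
            # Linear interpolation at the midpoint between the two neighbours.
            filled[i] = tuple((p + n) / 2.0 for p, n in zip(prev_box, next_box))
    return filled
```

Only single-frame gaps are filled here; longer gaps are left as `None` because interpolating across them is much less reliable.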
Design Breakdown
Requirements
Product Goal: Automatically redact PII (faces and license plates) from Google Street View to comply with global privacy laws (GDPR, CCPA).
Success Metrics:
Online/Production Metrics: Recall (Percentage of PII blurred), Precision (avoiding blurring signs/landmarks).
Offline Metrics: mAP (mean Average Precision) at IoU 0.5, F1-score.
Guardrail Metrics: Inference cost per million images, Processing latency (Time-to-Map).
System Constraints: Massive storage (Petabytes), GPU-heavy workloads, variable image quality (weather/lighting).
Data Availability: Raw images from Street View cars, historical manually labeled data, synthetic data for rare plate formats.
ML Problem Framing
ML Task Type: Object Detection (2D Bounding Box detection).
Prediction Target: $P(\text{class} \mid \text{bounding box}, \text{image})$.
Inputs:
Image Features: RGB pixel data, GPS/Heading (for context), Camera Intrinsics.
Outputs: List of Bounding Boxes [x, y, w, h] with associated class (Face, Plate) and confidence score.
ML Challenges: High-resolution processing, extreme scale-variance (distant vs. close faces), and heavy class imbalance (most pixels are not PII).
Design Summary & MVP
Concise Summary: A distributed batch-processing pipeline that tiles 360-degree images, runs a high-performance Object Detection model (YOLO-based for speed/efficiency), and applies a Gaussian blur post-processing step on detected coordinates.
Model Architecture & Selection:
Baseline Model: Simple Haar Cascades or HOG-based detectors (Low accuracy).
Target Model: YOLOv8 or EfficientDet. These provide the best trade-off between inference speed (QPS) and mAP for small objects like distant license plates.
Choice Rationale: Single-stage detectors are significantly cheaper to run at Google scale than two-stage detectors (like Faster R-CNN) while reaching comparable recall.
Simplicity Audit: No need for real-time inference or complex Transformers. An optimized CNN-based detector on a distributed batch runner satisfies all requirements.
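The post-processing step in the summary (Gaussian blur over detected boxes) can be sketched as follows. This is an illustration, not the production redactor: it assumes NumPy is available, an H x W x C `uint8` image, and integer `(x, y, w, h)` boxes; `blur_boxes` and `sigma` are hypothetical names and defaults.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    xs = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def blur_boxes(image: np.ndarray, boxes, sigma: float = 5.0) -> np.ndarray:
    """Return a copy of `image` with each (x, y, w, h) box Gaussian-blurred."""
    out = image.copy()
    kernel = gaussian_kernel_1d(sigma)
    pad = len(kernel) // 2
    height, width = image.shape[:2]
    for x, y, w, h in boxes:
        # Clip the box to the image bounds.
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x1 <= x0 or y1 <= y0:
            continue
        region = out[y0:y1, x0:x1].astype(np.float64)
        # Separable blur: convolve columns, then rows, with edge padding
        # so that only pixels inside the box contribute.
        padded = np.pad(region, ((pad, pad), (0, 0), (0, 0)), mode="edge")
        region = np.apply_along_axis(
            lambda col: np.convolve(col, kernel, mode="valid"), 0, padded)
        padded = np.pad(region, ((0, 0), (pad, pad), (0, 0)), mode="edge")
        region = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="valid"), 1, padded)
        out[y0:y1, x0:x1] = np.clip(np.rint(region), 0, 255).astype(np.uint8)
    return out
```

Padding from inside the box keeps the blur from pulling in surrounding pixels. In practice the box would probably also be dilated by a margin before blurring, since a tight box can leave identifiable edges.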
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Massive ingestion from Street View cars. Metadata includes GPS, timestamp, and camera orientation.
Data Ingestion: Use Apache Kafka to buffer ingestion events. Images are stored in a distributed blob store (GCS/S3) partitioned by geographic S2 cells for spatial locality.
Data Processing: Equirectangular images are projected into multiple rectilinear tiles (pinhole camera views) to remove distortion, which significantly improves detection accuracy for standard CNNs.
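The equirectangular-to-rectilinear projection can be illustrated per pixel. The sketch below maps a coordinate in a virtual pinhole tile back to the panorama location it should be sampled from. The axis conventions (x right, y down, z forward; positive yaw to the right, positive pitch upward; panorama centre at longitude 0) and all parameter names are assumptions for illustration.

```python
import math
from typing import Tuple


def tile_pixel_to_equirect(
    u: float, v: float, tile_size: int, fov_deg: float,
    yaw_deg: float, pitch_deg: float, pano_w: int, pano_h: int,
) -> Tuple[float, float]:
    """Map tile coordinate (u, v) to (px, py) in the equirectangular image."""
    # Pinhole focal length in pixels for the requested field of view.
    focal = (tile_size / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    # Ray through the tile coordinate, in the camera frame.
    x, y, z = u - tile_size / 2.0, v - tile_size / 2.0, focal
    pitch, yaw = math.radians(pitch_deg), math.radians(yaw_deg)
    # Pitch: rotate about the x axis (positive pitch looks up, i.e. toward -y).
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    # Yaw: rotate about the vertical axis (positive yaw looks right).
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    # Ray direction -> spherical angles -> panorama pixel coordinates.
    lon = math.atan2(x, z)
    lat = math.asin(-y / math.sqrt(x * x + y * y + z * z))
    px = (lon / (2.0 * math.pi) + 0.5) * pano_w
    py = (0.5 - lat / math.pi) * pano_h
    return px % pano_w, py
```

Evaluating this for every tile pixel gives a sampling map for one (yaw, pitch) view. A real pipeline would vectorize the computation and cache the map per view, since it depends only on the camera geometry and not on the image content. The same mapping converts detected box corners back into panorama coordinates before blurring.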
Feature Pipeline
Feature Engineering: Standardize all tiles to a fixed resolution (e.g., 640x640). Apply color space normalization.
Offline Feature Pipeline: Batch jobs compute image brightness/contrast metadata to adjust detection thresholds (e.g., more sensitive in low-light/night images).
Training/Serving Skew: Use a unified preprocessing library (C++/TensorFlow Transform) to ensure tiles are generated identically during training and batch inference.
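The brightness-based threshold adjustment from the offline feature pipeline could look roughly like this. All numeric values (base threshold 0.5, low-light threshold 0.3, the luma cut-offs) are placeholders chosen for illustration, not tuned settings; the luma weights are the standard BT.601 coefficients.

```python
from typing import Iterable, Tuple


def mean_luma(pixels: Iterable[Tuple[int, int, int]]) -> float:
    """Average BT.601 luma over an iterable of (r, g, b) pixels in 0-255."""
    total, count = 0.0, 0
    for r, g, b in pixels:
        total += 0.299 * r + 0.587 * g + 0.114 * b
        count += 1
    return total / count if count else 0.0


def detection_threshold(
    luma: float,
    base: float = 0.5,          # hypothetical daytime confidence threshold
    low_light: float = 0.3,     # hypothetical, more sensitive night threshold
    dark_luma: float = 40.0,    # at or below this, use the low-light threshold
    bright_luma: float = 100.0, # at or above this, use the base threshold
) -> float:
    """Interpolate the confidence threshold between dark and bright images."""
    if luma <= dark_luma:
        return low_light
    if luma >= bright_luma:
        return base
    frac = (luma - dark_luma) / (bright_luma - dark_luma)
    return low_light + frac * (base - low_light)
```

Interpolating rather than switching at a single cut-off avoids a visible discontinuity in blur behaviour between two images taken at nearly the same brightness.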
Model Architecture
Problem Formulation: Supervised Object Detection.
Candidate Model Families:
YOLOv8: Best for speed/latency.
Faster R-CNN: Better for very small objects but 5x slower.
Architecture Design: YOLOv8 with a CSPDarknet-style backbone and a feature-pyramid neck (an FPN combined with a PAN path). The multi-scale neck is critical because license plates can vary from 20 pixels to 500 pixels in width.
Optimization: Use TensorRT for GPU inference acceleration and INT8 quantization to reduce compute costs by ~3-4x without significant recall loss.
Training Pipeline
Dataset Construction: Focus on "Hard Negatives." We include images of statues, printed faces on buses, and "false" plates (signs) to teach the model what not to blur.
Data Splitting: Split by Location/City, not just randomly. This prevents the model from "memorizing" specific static objects seen in both train and test sets.
Retraining Strategy: Triggered monthly or when new countries are added (since license plate designs vary globally).
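Splitting by location can be made deterministic by hashing a location key, so every image from the same city lands in the same split regardless of when it is ingested. A minimal sketch, assuming each image carries a `city_id` string (a coarse S2 cell ID would work the same way); the function name and fractions are illustrative.

```python
import hashlib


def assign_split(city_id: str, test_fraction: float = 0.1,
                 val_fraction: float = 0.1) -> str:
    """Map a location key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(city_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```

Because the assignment depends only on the key, adding a new country never reshuffles existing cities between train and test.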
Serving Pipeline
Serving Pattern: Batch Inference using Apache Beam/Dataflow. This allows for massive horizontal scaling across thousands of GPU workers.
Latency Optimization: Request Batching. Accumulate tiles to fill GPU memory for maximum throughput.
Reliability: If a tile fails, retry. If it fails 3 times, send the whole 360 image to a "safeguard" queue for manual review to ensure privacy compliance.
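The retry-then-escalate rule can be sketched in plain Python. Here `detect` stands in for the model call and `safeguard_queue` for the manual-review queue; both are hypothetical placeholders. In an Apache Beam pipeline the escalation path would more likely be a separate tagged output than an in-memory list.

```python
from typing import Any, Callable, List, Optional


def process_image_tiles(
    image_id: str,
    tiles: List[Any],
    detect: Callable[[Any], List[Any]],
    safeguard_queue: List[str],
    max_attempts: int = 3,
) -> Optional[List[Any]]:
    """Run detection on every tile, retrying each up to `max_attempts` times.

    Returns all detections, or None if any tile exhausted its retries, in
    which case the whole image is queued for manual review.
    """
    detections: List[Any] = []
    for tile in tiles:
        for attempt in range(1, max_attempts + 1):
            try:
                detections.extend(detect(tile))
                break  # tile succeeded, move on to the next one
            except Exception:  # broad on purpose: any failure counts
                if attempt == max_attempts:
                    # Fail closed: never publish a partially processed image.
                    safeguard_queue.append(image_id)
                    return None
    return detections
```

Returning `None` for the whole image, rather than the detections gathered so far, reflects the privacy-first stance: an image with one unprocessed tile may still contain unblurred PII.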
Evaluation Pipeline
Offline Evaluation: Use mAP@0.5 and a custom Privacy Recall metric: the percentage of ground-truth faces/plates that are detected with confidence above the operating threshold. Its complement is the miss rate, i.e., the PII that would be published unblurred.
Online Evaluation: Conduct a "Privacy Audit" on a random 1% sample of published images, where a human auditor checks for unblurred PII.
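One possible reading of the Privacy Recall metric is a coverage-style recall: a labeled face or plate counts as protected if at least one detection above the operating threshold overlaps it sufficiently. The sketch below assumes `(x, y, w, h)` boxes and detections given as `(box, score)` pairs; unlike mAP it does not enforce one-to-one matching, because for blurring only coverage matters.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def privacy_recall(
    ground_truth: List[Box],
    detections: List[Tuple[Box, float]],
    score_threshold: float = 0.5,
    iou_threshold: float = 0.5,
) -> float:
    """Fraction of ground-truth boxes covered by a kept detection."""
    if not ground_truth:
        return 1.0  # nothing to protect
    kept = [box for box, score in detections if score >= score_threshold]
    covered = sum(
        1 for gt in ground_truth
        if any(iou(gt, det) >= iou_threshold for det in kept)
    )
    return covered / len(ground_truth)
```

For example, with two labeled plates where one detection scores 0.9 and the other 0.2, the metric is 0.5 at a 0.5 threshold and 1.0 at a 0.1 threshold.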
Monitoring Pipeline
Data Monitoring: Track "Detection Density." If a specific geographic region suddenly shows 0 detections, the pipeline might be broken or the camera obscured.
Model Monitoring: Monitor the distribution of confidence scores. A shift toward lower scores suggests model decay (e.g., a new license plate design was introduced).
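Confidence-score monitoring can be made quantitative by comparing the current score histogram against a reference window. The sketch below uses the Population Stability Index (PSI) as one possible drift statistic; this is an assumption, not something the design prescribes. It also reports the mean shift so the direction of the change is visible. Scores are assumed to lie in [0, 1] and both lists to be non-empty.

```python
import math
from typing import List


def score_histogram(scores: List[float], bins: int = 10) -> List[float]:
    """Fraction of scores falling in each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]


def population_stability_index(reference: List[float], current: List[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), smoothed by eps."""
    ref = score_histogram(reference, bins)
    cur = score_histogram(current, bins)
    return sum((c - r) * math.log((c + eps) / (r + eps))
               for r, c in zip(ref, cur))


def mean_score_shift(reference: List[float], current: List[float]) -> float:
    """Negative values mean confidence scores have moved lower."""
    return sum(current) / len(current) - sum(reference) / len(reference)
```

A commonly quoted rule of thumb treats PSI above roughly 0.2 as a shift worth investigating, but the alert level should be calibrated per country on historical data.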
Wrap Up
Final Evaluation
Observability: Use dashboards to track the "Redaction Rate" per country.
Feedback Loop: Hard examples identified by auditors are fed back into the training set (Active Learning).
Edge Cases:
Cold Start: For new countries, use a "Generative Data" approach to synthesize that country's license plates onto existing images for initial training.
Over-blurring: High-precision thresholding is used for "Known Landmarks" to prevent blurring the Statue of Liberty's face.
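Trade-offs: Recall vs. Precision. In privacy, we always bias toward Recall: borderline detections get blurred (if the model is only 51% sure it's a face, we blur it), and some over-blurring is accepted as the cost of not missing PII.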
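The recall bias can be made concrete by deriving the operating threshold from labeled validation data instead of fixing it by hand. The sketch below picks the highest confidence threshold that still meets a target recall. It assumes that for each ground-truth object with an overlapping detection we have recorded the best matching score; objects with no overlapping detection are simply absent from `matched_scores`. All names are illustrative.

```python
import math
from typing import List, Optional


def threshold_for_recall(matched_scores: List[float], num_ground_truth: int,
                         target_recall: float) -> Optional[float]:
    """Highest threshold t such that keeping scores >= t reaches the target.

    Returns None when the target is unreachable at any threshold, i.e. too
    many ground-truth objects have no overlapping detection at all.
    """
    # Small epsilon guards against floating-point noise in the product.
    needed = math.ceil(target_recall * num_ground_truth - 1e-9)
    if needed <= 0:
        return 1.0
    if needed > len(matched_scores):
        return None
    ranked = sorted(matched_scores, reverse=True)
    # Keeping everything at or above the needed-th best score recovers
    # exactly `needed` ground-truth objects (more if scores tie).
    return ranked[needed - 1]
```

Choosing the highest threshold that meets the recall target keeps precision as high as the recall constraint allows. A `None` result indicates that lowering the threshold cannot help and the model itself needs more training data.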