The Question

Large-Scale Privacy Redaction for Geospatial Imagery

Design a machine learning system to automatically detect and blur sensitive PII, specifically faces and license plates, in a global corpus of billions of high-resolution 360-degree street-level images. The system must prioritize near-perfect recall for privacy compliance while maintaining high precision to preserve the utility of the maps. Your design should detail a distributed batch-processing architecture, handle geometric distortions in spherical imagery, and explain how to manage global variations in license plate formats and hard negatives like statues or billboards. Address the trade-offs between processing throughput, inference cost, and model accuracy at petabyte scale.

YOLOv8

Apache Beam

Feature Pyramid Network

TensorRT

Quantization

Active Learning

Inpainting

Gaussian Blur

CNN

Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal legal compliance (privacy) or user trust? (Target: recall as close to 100% as possible for privacy, while minimizing over-blurring of landmarks/signs).

Constraints & Scale: What is the scale of imagery? (Assumption: 10+ PB of raw imagery, billions of faces/plates annually). What is the throughput requirement? (Assumption: Process imagery within 24-48 hours of ingestion).

Edge Cases: How do we handle "false" faces (statues, billboards) or non-sensitive plates (store signs)? (Assumption: These should not be blurred to preserve map utility).

Image Format: Are these raw 360-degree equirectangular images or pre-cut tiles? (Assumption: High-resolution equirectangular images that require tiling for GPU memory efficiency).

Assumptions:

Corpus: Billions of images globally.

Latency: Throughput-optimized batch processing (not real-time).

Accuracy: Recall is the "North Star" (must not miss a face).

Thinking Process

Identify the Core Bottleneck: The sheer volume of pixels. Processing high-res 360 images directly is computationally prohibitive. I need a "tiling and multi-scale" approach.

Retrieval vs. Ranking: In this context, the equivalent choice is Detection vs. Segmentation. Detection (Bounding Boxes) is faster and sufficient for blurring.

Scale Strategy: This is a classic "embarrassingly parallel" problem. I will use a distributed batch inference pipeline (Apache Beam/Dataflow) rather than a request-response API.

Quality Control: Automated blurring is prone to drift. I need a "Human-in-the-loop" (HITL) for high-uncertainty cases and a robust regression suite.

Elite Bonus Points

Geometric Distortion Awareness: Standard CNNs struggle with equirectangular distortion (objects near poles look stretched). I would implement "spherical tiling" or coordinate-aware convolutions.

Temporal/Spatial Consistency: If a face appears in three consecutive frames as the car moves, but the model only detects it in two, we can use "Motion Interpolation" or "Spatio-temporal tracking" to fill the gap and blur the missed frame.

Edge-Case Active Learning: Using a "Hard Negative Mining" strategy to specifically train the model on statues, mannequins, and billboards to reduce false positives that ruin the "vibe" of Street View.

Privacy-Safe Evaluation: Creating a "Golden Dataset" where faces are already blurred/synthetic so that human annotators never see the raw PII (Personally Identifiable Information) during the evaluation phase.
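The temporal/spatial consistency idea above can be illustrated with a small gap-filling routine. This is a minimal sketch, assuming detections have already been associated into a per-object track and that boxes share one coordinate frame; the function name `fill_missing_boxes` and the `max_gap` parameter are hypothetical. Linear interpolation is a crude stand-in for real tracking.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def fill_missing_boxes(track: Dict[int, Box], max_gap: int = 2) -> Dict[int, Box]:
    """Linearly interpolate boxes for frames missing between two detections.

    track maps frame index -> box for one physical object.
    Gaps longer than max_gap frames are left untouched.
    """
    filled = dict(track)
    frames = sorted(track)
    for prev, nxt in zip(frames, frames[1:]):
        gap = nxt - prev - 1
        if gap == 0 or gap > max_gap:
            continue
        for frame in range(prev + 1, nxt):
            t = (frame - prev) / (nxt - prev)
            filled[frame] = tuple(
                a + t * (b - a) for a, b in zip(track[prev], track[nxt])
            )
    return filled
```

Interpolated boxes would typically be padded before blurring, since the true position in the missed frame is only estimated.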

Design Breakdown

Requirements

Product Goal: Automatically redact PII (faces and license plates) from Google Street View to comply with global privacy laws (GDPR, CCPA).

Success Metrics:

Online/Production Metrics: Recall (percentage of real faces/plates that get blurred), Precision (percentage of blurred regions that are real PII, i.e. avoiding blurring signs/landmarks).

Offline Metrics: mAP (mean Average Precision) at IoU 0.5, F1-score.

Guardrail Metrics: Inference cost per million images, Processing latency (Time-to-Map).

System Constraints: Massive storage (Petabytes), GPU-heavy workloads, variable image quality (weather/lighting).

Data Availability: Raw images from Street View cars, historical manually labeled data, synthetic data for rare plate formats.
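The inference-cost guardrail above can be reasoned about with back-of-envelope arithmetic. The helper below is a sketch only: the function name and every input (daily image volume, tiles per image, per-GPU throughput, utilization) are hypothetical parameters to be replaced with measured values.

```python
import math


def gpus_needed(images_per_day: int, tiles_per_image: int,
                tiles_per_gpu_second: float, utilization: float = 0.7) -> int:
    """Estimate the steady-state GPU count needed to keep up with ingestion."""
    tiles_per_day = images_per_day * tiles_per_image
    # Effective tiles one GPU can process in 24 hours (86,400 seconds).
    per_gpu_per_day = tiles_per_gpu_second * utilization * 86_400
    return math.ceil(tiles_per_day / per_gpu_per_day)
```

For example, with made-up inputs of 10 million images per day, 12 tiles per image, 500 tiles per second per GPU, and 50% utilization, `gpus_needed(10_000_000, 12, 500, 0.5)` returns 6.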

ML Problem Framing

ML Task Type: Object Detection (2D Bounding Box detection).

Prediction Target:

P(\text{class} \mid \text{bounding box}, \text{image})

Inputs:

Image Features: RGB pixel data, GPS/Heading (for context), Camera Intrinsics.

Outputs: List of bounding boxes [x, y, w, h], each with an associated class (Face, Plate) and confidence score.

ML Challenges: High-resolution processing, extreme scale-variance (distant vs. close faces), and heavy class imbalance (most pixels are not PII).

Design Summary & MVP

Concise Summary: A distributed batch-processing pipeline that tiles 360-degree images, runs a high-performance Object Detection model (YOLO-based for speed/efficiency), and applies a Gaussian blur post-processing step on detected coordinates.

Model Architecture & Selection:

Baseline Model: Simple Haar Cascades or HOG-based detectors (Low accuracy).

Target Model: YOLOv8 or EfficientDet. These provide the best trade-off between inference speed (QPS) and mAP for small objects like distant license plates.

Choice Rationale: Single-stage detectors are significantly cheaper to run at Google scale than two-stage detectors (like Faster R-CNN) while reaching comparable recall.

Simplicity Audit: No need for real-time inference or complex Transformers. An optimized CNN-based detector on a distributed batch runner satisfies all requirements.
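The Gaussian blur post-processing step in the summary above can be sketched with NumPy alone. This is an illustrative version, assuming 8-bit (H, W, C) images and [x, y, w, h] boxes in pixel coordinates; the names `gaussian_kernel` and `blur_region` are hypothetical, and a production pipeline would more likely call an optimized image library.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel truncated at about 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def blur_region(image: np.ndarray, box, sigma: float = 5.0) -> np.ndarray:
    """Return a copy of an 8-bit (H, W, C) image with box [x, y, w, h] blurred."""
    x, y, w, h = (int(round(v)) for v in box)
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, image.shape[1]), min(y + h, image.shape[0])
    out = image.copy()
    if x1 <= x0 or y1 <= y0:
        return out  # box lies outside the image
    region = image[y0:y1, x0:x1].astype(np.float64)
    kernel = gaussian_kernel(sigma)
    r = len(kernel) // 2
    # Edge padding keeps the region size and uses only pixels from inside the box.
    padded = np.pad(region, ((r, r), (r, r), (0, 0)), mode="edge")
    # Separable filter: convolve along rows, then along columns.
    tmp = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, tmp)
    out[y0:y1, x0:x1] = np.clip(np.rint(blurred), 0, 255).astype(image.dtype)
    return out
```

In practice the box would usually be expanded by a margin before blurring, and sigma scaled with box size so that large faces are not left recognizable.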

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Massive ingestion from Street View cars. Metadata includes GPS, timestamp, and camera orientation.

Data Ingestion: Use Apache Kafka to buffer ingestion events. Images are stored in a distributed blob store (GCS/S3) partitioned by geographic S2 cells for spatial locality.

Data Processing: Equirectangular images are projected into multiple rectilinear tiles (pinhole camera views) to remove distortion, which significantly improves detection accuracy for standard CNNs.
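The equirectangular-to-rectilinear projection described above can be sketched as follows. This is a minimal version under stated assumptions: the panorama spans 360 degrees horizontally and 180 degrees vertically, its center column corresponds to yaw 0, and sampling is nearest-neighbor. The function name `rectilinear_tile` and its parameters are hypothetical.

```python
import numpy as np


def rectilinear_tile(pano, yaw_deg, pitch_deg, fov_deg=90.0, size=640):
    """Sample a square pinhole-camera view from an equirectangular panorama.

    pano: (H, W, C) array. Positive yaw turns the view right,
    positive pitch tilts it up.
    """
    height, width = pano.shape[:2]
    focal = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # One ray per output pixel; camera frame is x right, y down, z forward.
    offsets = np.arange(size) - (size - 1) / 2.0
    u, v = np.meshgrid(offsets, offsets)
    rays = np.stack([u, v, np.full_like(u, focal)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(pitch), -np.sin(pitch)],
                      [0, np.sin(pitch), np.cos(pitch)]])
    rot_y = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                      [0, 1, 0],
                      [-np.sin(yaw), 0, np.cos(yaw)]])
    rays = rays @ (rot_y @ rot_x).T  # apply pitch first, then yaw

    lon = np.arctan2(rays[..., 0], rays[..., 2])        # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(-rays[..., 1], -1.0, 1.0))  # latitude, up is positive

    # Nearest-neighbor lookup: longitude wraps around, latitude is clamped.
    px = ((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    py = np.clip(((0.5 - lat / np.pi) * height).astype(int), 0, height - 1)
    return pano[py, px]
```

Covering the sphere would mean calling this for a grid of overlapping yaw/pitch views; the amount of overlap trades extra compute against the risk of cutting a face or plate at a tile edge.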

Feature Pipeline

Feature Engineering: Standardize all tiles to a fixed resolution (e.g., 640x640). Apply color space normalization.

Offline Feature Pipeline: Batch jobs compute image brightness/contrast metadata to adjust detection thresholds (e.g., more sensitive in low-light/night images).

Training/Serving Skew: Use a unified preprocessing library (C++/TensorFlow Transform) to ensure tiles are generated identically during training and batch inference.
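The brightness-based threshold adjustment from the offline feature pipeline could look roughly like this. It is a sketch: the luma weights are the standard Rec. 601 coefficients, but the function names and all threshold values (`base`, `low_light`, `dark_below`) are hypothetical placeholders that would need tuning on validation data.

```python
import numpy as np


def mean_luma(image: np.ndarray) -> float:
    """Average Rec. 601 luma of an (H, W, 3) RGB image with values in 0..255."""
    weights = np.array([0.299, 0.587, 0.114])
    return float((image[..., :3].astype(np.float64) @ weights).mean())


def detection_threshold(luma: float, base: float = 0.5,
                        low_light: float = 0.35, dark_below: float = 60.0) -> float:
    """Use a lower (more sensitive) confidence threshold for dark images."""
    return low_light if luma < dark_below else base
```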

Model Architecture

Problem Formulation: Supervised Object Detection.

Candidate Model Families:

YOLOv8: Best for speed/latency.

Faster R-CNN: Better for very small objects but roughly 5x slower.

Architecture Design: YOLOv8 with its CSPDarknet-style backbone and an FPN (Feature Pyramid Network) style multi-scale neck (in YOLOv8 the FPN top-down path is paired with a PAN bottom-up path). The FPN is critical because license plates can vary from 20 pixels to 500 pixels in width.

Optimization: Use TensorRT for GPU inference acceleration and INT8 quantization to reduce compute costs by ~3-4x without significant recall loss.
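To make the INT8 idea concrete, the arithmetic behind symmetric per-tensor quantization is shown below. This only illustrates the number format; it is not TensorRT's API or its calibration procedure, and the function names are hypothetical.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Map float values to int8 using one symmetric scale for the whole tensor."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the scale per element, which is why recall should be re-measured after quantization rather than assumed unchanged.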

Training Pipeline

Dataset Construction: Focus on "Hard Negatives." We include images of statues, printed faces on buses, and "false" plates (signs) to teach the model what not to blur.

Data Splitting: Split by Location/City, not just randomly. This prevents the model from "memorizing" specific static objects seen in both train and test sets.

Retraining Strategy: Triggered monthly or when new countries are added (since license plate designs vary globally).
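The location-based split described above can be made deterministic by hashing a region identifier, so every image from the same city lands in the same split. This is a sketch; the function name, the use of a city or S2 cell string as `region_id`, and the split percentages are assumptions.

```python
import hashlib


def assign_split(region_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a whole region (e.g. a city or S2 cell ID) to train, val, or test."""
    digest = hashlib.sha256(region_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"
```

Because the assignment depends only on the region ID, re-running the split after new imagery arrives never moves an existing region between train and test.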

Serving Pipeline

Serving Pattern: Batch Inference using Apache Beam/Dataflow. This allows for massive horizontal scaling across thousands of GPU workers.

Latency Optimization: Request Batching. Accumulate tiles to fill GPU memory for maximum throughput.

Reliability: If a tile fails, retry. If it fails 3 times, send the whole 360 image to a "safeguard" queue for manual review to ensure privacy compliance.
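The batching and retry behavior above can be sketched in plain Python, independent of the batch runner. In this simplified version retries happen per batch rather than per tile; `run_batch` stands in for the model call, and `safeguard_queue` is a plain list standing in for a real queue. All names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_image(image_id: str, tiles: Iterable, run_batch: Callable[[List], List],
                  safeguard_queue: List[str], batch_size: int = 32,
                  max_attempts: int = 3) -> Optional[List]:
    """Run batched inference; escalate the whole image if a batch keeps failing."""
    detections: List = []
    for batch in batched(tiles, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                detections.extend(run_batch(batch))
                break
            except Exception:
                if attempt == max_attempts:
                    # Never publish a partially processed image.
                    safeguard_queue.append(image_id)
                    return None
    return detections
```

Returning `None` on failure keeps the contract simple: an image is either fully processed or held back for manual review.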

Evaluation Pipeline

Offline Evaluation: Use mAP@0.5 and a custom Privacy Recall metric (the % of ground-truth plates/faces that are covered by at least one detection with confidence above the operating threshold; the remainder is the miss rate).

Online Evaluation: Conduct a "Privacy Audit" on a random 1% sample of published images, where a human auditor checks for unblurred PII.
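One way to compute the Privacy Recall metric from offline evaluation is shown below. It is a sketch, assuming [x, y, w, h] boxes with positive area and counting a ground-truth object as covered when any kept detection overlaps it at the IoU threshold; the function names are hypothetical. Matching is deliberately not one-to-one, since for blurring only coverage matters.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes with positive area."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def privacy_recall(gt_boxes: Sequence[Box], pred_boxes: Sequence[Box],
                   pred_scores: Sequence[float], score_threshold: float,
                   iou_threshold: float = 0.5) -> float:
    """Fraction of ground-truth objects covered by a detection above threshold."""
    if not gt_boxes:
        return 1.0  # nothing to miss
    kept = [b for b, s in zip(pred_boxes, pred_scores) if s >= score_threshold]
    covered = sum(
        any(box_iou(gt, pred) >= iou_threshold for pred in kept) for gt in gt_boxes
    )
    return covered / len(gt_boxes)
```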

Monitoring Pipeline

Data Monitoring: Track "Detection Density." If a specific geographic region suddenly shows 0 detections, the pipeline might be broken or the camera obscured.
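The detection-density check above can be sketched as a simple aggregation. The record format of (region ID, detections in one image) pairs, the function name, and the `min_images` cutoff are all assumptions.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def zero_detection_regions(records: Iterable[Tuple[str, int]],
                           min_images: int = 100) -> List[str]:
    """Return regions with enough images but no detections at all.

    records: (region_id, number_of_detections_in_one_image) pairs.
    """
    images = defaultdict(int)
    detections = defaultdict(int)
    for region, count in records:
        images[region] += 1
        detections[region] += count
    return sorted(
        region for region in images
        if images[region] >= min_images and detections[region] == 0
    )
```

The `min_images` cutoff avoids alerting on regions that are simply sparse, such as a short rural road with no people or cars.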

Model Monitoring: Monitor the distribution of confidence scores. A shift toward lower scores indicates "Model Decay" (e.g., a new license plate design was introduced).
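A shift in the confidence distribution can be quantified with a drift statistic. The sketch below uses the population stability index (PSI), which is one common choice and not something the design above prescribes; the function name, bin count, and smoothing constant are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between two sets of confidence scores in [0, 1]; 0 means identical."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    ref_hist, _ = np.histogram(reference, bins=edges)
    cur_hist, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; eps avoids log(0) for empty bins.
    ref_p = ref_hist / max(ref_hist.sum(), 1) + eps
    cur_p = cur_hist / max(cur_hist.sum(), 1) + eps
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))
```

PSI reports only the size of a shift, not its direction, so it would be paired with a comparison of mean or median confidence to confirm that scores moved lower.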

Wrap Up

Final Evaluation

Observability: Use dashboards to track the "Redaction Rate" per country.

Feedback Loop: Hard examples identified by auditors are fed back into the training set (Active Learning).

Edge Cases:

Cold Start: For new countries, use a "Generative Data" approach to synthesize that country's license plates onto existing images for initial training.

Over-blurring: For "Known Landmarks," a stricter (higher) confidence threshold is applied before blurring, to prevent blurring the Statue of Liberty's face.
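For the cold-start case, the text side of synthetic plate generation can be sketched as below. The pattern syntax ("A" for a letter, "9" for a digit) and the function name are hypothetical, and the example pattern is a placeholder rather than any country's actual specification. Rendering the string onto imagery with realistic fonts, perspective, and lighting is the harder part and is not shown.

```python
import random
import string


def random_plate(pattern: str, rng: random.Random) -> str:
    """Fill a pattern: 'A' -> random uppercase letter, '9' -> random digit.

    Any other character (dash, space) is copied through unchanged.
    """
    chars = []
    for ch in pattern:
        if ch == "A":
            chars.append(rng.choice(string.ascii_uppercase))
        elif ch == "9":
            chars.append(rng.choice(string.digits))
        else:
            chars.append(ch)
    return "".join(chars)
```

Passing a seeded `random.Random` keeps generated datasets reproducible, for example `random_plate("AA-999-AA", random.Random(0))`.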

Trade-offs: Recall vs. Precision. In privacy, we always bias toward Recall: the blur threshold is set low, so even if the model is only 51% sure it's a face, we blur it.
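The recall bias can be made explicit by choosing the operating threshold from validation data instead of fixing it by hand. The sketch below assumes that each ground-truth object has been assigned the score of its best-matching detection (0.0 if it was never detected); the function name and the default target are hypothetical.

```python
import math
from typing import Sequence


def threshold_for_recall(best_scores: Sequence[float],
                         target_recall: float = 0.99) -> float:
    """Highest confidence threshold that still meets the target recall.

    best_scores: for each ground-truth object, the score of its best
    matching detection, or 0.0 if it was never detected.
    """
    if not best_scores:
        raise ValueError("need at least one ground-truth object")
    ranked = sorted(best_scores)
    # Number of lowest-scoring objects we can afford to miss.
    allowed_misses = math.floor((1.0 - target_recall) * len(ranked) + 1e-9)
    return ranked[min(allowed_misses, len(ranked) - 1)]
```

If the returned threshold is 0.0, the target recall cannot be reached by thresholding alone, because too many objects were never detected at any confidence; that points back to the model or the tiling rather than the threshold.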