The Question
ML Design: Video Anomaly Detection System
Design a large-scale, automated video surveillance system capable of identifying and alerting on unusual or suspicious activities in real-time. The solution should handle thousands of concurrent camera streams, minimize false positives through human-in-the-loop feedback, and function efficiently across varying environments and lighting conditions.
CNN
Autoencoder
LSTM
YOLO/Faster R-CNN
Optical Flow
Questions & Insights
Clarifying Questions
Business Goal: Is the primary metric minimizing "Time to Detect" (Latency), "False Alarm Rate" (Precision), or "Missed Critical Events" (Recall)?
Constraints & Scale: How many cameras (e.g., 100,000)? What is the frame rate and resolution (e.g., 1080p @ 30fps)? Is processing done at the Edge, Cloud, or Hybrid?
Edge Cases: How do we handle drastic lighting changes (day/night), occlusions, or camera jitter? How do we define "anomaly" in a way that generalizes across different scenes (e.g., a person running in a park vs. a person running in a bank)?
Assumptions: I assume a scale of 50,000 cameras, a latency requirement of <2 seconds from event to alert, and a hybrid architecture where basic motion filtering happens at the edge, while complex anomaly classification happens in the cloud.
Thinking Process
The Bottleneck: Video data is massive. Sending 50,000 raw streams to the cloud is cost-prohibitive. The system must use a hierarchical filtering approach: Edge (Motion/Object detection) -> Cloud (Temporal Analysis/Anomaly Ranking).
The ML Strategy: Anomalies are rare by definition (class imbalance). Supervised learning is difficult due to lack of labels. I should frame this as a Weakly Supervised Learning problem using Multiple Instance Learning (MIL) or a Self-Supervised reconstruction task.
Scaling the Inference: To handle high QPS, I need to use temporal windowing (clips instead of frames) and efficient backbones (Video Swin Transformers or X3D) optimized with TensorRT.
Feedback Loop: Human-in-the-loop is critical. Security guards dismissing alerts should act as a negative feedback signal to retrain the model and reduce false positives.
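A back-of-envelope calculation makes the bandwidth bottleneck above concrete. The sketch below assumes roughly 4 Mbps per compressed 1080p @ 30fps stream and an edge filter that forwards about 5% of footage. Both numbers are illustrative assumptions rather than figures from this design; only the 50,000-camera scale comes from the stated assumptions.

```python
# Back-of-envelope uplink estimate. Bitrate and pass fraction are assumptions.
NUM_CAMERAS = 50_000        # scale assumed in the clarifying questions
MBPS_PER_STREAM = 4.0       # assumed bitrate for compressed 1080p @ 30fps
EDGE_PASS_FRACTION = 0.05   # assumed share of footage the edge forwards


def aggregate_gbps(num_cameras: int, mbps_per_stream: float,
                   pass_fraction: float = 1.0) -> float:
    """Sustained bandwidth in Gbps for the footage that reaches the cloud."""
    return num_cameras * mbps_per_stream * pass_fraction / 1000.0


raw_gbps = aggregate_gbps(NUM_CAMERAS, MBPS_PER_STREAM)
filtered_gbps = aggregate_gbps(NUM_CAMERAS, MBPS_PER_STREAM, EDGE_PASS_FRACTION)
print(f"raw: {raw_gbps:.0f} Gbps, after edge filtering: {filtered_gbps:.0f} Gbps")
```

Under these assumptions the raw fleet needs a sustained uplink in the hundreds of Gbps, which is why the edge stage has to discard most footage before anything is sent upstream.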
Elite Bonus Points
Spatio-Temporal Graph Neural Networks (ST-GNNs): Instead of just pixels, model the relationships between detected objects (e.g., "Person A moving toward restricted Area B at high velocity").
Adaptive Thresholding: Using a per-camera, per-hour-of-day baseline. A crowd is "normal" at noon in a lobby but an "anomaly" at 3 AM.
Online Domain Adaptation: Using "Continual Learning" to let the model adapt to a new camera installation's specific background signals without catastrophic forgetting.
Privacy-Preserving Inference: Implementing on-device face blurring or differential privacy to ensure compliance with GDPR/CCPA before data hits the cloud.
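The adaptive thresholding idea above can be sketched as a per-(camera, hour) baseline of past scores. The class name `HourlyBaseline`, the mean-plus-k-standard-deviations rule, and all default values are hypothetical choices for illustration; the design only states that the baseline is per camera and per hour of day.

```python
from collections import defaultdict
from statistics import mean, pstdev


class HourlyBaseline:
    """Keeps score history per (camera, hour) and derives a threshold from it."""

    def __init__(self, k: float = 3.0, min_samples: int = 5,
                 default_threshold: float = 0.9):
        self.k = k                                  # assumed sensitivity factor
        self.min_samples = min_samples              # history needed before adapting
        self.default_threshold = default_threshold  # fallback for new cameras
        self._scores = defaultdict(list)

    def update(self, camera_id: str, hour: int, score: float) -> None:
        self._scores[(camera_id, hour)].append(score)

    def threshold(self, camera_id: str, hour: int) -> float:
        history = self._scores.get((camera_id, hour), [])
        if len(history) < self.min_samples:
            return self.default_threshold
        # Mean plus k standard deviations, capped at the maximum possible score.
        return min(1.0, mean(history) + self.k * pstdev(history))

    def is_anomalous(self, camera_id: str, hour: int, score: float) -> bool:
        return score > self.threshold(camera_id, hour)


baseline = HourlyBaseline()
for s in [0.55, 0.60, 0.65, 0.60, 0.58, 0.62]:   # busy lobby at noon
    baseline.update("lobby-01", 12, s)
for s in [0.05, 0.04, 0.06, 0.05, 0.05, 0.05]:   # empty lobby at 3 AM
    baseline.update("lobby-01", 3, s)

noon_flag = baseline.is_anomalous("lobby-01", 12, 0.62)
night_flag = baseline.is_anomalous("lobby-01", 3, 0.62)
print(noon_flag, night_flag)
```

The same crowd-level score is accepted at noon and flagged at 3 AM, which is the behavior the lobby example describes.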
Design Breakdown
Functional Reqs
Real-time Alerts: System triggers an alert to a dashboard when a high-probability anomaly is detected.
Video Summarization: Provide a "highlight" clip of the anomalous event.
Dashboard: Allow operators to confirm or reject anomalies (Labeling).
Non-Functional Reqs
Low Latency: End-to-end detection in under 2 seconds.
High Availability: 99.99% uptime; security monitoring cannot go down.
Scalability: Ability to onboard thousands of new cameras without linear cost increases.
Cost Efficiency: Intelligent "shaping" of video traffic to minimize egress costs.
ML Problem Framing
ML Objective: Maximize the Area Under the ROC Curve (AUC-ROC) for anomaly detection while keeping the False Alarm Rate (FAR) below 0.1 false alarms per camera per day.
ML Category: Weakly Supervised Classification (Multiple Instance Learning) or Out-of-Distribution (OOD) Detection.
Input/Output/Label:
Input: A sequence of $T$ video frames, each of shape $H \times W \times C$.
Output: An anomaly score $s \in [0, 1]$.
Label: Binary (1 = Anomaly, 0 = Normal). In MIL, we label a whole video clip as anomalous, even if we don't know the exact frame.
Data Prep & Features
Data Pipeline: RTSP streams ingested via Kafka/Flink.
Feature Engineering:
Low-level: Optical Flow (motion vectors), HOG (gradients).
Deep Features: Embeddings from a pre-trained I3D (Inflated 3D ConvNet) or Video Swin Transformer.
Contextual: Time of day, Camera ID, Weather metadata.
Feature Store: Store camera-specific "Normalcy Embeddings" representing the statistical baseline of a specific scene.
Model Architecture
Backbone: X3D or SlowFast networks for efficient spatio-temporal feature extraction.
Anomaly Head: A Multiple Instance Learning (MIL) MLP.
Treat each video as a bag and divide it into segments (instances).
The model predicts scores for each segment.
Loss is computed based on the max-scoring segment in an anomalous video vs. the max-scoring segment in a normal video (Ranking Loss).
Refinement: Use an Autoencoder to detect anomalies via reconstruction error (high error = anomaly).
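The ranking loss described above can be written as a hinge loss on the top-scoring segment of each bag. This is a minimal sketch in plain Python; the function name and the margin of 1.0 are assumptions, and a real training loop would compute the same expression on tensors inside an autograd framework.

```python
from typing import Sequence


def mil_ranking_loss(anomalous_bag: Sequence[float],
                     normal_bag: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the max-scoring segments of two videos.

    Each bag holds per-segment anomaly scores in [0, 1] for one video.
    The loss is zero once the top anomalous segment outscores the top
    normal segment by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, margin - max(anomalous_bag) + max(normal_bag))


# One anomalous video with a single suspicious segment, one normal video.
well_ranked = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.15])
badly_ranked = mil_ranking_loss([0.1, 0.2, 0.1], [0.3, 0.8, 0.4])
print(well_ranked, badly_ranked)
```

Only the maximum of each bag contributes, so the model is never told which segment of the anomalous video is the anomalous one; it only has to rank its best guess above everything in the normal video.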
Training & Serving
Training: Distributed training on 3D video chunks. Use Contrastive Learning (e.g., VideoMoCo) to learn robust features from unlabeled video.
Serving:
Stage 1 (Edge): Motion detection/YOLOv8 to filter out empty scenes.
Stage 2 (Cloud): Triton Inference Server serving the MIL model on batches of clips.
Position Bias: Adjust for camera angle/height via coordinate encoding.
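The Stage 1 motion filter can be approximated with simple frame differencing. The sketch below treats frames as nested lists of grayscale values so that it runs without dependencies; an edge device would more likely do this on arrays or in hardware. The helper names and the `pixel_delta` value are hypothetical, and the 0.95 default mirrors the 95% static-pixel figure used in the serving pipeline.

```python
from typing import List

Frame = List[List[int]]  # grayscale pixel rows, values 0..255


def static_fraction(prev: Frame, curr: Frame, pixel_delta: int = 15) -> float:
    """Fraction of pixels whose intensity changed by at most pixel_delta."""
    total = 0
    static = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(p - c) <= pixel_delta:
                static += 1
    return static / total if total else 1.0


def should_run_inference(prev: Frame, curr: Frame,
                         static_threshold: float = 0.95,
                         pixel_delta: int = 15) -> bool:
    """Forward the frame pair only when enough of the scene has changed."""
    return static_fraction(prev, curr, pixel_delta) < static_threshold


background = [[0] * 10 for _ in range(10)]

# Two changed pixels out of 100: treated as sensor noise, inference skipped.
noisy = [row[:] for row in background]
noisy[0][0] = noisy[5][5] = 200

# A full changed row (10 of 100 pixels): treated as motion, clip forwarded.
moving = [row[:] for row in background]
moving[4] = [200] * 10

print(should_run_inference(background, noisy),
      should_run_inference(background, moving))
```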
System Architecture
Pipeline Deep Dive
Data Pipeline
Ingestion: Use Kafka with a retention policy of 24 hours. Video is chunked into 5-second segments (TS files).
Storage: Use a tiered approach. Raw video in S3 Glacier for long-term compliance; extracted embeddings in a Hot layer for retraining.
Processing: Flink handles the streaming "windowing" (e.g., gathering 32 frames for a 3D CNN input).
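The windowing step can be illustrated with a small sliding buffer. This is a plain-Python stand-in for the logic, not Flink API code; the class name `ClipWindow` and the stride of 16 frames (50% overlap) are assumptions, while the 32-frame clip length comes from the example above.

```python
from collections import deque
from typing import Any, List, Optional


class ClipWindow:
    """Collects frames and emits overlapping fixed-length clips."""

    def __init__(self, clip_len: int = 32, stride: int = 16):
        self.clip_len = clip_len
        self.stride = stride                 # assumed hop between clips
        self._buf = deque(maxlen=clip_len)   # oldest frames fall off the left
        self._since_emit = 0

    def push(self, frame: Any) -> Optional[List[Any]]:
        """Add one frame; return a full clip when one is due, else None."""
        self._buf.append(frame)
        self._since_emit += 1
        if len(self._buf) == self.clip_len and self._since_emit >= self.stride:
            self._since_emit = 0
            return list(self._buf)
        return None


window = ClipWindow(clip_len=4, stride=2)
clips = [clip for clip in (window.push(i) for i in range(8)) if clip is not None]
print(clips)
```

With a clip length of 4 and a stride of 2, eight frames produce three overlapping clips. Overlap trades extra compute for a lower chance that a short event is split across two clips.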
Feature Pipeline
Extraction: Compute Spatio-temporal interest points.
Feature Store: Consistency is key. The offline store (for training) and online store (for inference) must use the same preprocessing logic (Normalization, Resize).
Lineage: Every feature is tagged with the version of the Backbone Model used to generate it.
Training Pipeline
Strategy: Use a "Teacher-Student" setup. A large 3D Transformer (Teacher) generates pseudo-labels for a smaller, faster MobileNetV3-3D (Student) for inference.
Orchestration: Airflow DAGs trigger weekly retraining or whenever "Feedback Labels" reach a certain threshold.
Serving Pipeline
Two-Stage Approach:
Motion Filter: If at least 95% of pixels are static between consecutive frames, skip inference. On mostly idle feeds this can save on the order of 90% of GPU cost.
Full Scoring: Pass moving segments through the MIL-Ranker.
Calibration: Use Platt Scaling to ensure the "Anomaly Score" actually reflects the probability of a real event.
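Platt scaling fits a sigmoid p = 1 / (1 + exp(-(a * s + b))) on held-out scores and operator-confirmed labels. The sketch below fits a and b with plain gradient descent on the log loss; the function names, learning rate, epoch count, and toy data are all illustrative assumptions, and in practice a library logistic regression would do the same job.

```python
import math
from typing import Sequence, Tuple


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw anomaly score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.5, epochs: int = 2000) -> Tuple[float, float]:
    """Fit sigmoid parameters (a, b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = calibrate(s, a, b) - y   # derivative of log loss w.r.t. logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out set: raw model scores with operator-confirmed labels.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
confirmed = [0, 0, 0, 1, 0, 1, 1, 1, 1]

a, b = fit_platt(raw_scores, confirmed)
low_p = calibrate(0.1, a, b)
high_p = calibrate(0.9, a, b)
print(f"a={a:.2f}, b={b:.2f}, p(0.1)={low_p:.2f}, p(0.9)={high_p:.2f}")
```

Because the mapping is monotonic, calibration does not change the ranking of alerts; it only makes a fixed alert threshold mean the same thing across model versions.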
Evaluation Pipeline
Interleaved Testing: Deploy a new model alongside the old one and compare which alerts the operator clicks on more frequently.
Metrics: Track Mean Time to Detection (MTTD).
Monitoring Pipeline
Concept Drift: Monitor the average anomaly score per camera. If a camera's score shifts from 0.1 to 0.8 suddenly, it's likely a hardware fault or a scene change (construction), not a real anomaly.
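A per-camera drift check along these lines can compare a short recent window of scores against a longer baseline. The class name `ScoreDriftMonitor`, the window lengths, and the 0.5 shift threshold are hypothetical values chosen for illustration; one monitor instance would be kept per camera.

```python
from collections import deque
from statistics import mean


class ScoreDriftMonitor:
    """Flags a camera when its recent mean score departs from its baseline."""

    def __init__(self, baseline_len: int = 100, recent_len: int = 10,
                 max_shift: float = 0.5):
        self.baseline = deque(maxlen=baseline_len)
        self.recent = deque(maxlen=recent_len)
        self.max_shift = max_shift   # assumed tolerance on the mean score

    def observe(self, score: float) -> bool:
        """Record one score; return True if the camera looks drifted."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent score graduates into the long-term baseline.
            self.baseline.append(self.recent[0])
        self.recent.append(score)
        if len(self.baseline) < self.recent.maxlen:
            return False             # not enough history to judge yet
        return abs(mean(self.recent) - mean(self.baseline)) > self.max_shift


monitor = ScoreDriftMonitor()
stable_flags = [monitor.observe(0.1) for _ in range(30)]
shifted_flags = [monitor.observe(0.8) for _ in range(10)]
print(any(stable_flags), shifted_flags[-1])
```

A flagged camera would be routed to a hardware or scene check instead of raising security alerts, matching the interpretation above that a sustained jump is more likely a fault or scene change than a real event.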
Wrap Up
Advanced Topics
Offline Metrics: AUC-ROC, Precision-Recall Curve, and F1-score on a curated "Golden Dataset" of staged anomalies.
Online Metrics: False Alarms per Camera per Day (Target < 0.1) and Successful Detection Rate.
Deployment: Canary rollout. Deploy to 5 cameras first, monitor GPU thermals and memory leaks, then ramp to the full fleet.
Failure Modes:
Adversarial Attacks: Someone wearing a specific pattern to "hide" from the model. Mitigation: Ensemble models and physical security.
Catastrophic Forgetting: Ensure the model still detects "old" anomalies while learning "new" ones.
Scalability Audit: For 10x growth, we would move more feature extraction logic into the camera firmware (Edge AI) using ONNX Runtime, with the goal of reducing cloud bandwidth costs by an estimated 95%.
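The offline and online metrics listed above can be computed with a few lines of code. The sketch below uses the pairwise (rank-based) definition of AUC-ROC and a straightforward ratio for false alarms per camera per day; the function names and the example numbers are illustrative assumptions, not measurements.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random anomaly outscores a random normal clip."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties count as half a win, per the standard rank-based definition.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_alarms_per_camera_day(rejected_alerts: int, num_cameras: int,
                                days: float) -> float:
    """Operator-rejected alerts normalized by fleet size and time span."""
    return rejected_alerts / (num_cameras * days)


# Toy golden set: four clips, two of them staged anomalies.
auc = auc_roc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])

# Hypothetical day of operation: 4,000 rejected alerts across 50,000 cameras.
far = false_alarms_per_camera_day(4_000, 50_000, 1.0)
print(auc, far, far < 0.1)
```

The pairwise AUC is quadratic in the number of clips, which is fine for a curated golden set; a sort-based implementation would be the usual choice for larger evaluations.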