Industrial Predictive Maintenance System at Scale
Design a machine learning system that predicts industrial equipment failure (e.g., in turbines and motors) across a fleet of 100,000 units, with each prediction issued at least 72 hours before the failure. Your design must address high-throughput sensor data ingestion (5M+ events/sec), extreme class imbalance, and the need for high-precision alerting to avoid unnecessary maintenance costs. Detail the end-to-end flow from streaming feature engineering through distributed training to real-time inference, and explain specifically how you handle temporal leakage, sensor noise, and model explainability for field operators.
XGBoost, Spark Streaming, Kafka, Feast, SHAP, FFT, S3, Ray
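The 72-hour lead time and the temporal-leakage requirement both come down to how labels and train/test splits are built, so a small illustration may help frame an answer. The sketch below is one possible approach, not a prescribed solution. It labels a feature window as positive only if a failure falls at least 72 hours after the window ends, and it drops training rows whose label window reaches past the split cutoff. The names `label_window` and `temporal_split`, the 168-hour `HORIZON`, and the row layout are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

LEAD = timedelta(hours=72)      # minimum lead time stated in the prompt
HORIZON = timedelta(hours=168)  # assumed: how far past the lead time a failure still counts


def label_window(window_end, failure_times):
    """Return 1 if any failure falls in [window_end + LEAD, window_end + LEAD + HORIZON)."""
    start = window_end + LEAD
    stop = start + HORIZON
    return int(any(start <= f < stop for f in failure_times))


def temporal_split(rows, cutoff, embargo=LEAD + HORIZON):
    """Split rows by time, with an embargo gap before the cutoff.

    A training row is kept only if its whole label window ends at or before
    the cutoff, so no training label depends on events from the test period.
    Rows inside the embargo gap are discarded.
    """
    train = [r for r in rows if r["window_end"] + embargo <= cutoff]
    test = [r for r in rows if r["window_end"] >= cutoff]
    return train, test


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    failures = [t0 + timedelta(hours=100)]  # hypothetical failure log for one unit

    # One feature window per day for 25 days (hypothetical data).
    rows = [{"window_end": t0 + timedelta(hours=h)} for h in range(0, 600, 24)]
    for r in rows:
        r["label"] = label_window(r["window_end"], failures)

    train, test = temporal_split(rows, cutoff=t0 + timedelta(hours=480))
    print(len(train), "training rows,", len(test), "test rows")
```

Windows that end less than 72 hours before a failure are labeled negative here; a real design might instead exclude them from training, since an alert raised that late cannot meet the lead-time requirement.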