The Question

Hotel Occupancy Forecasting System

Design a high-scale machine learning system to predict the daily occupancy rate for a global hotel chain (100k+ properties) over a 30-day horizon. The system must integrate diverse data sources including real-time 'On-the-Books' reservation data, historical occupancy trends, and external market signals like local events and competitor pricing. Your design should address time-series specific challenges such as temporal leakage, lead-time dynamics, and cold-start problems for new hotels. Focus on the end-to-end lifecycle including data ingestion, feature engineering for time-series, batch inference scalability, and strategies for handling uncertainty in forecasts.

LightGBM

XGBoost

Spark

Kafka

Delta Lake

Feast

DynamoDB

Quantile Regression

Conformal Prediction

Fourier Transforms

Questions & Insights

Clarifying Questions

Business Goal: Is the primary goal to optimize dynamic pricing, manage staffing, or provide inventory insights to hotel owners?

Assumption: The goal is to provide accurate 30-day-ahead occupancy forecasts to hotel managers for operational planning and revenue management.

Constraints & Scale: How many hotels and what is the granularity?

Assumption: Global scale (100k+ hotels), predicting at the daily level. P99 latency for inference is less critical (batch-oriented), but data freshness for the "on-the-books" (OTB) bookings is vital.

Prediction Horizon: Are we predicting for tomorrow, next week, or next year?

Assumption: A multi-horizon forecast (e.g., 1, 7, 14, and 30 days out).

Edge Cases: How do we handle new hotels or major one-time events (e.g., Olympics)?

Assumption: We will use a "Cold Start" strategy using metadata for new hotels and integrate an external "Events" API for anomalies.

Thinking Process

Identify the Core Problem: This is a time-series forecasting problem but with heavy tabular influence (holidays, pricing, local events). Pure time-series models (ARIMA) won't capture the cross-hotel patterns or external features effectively.

Selection of Approach: For an MVP at scale, Gradient Boosted Decision Trees (GBDT) like LightGBM or XGBoost are superior to Deep Learning (LSTMs/Transformers) because they handle tabular data, missing values, and varying scales of features better with less tuning.

Bottleneck Identification: The primary bottleneck is the "Lead Time" or "Booking Window." Today's occupancy is a result of bookings made over the last 6 months. The model must account for "On-the-Books" (OTB) data—current confirmed reservations for future dates.

Scaling Strategy: Use a global model (one model for many hotels) with hotel-specific embeddings or features rather than one model per hotel to leverage cross-learning and reduce maintenance overhead.

Elite Bonus Points

Quantile Regression for Uncertainty: Don't just predict a single number (e.g., 85%). Predict an interval (e.g., occupancy between 80% and 90% with 95% coverage, taken from the 2.5% and 97.5% quantiles) to help managers understand risk in staffing.

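Conformal Prediction: Wrap the model in a conformal prediction layer so that the true occupancy falls within the predicted range a chosen percentage of the time, without assuming a particular error distribution. The guarantee is marginal and assumes exchangeable data, which time series can violate, so calibrate on recent data or use an adaptive variant.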

Causal Pricing Integration: Address the feedback loop where the occupancy forecast informs pricing, which in turn changes the occupancy. Use instrumental variables or double ML to decorrelate price effects from organic demand.

Hierarchical Reconciliation: Ensure that room-level forecasts sum up to hotel-level forecasts, and hotel-level forecasts sum up to regional-level forecasts, using MinT (Minimum Trace) reconciliation. Reconcile occupied-room counts rather than occupancy rates, since rates do not add across levels.
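The conformal prediction idea above can be sketched with split conformal calibration. This is a minimal sketch under stated assumptions, not a production layer: the function names, the symmetric absolute-residual score, and the calibration numbers are all illustrative, and the coverage claim only holds when calibration errors and future errors are exchangeable.

```python
import math

def conformal_halfwidth(cal_actuals, cal_preds, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] targets (1 - alpha) coverage."""
    scores = sorted(abs(a - p) for a, p in zip(cal_actuals, cal_preds))
    n = len(scores)
    # Finite-sample corrected rank. Capping at n is a simplification:
    # strictly, a rank above n means the interval should be unbounded.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]

def conformal_interval(pred, halfwidth):
    """Interval around a point forecast, clipped because occupancy is a rate."""
    return max(0.0, pred - halfwidth), min(1.0, pred + halfwidth)

# Hypothetical held-out calibration set: actual vs. predicted occupancy.
cal_actuals = [0.5] * 9
cal_preds = [0.5 + 0.01 * i for i in range(1, 10)]
q = conformal_halfwidth(cal_actuals, cal_preds, alpha=0.1)
low, high = conformal_interval(0.95, q)
```

The same wrapper can sit on top of quantile forecasts by using the distance outside the predicted quantile band as the score instead of the absolute residual.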

Design Breakdown

Requirements

Product Goal: Predict the percentage of rooms occupied for a specific hotel on a specific future date.

Success Metrics:

Online Metrics: MAPE (Mean Absolute Percentage Error) reduction, Revenue per Available Room (RevPAR) lift (if tied to pricing).

Offline Metrics: Weighted MAPE (WAPE) to prioritize high-volume hotels, RMSE, Bias (to check for consistent over/under-prediction).

Guardrail Metrics: Training time, Inference latency per hotel batch, Data drift (KS test).

System Constraints: Daily batch updates, Support for 100k hotels, Integration with property management systems (PMS) for OTB data.

Data Availability: Historical occupancy, current bookings (OTB), hotel metadata (stars, location, amenities), external events, competitor prices.

ML Problem Framing

ML Task Type: Supervised Regression.

Prediction Target: Occupancy Rate $Y \in [0, 1]$

Inputs:

User/Hotel Features: Hotel ID (encoded), location, room count, average historical price, star rating.

Item/Time Features: Day of week, month, holiday flags, distance to major events.

Context Features: On-the-Books (OTB) data (number of rooms already booked for that future date as of today), search volume for that city.

Outputs: Scalar value (0.85) or quantiles ([0.81, 0.85, 0.89]).

ML Challenges: High seasonality, external shocks (pandemics, strikes), and the "Lead Time" dimension (the forecast for June 1st changes as we get closer to June 1st).
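For the quantile outputs listed above, the standard loss is the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically. The sketch below is illustrative: the function names are assumptions, and sorting is only one simple way to repair quantiles that cross.

```python
def pinball_loss(actual, predicted, q):
    """Pinball loss for quantile level q in (0, 1)."""
    diff = actual - predicted
    # Under-prediction (diff > 0) costs q per unit, over-prediction costs 1 - q.
    return max(q * diff, (q - 1) * diff)

def mean_pinball_loss(actuals, preds, q):
    """Average pinball loss over paired actuals and predictions."""
    return sum(pinball_loss(a, p, q) for a, p in zip(actuals, preds)) / len(actuals)

def fix_quantile_crossing(quantile_preds):
    """Sort predicted quantiles so lower levels never exceed higher ones."""
    return sorted(quantile_preds)

under = pinball_loss(0.9, 0.8, q=0.9)   # forecast too low for a high quantile
over = pinball_loss(0.7, 0.8, q=0.9)    # forecast too high for a high quantile
repaired = fix_quantile_crossing([0.81, 0.89, 0.85])
```

For the 0.9 quantile, forecasting too low is penalized nine times as heavily as forecasting too high, which is what pushes the fitted value toward the upper tail.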

Design Summary & MVP

Concise Summary: A global Gradient Boosted Decision Tree (LightGBM) model that ingests historical occupancy lags, current "On-the-Books" reservations, and external event signals to produce a daily occupancy forecast.

Model Architecture & Selection:

Baseline Model: Last year's occupancy for the same day (S-Naive) or a simple Linear Regression on OTB + Seasonality.

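Target Model: LightGBM with either a direct strategy (one model with lead time as a feature, or one model per horizon) or a recursive strategy for the different horizons. LightGBM does not natively produce multi-output regressions, so a multi-output setup means training one booster per horizon.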

Choice Rationale: LightGBM handles non-linear relationships between price/events and occupancy efficiently and scales to millions of rows of hotel-day data.

ML Life Cycle Summary: Data is ingested from PMS logs -> Features (OTB, Lags) are stored in a Feature Store -> LightGBM is trained offline using time-series cross-validation -> Model is served via a batch scoring service -> Monitoring tracks MAPE and Data Drift.

Simplicity Audit: Avoids complex LSTMs/Transformers for the MVP as GBDTs provide better explainability (feature importance) and are easier to debug in a production environment with heterogeneous data.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Property Management Systems (PMS) provide the "ground truth" (actual check-ins). External APIs provide competitive pricing and local events (concerts, conferences).

Data Ingestion: Batch ingestion for historical data; Change Data Capture (CDC) or streaming (Kafka) for real-time booking updates to keep OTB features fresh.

Data Storage: S3/HDFS for raw logs. Delta Lake for ACID transactions and time-travel (essential for debugging "what did the OTB look like 3 days ago?").

Data Processing: Spark for heavy-lift window functions (e.g., "average occupancy over the last 4 Mondays").

Feature Pipeline

Feature Definition:

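Lag Features: Occupancy at $T-7$, $T-14$, and $T-364$ days (364 rather than 365 keeps the day of week aligned). Lags shorter than the forecast lead time are not yet observed when the forecast is made, so they must be dropped or taken relative to the forecast creation date to avoid temporal leakage.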

OTB Features: Current rooms booked for target date $D$, booking velocity (how many new bookings in the last 24h for date $D$).

Temporal Features: Fourier terms (sine/cosine pairs at yearly and weekly periods) for seasonality, DayOfWeek, IsHoliday.
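The three feature groups above can be sketched in plain Python. Everything here is illustrative: the function names, the feature keys, and the sample values are assumptions, and the lag helper treats any lag day that is not yet observed at forecast time as missing so that it cannot leak future occupancy.

```python
import math
from datetime import date, timedelta

def lag_features(history, as_of, target_date, lags=(7, 14, 364)):
    """history maps date -> realized occupancy rate for one hotel."""
    feats = {}
    for k in lags:
        lag_day = target_date - timedelta(days=k)
        # Only use days that are fully observed before the forecast is made.
        feats[f"occ_lag_{k}"] = history.get(lag_day) if lag_day < as_of else None
    return feats

def booking_velocity(otb_now, otb_24h_ago):
    """Net new rooms booked for the target date over the last 24 hours."""
    return otb_now - otb_24h_ago

def fourier_terms(day, period=365.25, order=2):
    """Sine/cosine pairs encoding the position of `day` in the yearly cycle."""
    t = day.timetuple().tm_yday
    feats = {}
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * t / period
        feats[f"year_sin_{k}"] = math.sin(angle)
        feats[f"year_cos_{k}"] = math.cos(angle)
    return feats

# Hypothetical history for one hotel and a forecast made 7 days ahead.
history = {date(2024, 5, 18): 0.72, date(2023, 6, 3): 0.88}
features = {
    **lag_features(history, as_of=date(2024, 5, 25), target_date=date(2024, 6, 1)),
    "booking_velocity_24h": booking_velocity(42, 37),
    **fourier_terms(date(2024, 6, 1)),
}
```

Gradient boosted trees accept missing values, so returning `None` for an unavailable lag is usually simpler than imputing it.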

Feature Store: Use Feast to ensure the logic used to calculate "average occupancy" in training is identical to serving, preventing training-serving skew.

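Point-in-time Joins: Critical for time-series. A training row whose forecast creation date is Jan 1st must only use OTB snapshots and other feature values recorded on or before Jan 1st, even if later values for the same target date already exist in the warehouse.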
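A point-in-time lookup reduces to "latest snapshot at or before the forecast creation date". The following is a minimal sketch of that rule using a sorted list and binary search; the function name and snapshot layout are assumptions, and a feature store would perform the equivalent join at scale.

```python
from bisect import bisect_right
from datetime import date

def otb_as_of(snapshots, as_of):
    """Return rooms booked from the latest snapshot taken on or before `as_of`.

    snapshots: list of (snapshot_date, rooms_booked) sorted by snapshot_date.
    """
    dates = [d for d, _ in snapshots]
    i = bisect_right(dates, as_of)  # number of snapshots with date <= as_of
    return snapshots[i - 1][1] if i > 0 else None

# Hypothetical OTB snapshots for one hotel and one target date.
snapshots = [
    (date(2023, 12, 20), 10),
    (date(2023, 12, 28), 25),
    (date(2024, 1, 3), 40),
]
rooms_on_jan_1 = otb_as_of(snapshots, date(2024, 1, 1))
```

The Jan 3rd snapshot exists in storage but is invisible to a row created on Jan 1st, which is exactly the leakage the join prevents.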

Model Architecture

Problem Formulation: Regression for $P(\text{Occupancy} \mid \text{Context})$

Candidate Model Families:

Prophet/Auto-ARIMA: Good for single series, poor for cross-hotel features.

DeepAR (RNN): Good for probabilistic forecasting but expensive to train/serve.

LightGBM: Best balance. High performance on tabular features.

Architecture: A Global GBDT. Instead of 100k models, we train 1 model using Hotel_ID as a categorical feature (or target-encoded) and Region_ID to capture local trends.

Model Complexity: Use a single model for all horizons but include "Lead_Time" (days between today and target date) as a feature.
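One way to express the single-model, lead-time-as-feature setup is a shared feature list plus one LightGBM parameter set per quantile. This is a configuration sketch only: the feature names are hypothetical and the hyperparameter values are untuned placeholders, not recommendations.

```python
# Hypothetical feature names; the real schema depends on the feature store.
FEATURES = [
    "lead_time", "otb_rooms", "booking_velocity_24h", "occ_lag_364",
    "day_of_week", "is_holiday", "hotel_id", "region_id",
]
CATEGORICAL = ["hotel_id", "region_id"]
QUANTILES = (0.1, 0.5, 0.9)

def quantile_params(alpha):
    """LightGBM parameters for one quantile model (placeholder values)."""
    return {
        "objective": "quantile",   # pinball loss
        "alpha": alpha,            # quantile level to fit
        "learning_rate": 0.05,
        "num_leaves": 63,
        "min_data_in_leaf": 100,
        "feature_fraction": 0.8,
    }

PARAMS_BY_QUANTILE = {q: quantile_params(q) for q in QUANTILES}

# With LightGBM installed, each model would be trained roughly as:
#   train_set = lightgbm.Dataset(X[FEATURES], label=y,
#                                categorical_feature=CATEGORICAL)
#   booster = lightgbm.train(PARAMS_BY_QUANTILE[q], train_set,
#                            num_boost_round=500)
```

Because `lead_time` is just another column, the same booster serves the 1, 7, 14, and 30 day horizons, and the trees can learn that OTB matters more as the lead time shrinks.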

Training Pipeline

Dataset Construction: Creating a "Rectangular" dataset where each row is a (Hotel, Target_Date, Lead_Time) tuple.
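The rectangular dataset can be sketched as one row per (hotel, target date, lead time), with each row carrying only the OTB snapshot visible on its forecast creation date. The function name, column names, and input layout below are assumptions made for illustration.

```python
from datetime import date, timedelta

def build_rows(hotel_id, final_occupancy, otb_snapshots, horizons=(1, 7, 14, 30)):
    """Expand realized occupancy into (hotel, target_date, lead_time) rows.

    final_occupancy: {target_date: realized occupancy rate}
    otb_snapshots:   {(as_of_date, target_date): rooms booked as of that date}
    """
    rows = []
    for target_date, y in sorted(final_occupancy.items()):
        for lead in horizons:
            as_of = target_date - timedelta(days=lead)
            otb = otb_snapshots.get((as_of, target_date))
            if otb is None:
                continue  # no snapshot recorded for this creation date
            rows.append({
                "hotel_id": hotel_id,
                "target_date": target_date,
                "lead_time": lead,
                "as_of": as_of,
                "otb_rooms": otb,
                "y": y,
            })
    return rows

# Hypothetical data: one target date with snapshots at 1 and 7 days out.
final_occupancy = {date(2024, 6, 1): 0.9}
otb_snapshots = {
    (date(2024, 5, 31), date(2024, 6, 1)): 80,
    (date(2024, 5, 25), date(2024, 6, 1)): 55,
}
rows = build_rows("hotel_123", final_occupancy, otb_snapshots)
```

Each target date therefore contributes several training rows that share a label but differ in lead time and OTB, which is what lets one model learn the booking curve.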

Data Splitting: Time-based split is mandatory. Train on 2020-2022, validate on 2023, test on 2024. No random splitting.
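With lead-time rows, a time-based split needs a rule for rows that straddle the boundary. The sketch below assumes rows shaped like `{"as_of": ..., "target_date": ...}` (hypothetical names) and drops any row created before the cutoff whose label is only realized after it.

```python
from datetime import date

def time_split(rows, train_end, valid_end):
    """Split rows by time without letting labels cross the boundary."""
    # Training labels must be fully realized by the end of the training period.
    train = [r for r in rows if r["target_date"] <= train_end]
    # Validation forecasts must be created strictly after the training period.
    valid = [
        r for r in rows
        if r["as_of"] > train_end and r["target_date"] <= valid_end
    ]
    return train, valid

# Hypothetical rows around the 2022/2023 and 2023/2024 boundaries.
rows = [
    {"as_of": date(2022, 12, 1), "target_date": date(2022, 12, 31)},   # train
    {"as_of": date(2022, 12, 20), "target_date": date(2023, 1, 19)},   # straddles
    {"as_of": date(2023, 1, 5), "target_date": date(2023, 2, 4)},      # valid
    {"as_of": date(2023, 12, 20), "target_date": date(2024, 1, 19)},   # past valid_end
]
train, valid = time_split(rows, train_end=date(2022, 12, 31), valid_end=date(2023, 12, 31))
```

Repeating this split over several cutoffs gives the time-series cross-validation mentioned in the life cycle summary.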

Retraining Strategy: Weekly retraining to capture evolving trends in the travel industry (e.g., post-pandemic recovery).

Serving Pipeline

Serving Pattern: Batch Inference. Every night, calculate forecasts for the next 90 days for all 100k hotels (a superset of the required 30-day horizon).

Latency Optimization: Results are written to a low-latency NoSQL store (DynamoDB/Cassandra) for the UI to consume instantly.

Reliability: If the model fails, fall back to a "Seasonality-Adjusted OTB" heuristic (OTB + remaining historical pickup).
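The fallback heuristic can be sketched as current OTB plus the average pickup historically observed at the same lead time. This is a simplified illustration: the function names and data layout are assumptions, and a seasonality-adjusted version would keep a separate pickup curve per season or day of week.

```python
from collections import defaultdict

def average_pickup_by_lead(history):
    """history: iterable of (lead_time, otb_rooms_at_lead, final_rooms)."""
    totals = defaultdict(lambda: [0.0, 0])
    for lead, otb, final in history:
        totals[lead][0] += final - otb   # rooms picked up after the snapshot
        totals[lead][1] += 1
    return {lead: s / n for lead, (s, n) in totals.items()}

def fallback_forecast(otb_rooms, capacity, lead_time, pickup_curve):
    """Occupancy rate from OTB plus expected remaining pickup, capped at capacity."""
    expected_pickup = pickup_curve.get(lead_time, 0.0)
    rooms = min(capacity, otb_rooms + expected_pickup)
    return max(0.0, rooms / capacity)

# Hypothetical booking history for one hotel.
history = [(7, 60, 80), (7, 50, 90), (1, 85, 90)]
curve = average_pickup_by_lead(history)
forecast = fallback_forecast(otb_rooms=55, capacity=100, lead_time=7, pickup_curve=curve)
```

Because it needs only the OTB snapshot and a small lookup table, this path keeps working when the model artifact or feature pipeline is unavailable.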

Evaluation Pipeline

Offline Evaluation:

MAPE: Easy for business stakeholders to understand.

Weighted MAPE: Gives more weight to large hotels where a 1% error is more costly.
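The offline metrics can be computed in a few lines. In this sketch the function names are assumptions, WAPE is taken as total absolute error divided by total actuals, and MAPE skips zero actuals (for example a closed hotel) because the ratio is undefined there.

```python
def mape(actuals, preds):
    """Mean absolute percentage error over rows with a nonzero actual."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actuals, preds) if a != 0]
    return sum(ratios) / len(ratios)

def wape(actuals, preds):
    """Total absolute error divided by total actuals."""
    return sum(abs(a - p) for a, p in zip(actuals, preds)) / sum(abs(a) for a in actuals)

def bias(actuals, preds):
    """Mean signed error; positive means consistent over-prediction."""
    return sum(p - a for a, p in zip(actuals, preds)) / len(actuals)

# Hypothetical occupied-room counts: one large hotel and one small hotel.
actual_rooms = [100, 10]
predicted_rooms = [90, 15]
scores = {
    "mape": mape(actual_rooms, predicted_rooms),
    "wape": wape(actual_rooms, predicted_rooms),
    "bias": bias(actual_rooms, predicted_rooms),
}
```

In this toy case the small hotel's 50% miss dominates MAPE, while WAPE on room counts is driven mostly by the large hotel, which is the weighting effect described above.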

Online Evaluation: Shadow deployment. Run the new model alongside the old one for 2 weeks, comparing daily MAPE against actual check-in data.

Monitoring Pipeline

System Monitoring: Track the time taken to generate 9M predictions (100k hotels * 90 days).

Data Monitoring: Monitor if the "Rooms Booked" feature suddenly drops to zero (indicates an upstream PMS integration failure).

Model Monitoring: Track Prediction Drift. If the model starts predicting 100% occupancy for everything, trigger an alert for a potential "event" feature explosion.
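The data and prediction drift checks can be sketched with a two-sample Kolmogorov-Smirnov statistic plus a simple zero-share check for the "Rooms Booked" feature. The function names are assumptions, and any alert threshold applied to these values would need to be chosen from historical behavior.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Maximum gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for v in sorted(set(ref) | set(cur)):
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def zero_share(values):
    """Fraction of values equal to zero, e.g. for the rooms-booked feature."""
    return sum(1 for v in values if v == 0) / len(values)

# Hypothetical samples: last week's predictions vs. today's predictions.
same = ks_statistic([0.6, 0.7, 0.8], [0.6, 0.7, 0.8])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
broken_feed = zero_share([0, 0, 0, 12])
```

A statistic near 0 means the two distributions look alike, and a value near 1 means they barely overlap, as would happen if every prediction jumped to 100% occupancy.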

Wrap Up

Final Evaluation

Observability: Use "Residual Analysis" to see if errors are concentrated in specific regions or hotel types (e.g., luxury vs. budget).

Edge Cases:

Cold Start: For new hotels, use the average occupancy of the 10 nearest hotels with the same star rating.

Outliers: Cap occupancy at 100% and use robust scalers for pricing features.
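The cold-start rule above can be sketched as a nearest-neighbor average over hotels with the same star rating. The dictionary keys, function names, and sample hotels are assumptions, and great-circle distance stands in for whatever notion of "nearest" the business prefers.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def cold_start_occupancy(new_hotel, hotels, k=10):
    """Average occupancy of the k nearest hotels with the same star rating."""
    peers = [h for h in hotels if h["stars"] == new_hotel["stars"]]
    peers.sort(key=lambda h: haversine_km(
        new_hotel["lat"], new_hotel["lon"], h["lat"], h["lon"]))
    nearest = peers[:k]
    if not nearest:
        return None  # no comparable hotel; caller needs another default
    # Cap at 100% to match the outlier rule.
    return min(1.0, sum(h["occupancy"] for h in nearest) / len(nearest))

# Hypothetical hotels near a new 4-star property.
new_hotel = {"lat": 0.0, "lon": 0.0, "stars": 4}
hotels = [
    {"lat": 0.0, "lon": 0.1, "stars": 4, "occupancy": 0.8},
    {"lat": 0.0, "lon": 0.2, "stars": 4, "occupancy": 0.6},
    {"lat": 0.0, "lon": 0.05, "stars": 3, "occupancy": 0.1},
    {"lat": 0.0, "lon": 5.0, "stars": 4, "occupancy": 0.2},
]
estimate = cold_start_occupancy(new_hotel, hotels, k=2)
```

The closest hotel overall is ignored here because its star rating differs, so the estimate comes from the two nearest 4-star peers.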

Trade-offs:

Global vs. Local Models: A global model is easier to maintain but might underperform on a very unique "Boutique" hotel. We mitigate this with rich hotel-level features.

Distinguishing Insights:

Pickup Modeling: Instead of predicting total occupancy directly, predict the "Remaining Pickup" (how many more rooms will be booked between now and the target date) and add it to the current OTB. This is often more stable.
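The pickup formulation changes only the training target and the final reconstruction step. A minimal sketch, with hypothetical function names and numbers:

```python
def pickup_target(final_rooms, otb_rooms):
    """Training label: rooms booked between the snapshot and the stay date.

    Can be negative when cancellations exceed new bookings.
    """
    return final_rooms - otb_rooms

def occupancy_from_pickup(otb_rooms, predicted_pickup, capacity):
    """Rebuild the occupancy rate from OTB plus predicted pickup, clipped to [0, 1]."""
    rooms = otb_rooms + predicted_pickup
    return min(1.0, max(0.0, rooms / capacity))

# Hypothetical hotel with 100 rooms, 70 booked at the snapshot, 92 at the stay date.
label = pickup_target(final_rooms=92, otb_rooms=70)
rate = occupancy_from_pickup(otb_rooms=70, predicted_pickup=label, capacity=100)
```

The model then only has to explain the part of occupancy that is still unknown, since the already-booked rooms are added back deterministically.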