The Question

Hotel Review Sentiment Ranking

You are given a string positive_keywords containing space-separated words, and two arrays: hotel_ids and reviews. The arrays represent reviews for various hotels where reviews[i] corresponds to hotel_ids[i]. Your task is to rank the hotels based on the number of times any of the positive_keywords appear in their reviews. Requirements: 1. A keyword match is case-insensitive. 2. If a hotel has multiple reviews, the total count of keyword matches across all its reviews should be summed. 3. Return the list of unique hotel IDs sorted in descending order by their total match count. 4. If two hotels have the same match count, sort them by their hotel_id in ascending order. Example: Input: keywords = 'smart clean breakfast' hotel_ids = [1, 2, 1] reviews = ['This hotel is clean', 'The breakfast is good', 'The rooms are smart and clean'] Output: [1, 2] Explanation: Hotel 1 has 3 matches ('clean', 'smart', 'clean'). Hotel 2 has 1 match ('breakfast').

Java

Hashing

Sorting

HashSet

HashMap

Questions & Insights

Clarifying Questions

Input Format: How are the reviews and hotel IDs mapped? (Assumption: One-to-one mapping between hotel_ids[i] and reviews[i]).

Keyword Constraints: Are the keywords provided as a single string or a list, and is case sensitivity an issue? (Assumption: Case-insensitive, space-separated string).

Tie-breaking: If two hotels have the same number of positive keywords, how should they be ordered? (Assumption: Sort by hotel_id in ascending order).

Scalability: What are the constraints on the number of keywords (

M

) and the total length of reviews (

N

)? (Assumption:

M

up to

10^4

, total words in reviews up to

10^5

Thinking Process

Pre-processing: Store the positive keywords in a HashSet for

O(1)

average time complexity lookups. Ensure all keywords are normalized to lowercase.

Counting: Iterate through the reviews. For each review, split it into words, normalize them, and check if they exist in the keyword set. Aggregate the count for each unique hotel_id using a HashMap.

Sorting: Convert the aggregated map into a list of objects or pairs. Use a custom comparator to sort primarily by the keyword count in descending order, and secondarily by the hotel_id in ascending order.

Result Construction: Extract the sorted IDs into the final result array.

Implementation Breakdown

Problem Set

Requirements: Given a string of positive_keywords and arrays hotel_ids and reviews, return the hotel_ids sorted by the count of positive keywords found in their reviews.

Constraints:

Keywords and reviews contain alphanumeric characters and spaces.

Resulting list must be unique hotel IDs if a hotel has multiple reviews (though usually, each index is a unique review).

Time complexity should be near-linear relative to the total number of words.

Approach

Algorithm: Hashing + Custom Sorting.

Data Structure: HashSet<String> for keywords, HashMap<Integer, Integer> for scoring.

Complexity:

Time:

O(K + R \cdot W + N \log N)

, where

K

is the number of keywords,

R

is the number of reviews,

W

is the average words per review, and

N

is the number of unique hotels.

Space:

O(K + N)

to store the keywords and the hotel scores.

Implementation

Wrap Up

Advanced Topics

Trie-based Matching: If the number of keywords is massive and we need to match partial strings or prefixes, a Trie (Prefix Tree) would be more efficient than a HashSet.

Aho-Corasick Algorithm: For high-performance multi-pattern matching in a single pass of the review text, especially if keywords can contain multiple words.

Concurrency: Processing reviews can be parallelized. Use a ConcurrentHashMap or a map-reduce pattern to aggregate scores from different threads for extremely large datasets.

Normalization: In real-world scenarios (like at Booking.com), stemming (e.g., "breakfast" vs "breakfasts") and lemmatization are used to improve match accuracy.