The Question
Coding

Award Top K Hotels

Given a set of positive and negative keywords, and a collection of hotel reviews associated with specific hotel IDs, implement a system to rank and return the top K hotels. A positive keyword match increases a hotel's score by a fixed weight, while a negative match decreases it. In case of tie scores, hotels should be ordered by their ID. The solution should handle text normalization and provide efficient ranking for large datasets.
Java
HashSet
HashMap
Sorting
String Parsing
Questions & Insights

Clarifying Questions

How should we handle punctuation and case sensitivity?
Assumption: Keywords and reviews should be treated as case-insensitive, and punctuation in reviews (like commas or periods) should be ignored or replaced with spaces to ensure words are matched correctly.
How are ties handled if two hotels have the same total score?
Assumption: If scores are equal, the hotel with the smaller ID should be ranked higher (lexicographical/numerical order).
What are the input constraints?
Assumption: The number of reviews ($N$) and the length of each review ($L$) could be large, so we need an efficient way to lookup keywords, ideally $O(1)$ per word.
Can a keyword appear multiple times in a single review?
Assumption: Yes, every occurrence of a positive keyword adds 3 points, and every negative keyword subtracts 1 point.

Thinking Process

Keyword Lookup Optimization: To avoid nested loops, store the positive and negative keywords in HashSet structures. This allows O(1) average-time complexity for checking if a word from a review contributes to the score.
Score Aggregation: Iterate through the reviews and calculate the score for each. Since multiple reviews can belong to the same hotel ID, use a HashMap<Integer, Integer> to accumulate the total score for each hotel.
Custom Sorting: Once the total scores are aggregated, transform the map entries into a list and sort them using a custom comparator: primary sort by score descending, secondary sort by hotel ID ascending.
Result Extraction: Retrieve the first K hotel IDs from the sorted list.
Implementation Breakdown

Problem Set

Functional Requirements:
Input: String positive_keywords, String negative_keywords, int[] hotel_ids, String[] reviews, int k.
Output: List<Integer> containing the top K hotel IDs.
Scoring: +3 for each positive word, -1 for each negative word.
Constraints:
Time Complexity should ideally be O(W + N \cdot L + M \log M), where W is total keywords, N is number of reviews, L is words per review, and M is unique hotels.
Space Complexity: O(W + M) to store keywords and hotel scores.

Approach

Algorithm: String Tokenization & Custom Sorting.
Data Structure: HashSet<String> for keywords, HashMap<Integer, Integer> for scoring.
Complexity:
Time: O(W + \sum \text{review lengths} + M \log M)
Space: O(W + M)

Implementation

Wrap Up

Advanced Topics

Readability vs. Performance: In a production environment, regex replaceAll can be slow. A manual character-by-character scan of the review strings would be more performant for high-throughput systems.
Scaling to Distributed Systems: If the number of reviews is massive (millions), this problem can be solved using MapReduce. The "Map" phase calculates scores per review, and the "Reduce" phase aggregates scores per hotelId.
Memory Optimization: If K is much smaller than the number of unique hotels, using a Min-PriorityQueue of size K would optimize the sorting phase from O(M \log M) to O(M \log K).
Concurrency: Processing reviews can be parallelized using Java Parallel Streams since each review's score calculation is independent.