The Question
Coding

Median from Data Stream

Design a data structure that supports receiving a stream of integers and efficiently retrieving the median of all elements seen so far. The MedianFinder class should implement: 1. void addNum(int num): Adds an integer to the data set. 2. double findMedian(): Returns the median of all current elements. If the number of elements is even, the median is the mean of the two middle values. Constraints: - At most 10^5 calls will be made to addNum and findMedian. - Values of integers are within the range [-10^5, 10^5]. - Optimize for both time complexity of insertion and retrieval.
C++
Heaps
Priority Queue
Questions & Insights

Clarifying Questions

Input Scale & Frequency: What is the expected number of insertions versus the frequency of median queries? (e.g., millions of insertions with occasional queries, or a 1:1 ratio?)
Value Constraints: Are the input integers within a specific range (e.g., 0-100), or can they be any arbitrary int or double? This determines if we can use a frequency array or need a heap-based approach.
Memory Limits: Should the solution handle data that exceeds RAM, or can we assume the stream fits in memory?
Dynamic Updates: Do we only support add operations, or is there a need for remove or update operations?

Assumptions

The stream fits in memory (N \approx 10^5 to 10^6 operations).
Values are standard integers.
We only need to support adding elements and retrieving the median.
If the count is even, the median is the average of the two middle elements.

Thinking Process

To find the median efficiently, we need to partition the data into two halves: the "smaller" half and the "larger" half.
The median will always be either the maximum element of the smaller half, the minimum element of the larger half, or the average of both.
Using a Max-Heap to store the smaller half and a Min-Heap to store the larger half allows us to access these "middle" elements in O(1) time.
During each insertion, we must maintain two invariants:
Ordering: Every element in the Max-Heap (left) must be \le every element in the Min-Heap (right).
Balance: The size difference between the two heaps must not exceed 1.
Implementation Breakdown

Problem Set

Functional Requirements:
void addNum(int num): Adds an integer from the data stream to the data structure.
double findMedian(): Returns the median of all elements seen so far.
Constraints:
Time Complexity: O(\log N) per addNum, O(1) per findMedian.
Space Complexity: O(N) to store all elements.

Approach

Algorithm: Two-Heap Balancing.
Data Structure: std::priority_queue (Max-Heap) and std::priority_queue with std::greater (Min-Heap).
Complexity:
Time: O(\log N) for insertion, O(1) for lookup.
Space: O(N).

Implementation

Wrap Up

Advanced Topics

Follow-up: Small Range Integers: If values are in a known range (e.g., 0-100), we can use a Frequency Array (Bucket Sort approach). We keep a count of each number and a pointer/counter to the median index, achieving O(1) insertion and O(\text{Range}) lookup.
Follow-up: 99% of Numbers in Small Range: Use the frequency array for the 0-100 range and two heaps for the outliers.
Scalability (Distributed Systems): For massive data streams where data is distributed across nodes, we use Quantile Sketches (like T-Digest or KLL sketches) to provide an approximate median with high accuracy and much lower memory footprint than O(N).
Readability vs. Performance: In C++, std::priority_queue is very efficient, but for high-performance low-latency trading, one might use a fixed-size circular buffer if the window is sliding, or a manual heap implementation to avoid allocations.