The Question

Finding Median Rows per Partition

Given a table Employee with columns id (INT, PK), company (VARCHAR), and salary (INT), write a SQL query to identify the specific rows that represent the median salary for each company.

Rules for median calculation:

1. Sort salaries in ascending order for each company.

2. In case of identical salaries, use id as a secondary ascending sort key to break ties.

3. If a company has an odd number of employees, return the single middle row.

4. If a company has an even number of employees, return the two middle rows.

The result should include the id, company, and salary columns for these specific median records.

PostgreSQL

CTE

Window Function

Questions & Insights

Clarifying Questions

How should the median be handled for an even number of records? In standard statistics, the median is the average of the two middle elements. However, the prompt asks for the specific rows that represent the median salary, so for a company with an even number of employees (e.g., $N = 6$) we should return the two middle records (ranks 3 and 4).

Are there any performance constraints? Given the nature of the problem, we assume the dataset could be large enough that a self-join approach (comparing each record to every other record to find the middle) would be $O(N^2)$ and inefficient. We should aim for a window function approach, which is generally $O(N \log N)$.

What defines the "middle" when sorting? The prompt explicitly states to sort by salary and break ties by id. This ensures a deterministic order, which is crucial for identifying specific "rows" as the median.

Data Model Assumptions:

Employee Table: This is a Fact Table representing a snapshot of employee compensation.

Primary Key: id is the PK.

Relationships: 1:N relationship between company and id.

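Data Quality: We assume salary and company are NOT NULL. If salary could be NULL, those rows would still be numbered by ROW_NUMBER() (PostgreSQL sorts NULLs last in ascending order) and counted by COUNT(*), which would shift the median position, so they would need to be filtered out before ranking.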
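The assumptions above can be written down as table constraints. The following is a minimal sketch, not the author's schema: it uses Python's built-in sqlite3 module only so that it runs without a database server, the VARCHAR length of 255 is an assumption, and `insert_is_rejected` is a hypothetical helper introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        id      INTEGER PRIMARY KEY,   -- INT PK from the problem statement
        company VARCHAR(255) NOT NULL, -- length 255 is an assumption
        salary  INTEGER NOT NULL       -- NOT NULL encodes the data-quality assumption
    )
    """
)
conn.execute("INSERT INTO Employee VALUES (1, 'A', 100)")


def insert_is_rejected(row):
    """Return True if inserting the (id, company, salary) row violates a constraint."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


print(insert_is_rejected((2, "A", None)))  # NULL salary violates NOT NULL
print(insert_is_rejected((1, "B", 200)))   # duplicate id violates PRIMARY KEY
```

With these constraints in place, the median query does not need a separate NULL filter.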

Thinking Process

Define the Order: To find a median, we first need to define the sequence. We will use ROW_NUMBER() partitioned by company and ordered by salary and id.

Determine Group Size: We need to know how many employees are in each company to calculate which row numbers correspond to the median. We can use COUNT(*) OVER(PARTITION BY company) for this.

Identify Median Indices:

Let $N$ be the total count of employees in a company.

If $N$ is odd, the median index is $(N+1)/2$.

If $N$ is even, the median indices are $N/2$ and $N/2 + 1$.

Math Logic for SQL Filtering:

A robust way to capture both odd and even middle points in a single expression is:

row_number BETWEEN count / 2.0 AND count / 2.0 + 1.

Example $N = 5$: rn between 2.5 and 3.5 $\rightarrow$ rn = 3.

Example $N = 6$: rn between 3.0 and 4.0 $\rightarrow$ rn = 3, 4.

Performance Strategy: Use a Common Table Expression (CTE) to calculate the window functions in a single pass over the data, then filter in the outer query. This avoids multiple scans of the base table.
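To check that the single range expression really matches the odd/even definition, the two can be compared for a range of group sizes. This is a small sketch in Python rather than SQL; the function names are made up for illustration.

```python
from typing import List


def median_row_numbers(n: int) -> List[int]:
    """Row numbers rn in 1..n that satisfy n / 2.0 <= rn <= n / 2.0 + 1."""
    lower = n / 2.0
    upper = n / 2.0 + 1
    return [rn for rn in range(1, n + 1) if lower <= rn <= upper]


def expected_median_indices(n: int) -> List[int]:
    """Median positions taken directly from the odd/even definition."""
    if n % 2 == 1:
        return [(n + 1) // 2]
    return [n // 2, n // 2 + 1]


# The range filter and the case-by-case definition agree for every size tried.
for n in range(1, 51):
    assert median_row_numbers(n) == expected_median_indices(n)

print(median_row_numbers(5))  # [3]
print(median_row_numbers(6))  # [3, 4]
```

Dividing by 2.0 rather than 2 is what makes this work: with integer division, $N = 5$ would give the range 2 to 3 and return two rows instead of one.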

Implementation Breakdown

Problem Set

Goal: Retrieve the specific records (id, company, salary) that represent the median salary for every company.

Ties: Salaries are not unique; use id as the secondary sort key to ensure a stable, unique ranking.

Even/Odd: Handle both cases (1 row for odd, 2 rows for even) without separate conditional logic if possible.

Approach

Window Functions: ROW_NUMBER() to assign an ordinal rank and COUNT(*) to get the partition size.

CTE: To encapsulate the windowing logic and keep the WHERE clause clean.

Algebraic Filtering: Using a numeric range comparison to handle the parity (odd/even) of the counts.

Complexity: $O(N \log N)$ due to the sort required by the window partition. Space complexity is $O(N)$ for the intermediate result set.

Implementation
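The approach described above can be sketched as follows. This is one possible implementation, not a definitive one: the CTE name `ranked`, the aliases `rn` and `cnt`, the `median_rows` helper, and the sample data are all made up for illustration. The query is wrapped in Python's built-in sqlite3 module only so that it runs without a server; it uses standard window-function syntax and should work in PostgreSQL as well.

```python
import sqlite3

MEDIAN_ROWS_SQL = """
WITH ranked AS (
    SELECT
        id,
        company,
        salary,
        -- position of the row within its company, ties broken by id
        ROW_NUMBER() OVER (PARTITION BY company ORDER BY salary, id) AS rn,
        -- number of employees in the company
        COUNT(*) OVER (PARTITION BY company) AS cnt
    FROM Employee
)
SELECT id, company, salary
FROM ranked
WHERE rn BETWEEN cnt / 2.0 AND cnt / 2.0 + 1
ORDER BY company, salary, id;
"""


def median_rows(rows):
    """Load (id, company, salary) tuples into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Employee ("
            "id INTEGER PRIMARY KEY, company TEXT NOT NULL, salary INTEGER NOT NULL)"
        )
        conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", rows)
        return conn.execute(MEDIAN_ROWS_SQL).fetchall()
    finally:
        conn.close()


# Made-up sample data: company A has 5 employees (odd), company B has 4 (even).
SAMPLE = [
    (1, "A", 300), (2, "A", 100), (3, "A", 200), (4, "A", 200), (5, "A", 500),
    (6, "B", 50), (7, "B", 70), (8, "B", 70), (9, "B", 90),
]

for row in median_rows(SAMPLE):
    print(row)
```

For the sample data this prints (4, 'A', 200), (7, 'B', 70), and (8, 'B', 70). Company A has two employees with a salary of 200, and the id tie-break puts id 4 at position 3; company B returns its two middle rows.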

Wrap Up

Advanced Topics

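Indexing Strategy: To optimize this query in PostgreSQL, a composite index on (company, salary, id) is a good candidate. Because the index order matches the PARTITION BY and ORDER BY clauses and covers every column the query reads, the planner can use an index-only scan and skip the explicit sort step (the "Sort" node in the explain plan). Whether it actually does so depends on table statistics and on how much of the table is marked all-visible, so confirm with EXPLAIN.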

Handling Large Scale Data: If the table is massive (billions of rows), the PARTITION BY might cause a massive "Shuffle" or "Sort" operation. In a distributed PostgreSQL environment (like Citus), ensuring the table is distributed by company would keep the median calculation local to each node.

Alternative Median Methods: PostgreSQL provides the ordered-set aggregates percentile_cont(0.5) and percentile_disc(0.5), used with WITHIN GROUP (ORDER BY salary), but these return the median value, not the specific rows that constitute it. When the task is to return the underlying record (the "row"), window functions are the standard approach.

Execution Plan: The optimizer will likely use a WindowAgg node fed by a Sort. For large tables, check the EXPLAIN (ANALYZE) output for a sort method of "external merge" with a Disk figure, which means the sort spilled to disk because work_mem was too small to hold it.