The Question
SQLConsecutive High Attendance Records
You are given a table
Stadium containing stadium visit data with columns id (integer), visit_date (date), and people (integer). The id is an auto-incrementing value that corresponds with the chronological order of the visit_date. Write a SQL query to identify all sequences of three or more consecutive IDs where each visit in that sequence had a population of 100 or more people. The result should include the original row details and be ordered by the visit date.PostgreSQL
CTE
Window Function
Gaps and Islands
Questions & Insights
Clarifying Questions
Are there gaps in the `id` sequence? The problem states "consecutive IDs." In Gaps and Islands problems, we must clarify if "consecutive" means id, id+1, id+2 or simply the chronological order. Based on the prompt "As the id increases, the date increases," we will treat the integer
id as the sequence generator.Can the same date have multiple entries? The prompt states
visit_date has unique values, so it is a natural key or a unique constraint.What is the scale of the data? For a stadium tracking daily visits, the dataset is likely in the magnitude of thousands or tens of thousands of rows. Standard window functions in PostgreSQL are highly efficient for this volume.
What if a sequence is longer than 3? The requirement is "three or more." Any row belonging to a cluster of \ge 3 consecutive IDs meeting the criteria must be returned.
Assumptions:
Primary Key:
id is the unique identifier and follows a strictly increasing chronological order.Table Type: This is an Event Fact Table capturing attendance metrics per day.
Data Quality: We assume
people is non-null. If NULLs exist, they would be excluded by the >= 100 predicate.Thinking Process
To solve this "Gaps and Islands" problem, I will follow a logical progression:
Filter First: We only care about days where
people >= 100. Filtering early reduces the dataset for window function processing.Identify Islands: Once filtered, the
id values will no longer be strictly consecutive if some days had < 100 people. For example, if IDs are [1, 2, 3, 5, 6, 7], there is a gap at 4.The "Difference" Trick: To group these "islands," I will subtract a monotonic sequence (
ROW_NUMBER) from the id. For
id values 1, 2, 3, the row numbers are 1, 2, 3. Differences: 1-1=0, 2-2=0, 3-3=0.For
id values 5, 6, 7, the row numbers are 4, 5, 6. Differences: 5-4=1, 6-5=1, 7-6=1.Rows sharing the same difference belong to the same consecutive block.
Aggregate & Filter: Use another window function (
COUNT(*) OVER(...)) to count how many records exist in each island.Final Selection: Filter for counts \ge 3 and sort by
visit_date.Implementation Breakdown
Problem Set
Goal: Find blocks of \ge 3 consecutive IDs where
people >= 100.Constraint 1: Attendance threshold \ge 100.
Constraint 2: Consecutive IDs (minimum 3).
Constraint 3: Order by
visit_date.Edge Cases:
Exactly 3 rows in a sequence.
More than 3 rows (e.g., 5 rows should all be returned).
Multiple separate sequences in the same table.
Approach
CTEs (Common Table Expressions): To break the logic into readable steps (Filtering -> Grouping -> Counting).
Window Functions:
ROW_NUMBER(): To generate a sequence for the difference calculation.COUNT(...) OVER(...): To determine the size of each island without collapsing rows.PostgreSQL Specifics: Standard ANSI SQL compliance ensures this works perfectly in PostgreSQL.
Computational Cost:O(N \log N) due to the sorting required for window functions, where N is the number of rows with \ge 100 people.
Implementation
Wrap Up
Advanced Topics
Indexing: An index on
people would speed up the initial filter. However, since the final result requires an order by id (via ROW_NUMBER), a composite index on (people, id) would be ideal to allow the engine to filter and then scan in order.Performance in PostgreSQL: If the table is massive,
ROW_NUMBER() can be memory-intensive. However, for a "Stadium" table, the row count is unlikely to exceed the capacity of a standard work_mem setting.Alternative - LEAD/LAG: One could use
LEAD and LAG to check if the current row + 1 and current row + 2 meet the criteria, or current - 1 and current + 1, etc. This is often more verbose and harder to maintain than the Gaps and Islands approach, as it requires hardcoding the "3 or more" logic into multiple conditional branches.Parallelism: PostgreSQL can parallelize the
FilteredStadium scan and the window function computation if the dataset size warrants it.