The Question
SQLConsecutive Product Usage Streak
Given a table
filed_taxes with columns filing_id, user_id, filing_date, and product, find all users who have filed their taxes using any 'TurboTax' product for at least three consecutive years. Note that 'TurboTax' products may have different version names (e.g., 'TurboTax Basic', 'TurboTax Deluxe'). Assume each user files at most once per calendar year. Return the results sorted by user_id.PostgreSQL
CTE
Window Function
EXTRACT
ILIKE
Questions & Insights
Clarifying Questions
How is "TurboTax" identified? Does the
product string start with "TurboTax", or can it appear anywhere? (Assumption: We will use a case-insensitive prefix match LIKE 'TurboTax%' to capture "TurboTax Free", "TurboTax Deluxe", etc.).What defines the "Year" of filing? Is it the calendar year of the
filing_date or a specific tax year column? (Assumption: We will use the calendar year extracted from filing_date. We also assume a user only files once per calendar year per the prompt's constraint).How are consecutive years handled? If a user files in 2020, 2021, 2022, and 2023, do they count? (Yes, "3 or more").
Schema Assumptions:
filing_id: Primary Key (Integer).user_id: Foreign Key/Identifier (Varchar).filing_date: Timestamp/Datetime.product: Varchar (e.g., 'TurboTax Deluxe', 'H&R Block').Relationship: 1:N (One user can have many filing records across different years).
Thinking Process
Filter and Pre-process: First, we must narrow the dataset to only include TurboTax-related filings. We should also extract the
YEAR from the filing_date.Deduplication (Defensive): Although the prompt states users file once a year, in a real-world scenario, data might be messy. I will treat the data as unique per
(user_id, year) for the "consecutive" logic.Sequencing/Windowing: To find consecutive years, we have two primary methods:
Method A (LAG/LEAD): Look at the previous and the one-before-previous year for each record. If
CurrentYear - 1 == PrevYear and PrevYear - 1 == PrevPrevYear, we have a 3-year streak.Method B (Difference Grouping): Subtract a row number from the year. Constant differences indicate consecutive sequences.
Selection: For exactly "3 or more",
LAG is highly readable and performant for a fixed window of 3.Final Aggregation: Since a user might have multiple 3-year streaks (e.g., 2010-2012 and 2018-2020), we use
DISTINCT or GROUP BY to ensure each user_id is listed only once.Implementation Breakdown
Problem Set
Goal: Identify
user_ids with \ge 3 consecutive years of TurboTax usage.Constraints: Use PostgreSQL syntax. Output must be ordered by
user_id.Edge Cases:
Users with exactly 3 years.
Users with > 3 years.
Users with gaps (e.g., 2019, 2020, 2022) — these should be excluded.
Different TurboTax versions used in consecutive years (e.g., 'TurboTax Basic' in Year 1, 'TurboTax Premier' in Year 2).
Approach
CTEs: To organize the logic into discrete steps (Filtering -> Windowing -> Final Selection).
Window Functions:
LAG() to access data from preceding rows within the same user_id partition.Date Functions:
EXTRACT(YEAR FROM ...) or date_part for PostgreSQL specific year extraction.Filtering:
LIKE operator for string pattern matching.Implementation
Wrap Up
Advanced Topics
Indexing: To optimize this query, a composite index on
(product, filing_date, user_id) would be ideal. Since we filter by product first, having it as the leading column in a B-Tree index allows the engine to ignore non-TurboTax rows immediately.Gaps and Islands (Scalability): If the requirement changed to "Identify users with a streak of at least N years" (where N is large), the
LAG approach becomes cumbersome. The "Difference Grouping" method would be superior: SELECT user_id, count(*) as streak_length
FROM (
SELECT user_id, filing_year,
filing_year - ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY filing_year) as grp
FROM TurboTaxFilings
) t
GROUP BY user_id, grp
HAVING count(*) >= 3Execution Plan: In PostgreSQL, the
DISTINCT in the first CTE might trigger a HashAggregate. The window functions will trigger a WindowAgg node, which usually requires the data to be sorted by user_id and filing_year. If the table is massive, this Sort could be a bottleneck (spilling to disk).