The Question
SQL

Consecutive SaaS Subscription Retention

Given a table filed_taxes with columns filing_id, user_id, filing_date, and product, identify users who have demonstrated loyalty by using any version of the 'TurboTax' product suite for at least three consecutive calendar years. Note that 'TurboTax' may appear in the product column under various names (e.g., 'TurboTax Deluxe 2023', 'TurboTax Free Edition'). A user may have multiple entries, but only one filing per year should be considered for the continuity count. Output the list of unique user IDs sorted numerically/alphabetically.
Clickhouse
Window Function
CTE
Gaps and Islands
Vectorized Execution
Questions & Insights

Clarifying Questions

What defines "any version of TurboTax"? Does this imply a prefix match (e.g., TurboTax%), a case-insensitive search, or is there a specific dimension table for products?
Assumption: We will use a case-insensitive prefix match (ILIKE 'TurboTax%') to capture "TurboTax Free", "TurboTax Deluxe", etc.
How is "Year" defined? Is it the calendar year of the filing_date?
Assumption: We will extract the year from the filing_date.
Can a user file for multiple tax years in a single calendar year? (e.g., filing back taxes for 2021 and 2022 in 2023).
Assumption: Continuity is measured on the calendar year of filing_date. A user who files in 2021, 2022, and 2023 meets the criteria regardless of which tax year each filing covers, although the two usually align.
Data Volume & Engine: ClickHouse is an OLAP database. Is the table partitioned by date?
Assumption: The table is large. We should focus on efficient column reads and leverage ClickHouse's vectorized execution.
Data Model Assumptions:
filed_taxes is a Fact Table (Event grain: one row per filing).
user_id is a String/LowCardinality(String).
filing_date is a DateTime.
There are no NULL values in user_id or filing_date for valid filings.
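As a quick check of the matching and year-extraction assumptions, the sketch below runs the case-insensitive prefix match and toYear() on a few literal values. The product names beyond those quoted in the question are made up for illustration.

```sql
-- Sample product names are illustrative; only the ILIKE pattern comes from the assumption above.
SELECT
    product,
    product ILIKE 'TurboTax%' AS is_turbotax   -- 1 = matches, 0 = does not match
FROM
(
    SELECT arrayJoin([
        'TurboTax Deluxe 2023',
        'turbotax free edition',   -- lower case still matches with ILIKE
        'Intuit TurboTax',         -- no match: 'TurboTax' is not a prefix here
        'H&R Block Basic'          -- no match: different product
    ]) AS product
);

-- Year extraction from a DateTime value
SELECT toYear(toDateTime('2023-04-15 10:30:00')) AS filing_year;  -- 2023
```

If product names can carry a vendor prefix such as 'Intuit TurboTax', the pattern would need to become '%TurboTax%', which is worth confirming with the interviewer.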

Thinking Process

Filter and Transform: First, filter for "TurboTax" products. Extract the year from filing_date. Since a user might technically file multiple times in a year (e.g., an amendment), we should deduplicate to one record per (user_id, year).
Identify Continuity (Gaps and Islands): To find consecutive years, we use the standard "difference from row number" technique:
Sort filings by user_id and year.
Subtract a monotonic sequence (ROW_NUMBER()) from the year.
If years are consecutive (e.g., 2020, 2021, 2022), the difference (2020-1, 2021-2, 2022-3) remains constant (2019).
Aggregation: Group by the user_id and this "Group Identifier" (the difference). Count the occurrences.
Final Filter: Retain only those groups where the count is at least 3.
Output: Return unique user_ids in ascending order.
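The row-number difference is easiest to see on a tiny example. The sketch below uses made-up data for two hypothetical users, u1 (three consecutive years) and u2 (a gap after 2021), and prints the group identifier for each row. The column names rn and island_id are illustrative.

```sql
-- Hypothetical sample: u1 has three consecutive years, u2 has a gap after 2021.
SELECT
    user_id,
    filing_year,
    rn,
    toInt32(filing_year) - toInt32(rn) AS island_id   -- constant within a consecutive run
FROM
(
    SELECT
        user_id,
        filing_year,
        row_number() OVER (PARTITION BY user_id ORDER BY filing_year) AS rn
    FROM values(
        'user_id String, filing_year UInt16',
        ('u1', 2020), ('u1', 2021), ('u1', 2022),
        ('u2', 2020), ('u2', 2021), ('u2', 2023)
    )
)
ORDER BY user_id, filing_year;

-- user_id  filing_year  rn  island_id
-- u1       2020         1   2019
-- u1       2021         2   2019
-- u1       2022         3   2019
-- u2       2020         1   2019
-- u2       2021         2   2019
-- u2       2023         3   2020
```

u1 has one group of size 3 and qualifies. u2's gap shifts the identifier from 2019 to 2020, splitting the rows into groups of size 2 and 1, so u2 does not qualify.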
Implementation Breakdown

Problem Set

Goal: Find user_ids with at least 3 consecutive years of TurboTax use.
Constraints: ClickHouse syntax, ascending order of output.
Edge Cases:
Users filing twice in one year (handled by DISTINCT).
Gaps in years (e.g., 2020, 2021, 2023) should not count as 3 consecutive.
Multiple versions of TurboTax in different years (handled by the ILIKE prefix match).
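The duplicate-filing and product-name edge cases can be exercised with a few hand-written rows. This is a sketch with made-up filings for a hypothetical user u1; it runs only the filter and deduplication stage, so the surviving (user_id, year) pairs are visible.

```sql
-- Made-up filings for one hypothetical user, covering the edge cases listed above.
SELECT DISTINCT
    user_id,
    toYear(filing_date) AS filing_year
FROM
(
    SELECT 'u1' AS user_id, toDateTime('2020-04-01 00:00:00') AS filing_date, 'H&R Block Basic' AS product
    UNION ALL
    SELECT 'u1', toDateTime('2021-04-10 00:00:00'), 'TurboTax Free Edition'
    UNION ALL
    SELECT 'u1', toDateTime('2021-09-02 00:00:00'), 'TurboTax Free Edition'   -- amendment, same year
    UNION ALL
    SELECT 'u1', toDateTime('2022-04-12 00:00:00'), 'turbotax deluxe 2022'    -- lower-case variant
    UNION ALL
    SELECT 'u1', toDateTime('2023-04-11 00:00:00'), 'TurboTax Premier'
)
WHERE product ILIKE 'TurboTax%'
ORDER BY user_id, filing_year;

-- user_id  filing_year
-- u1       2021
-- u1       2022
-- u1       2023
```

The 2020 row is dropped because it is not a TurboTax product, the two 2021 filings collapse into one row, and the lower-case 2022 entry is kept only because the match is case-insensitive.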

Approach

Technologies: ClickHouse SQL.
Functions: toYear() for date extraction, row_number() window function, ILIKE for pattern matching.
CTE Usage: To keep the logic modular and readable.
Execution Strategy: The window function needs each user's rows sorted by year. Because the product filter and the per-year deduplication run first, far fewer rows reach that sort than the raw table contains.

Implementation
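
A minimal sketch of the query, following the steps in the Thinking Process. It assumes the filed_taxes columns from the question, the ILIKE prefix match from the clarifying assumptions, and a ClickHouse version with window function support. The CTE names (yearly, numbered, streaks) and column aliases are made up for illustration.

```sql
WITH
    -- Step 1: keep TurboTax filings and reduce to one row per (user, calendar year)
    yearly AS
    (
        SELECT DISTINCT
            user_id,
            toYear(filing_date) AS filing_year
        FROM filed_taxes
        WHERE product ILIKE 'TurboTax%'
    ),
    -- Step 2: number each user's years in ascending order
    numbered AS
    (
        SELECT
            user_id,
            filing_year,
            row_number() OVER (PARTITION BY user_id ORDER BY filing_year) AS rn
        FROM yearly
    ),
    -- Step 3: year minus row number is constant within a run of consecutive years
    streaks AS
    (
        SELECT
            user_id,
            toInt32(filing_year) - toInt32(rn) AS island_id,
            count() AS streak_len
        FROM numbered
        GROUP BY user_id, island_id
    )
-- Step 4: a user qualifies if any run spans at least three years
SELECT DISTINCT user_id
FROM streaks
WHERE streak_len >= 3
ORDER BY user_id;
```

The final DISTINCT matters because a user can have more than one qualifying run (for example 2015 to 2017 and 2020 to 2023) and should still appear once.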

Wrap Up

Advanced Topics

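Indexing: In ClickHouse, the ORDER BY clause of the CREATE TABLE statement defines the sorting key (and, by default, the primary key). If this query is frequent, that key should ideally include user_id, which stores each user's filings together and lets ClickHouse skip entire granules whenever a query filters on user_id. This query has no such filter, so the benefit here is mainly data locality for the per-user grouping.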
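As an illustration of the sorting-key idea, a table definition might look like the sketch below. The column types for filing_id and product, the MergeTree engine, and the yearly partitioning are assumptions, since the question only names the columns.

```sql
-- Hypothetical DDL: types, engine, and partitioning are assumptions, not given in the question.
CREATE TABLE filed_taxes
(
    filing_id   UInt64,
    user_id     String,
    filing_date DateTime,
    product     LowCardinality(String)   -- assumes a small set of product names
)
ENGINE = MergeTree
PARTITION BY toYear(filing_date)
ORDER BY (user_id, filing_date);         -- sorting key; also the primary key by default
```

Putting user_id first keeps each user's rows adjacent within a part. Whether to lead with user_id or with a date column depends on which filters dominate the workload.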
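Optimization: If the dataset is massive, using row_number() can be memory-intensive. A ClickHouse-specific alternative is to collect each user's years with groupArray and arraySort and then check for adjacent years inside the array, which sometimes performs better in ClickHouse's vectorized engine than standard window functions. neighbor(filing_year, -1) could also be used to check for current_year - 1 == previous_year, but it only sees rows within the data block being processed and does not respect per-user boundaries, so it needs extra care.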
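One way the array-based variant could look is sketched below. It swaps groupArray for groupUniqArray so the deduplication happens during aggregation, and it relies on the fact that in a sorted array of distinct years, years[i + 2] = years[i] + 2 can only hold when the three entries are consecutive. The alias years is illustrative, and the approach has not been benchmarked against the window-function version.

```sql
SELECT user_id
FROM
(
    SELECT
        user_id,
        -- distinct filing years per user, in ascending order
        arraySort(groupUniqArray(toYear(filing_date))) AS years
    FROM filed_taxes
    WHERE product ILIKE 'TurboTax%'
    GROUP BY user_id
)
WHERE arrayExists(
    -- true when positions i, i + 1, i + 2 hold three consecutive years
    i -> i + 2 <= length(years) AND years[i + 2] = years[i] + 2,
    arrayEnumerate(years)
)
ORDER BY user_id;
```

This version needs a single aggregation per user and no window sort, at the cost of holding each user's array of years in memory during the GROUP BY.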
Memory Management: For extremely large DISTINCT or GROUP BY operations, we might need to adjust max_bytes_before_external_group_by to allow the query to spill to disk if RAM is constrained.
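A hedged sketch of how the setting could be applied at query level follows. The threshold of roughly 10 GB is an arbitrary illustrative value. The setting governs GROUP BY, so the deduplication step is written here as a GROUP BY rather than a DISTINCT.

```sql
-- Deduplicate to one row per (user, year) with GROUP BY so the aggregation can spill to disk.
SELECT
    user_id,
    toYear(filing_date) AS filing_year
FROM filed_taxes
WHERE product ILIKE 'TurboTax%'
GROUP BY user_id, filing_year
SETTINGS max_bytes_before_external_group_by = 10000000000;  -- illustrative threshold, tune to available RAM
```

The same SETTINGS clause can be appended to the full query. A suitable value depends on the memory available to the server and on the overall max_memory_usage limit.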