The Question
SQLRolling Window Duplicate Transaction Detection
Given a dataset of financial transactions, identify the number of 'accidental' repeat payments. A repeat payment is defined as a transaction that occurs at the same merchant, using the same credit card, for the same amount, within a 10-minute rolling window of a previous successful transaction. Ensure that only the subsequent (redundant) transactions are counted, excluding the initial valid payment from the total count.
Spark
Window Function
LAG
CTE
Timestamp Arithmetic
Questions & Insights
Clarifying Questions
What defines the "10-minute window"? Does the 10-minute limit apply strictly to the first transaction in a sequence, or is it a rolling window where each transaction is compared to the one immediately preceding it?
Assumption: We will treat it as a rolling window (i.e., if transaction B is 5 minutes after A, it's a duplicate; if transaction C is 5 minutes after B, it's also a duplicate).
Are there any unique identifiers we should ignore? Since
transaction_id is unique, we should exclude it from the grouping logic but use it to ensure we aren't comparing a row to itself.How should we handle precise timestamp differences? Are we looking for exactly \le 10 minutes (600 seconds) or < 10 minutes?
Assumption: Standard business logic usually implies "within" as inclusive (\le 10 minutes).
Schema Assumptions:
transaction_id: Primary Key (Integer).merchant_id, credit_card_id: Foreign Keys to respective dimensions (Integer).amount: Integer (representing cents or a whole currency unit).transaction_timestamp: Timestamp/Datetime.No NULLs in
merchant_id, credit_card_id, or amount (otherwise, duplicates would be logically ambiguous).Thinking Process
Step 1: Partitioning. To identify "repeats," we must group transactions that share the same
merchant_id, credit_card_id, and amount.Step 2: Sequencing. Within each group, transactions must be ordered chronologically by
transaction_timestamp.Step 3: Comparison. For every transaction (except the first one in a sequence), we need to look back at the immediately preceding transaction's timestamp.
Step 4: Calculation. In Spark SQL, subtracting two timestamps can be done by casting them to
LONG (Unix epoch in seconds) to find the difference in seconds. 10 minutes = 600 seconds.Step 5: Filtering. If the difference between the current and previous transaction is \le 600, mark it as a "repeated payment."
Step 6: Aggregation. Count the total number of records that satisfy the "repeated" flag.
Implementation Breakdown
Problem Set
Goal: Count transactions where (
merchant_id, credit_card_id, amount) are identical to a previous transaction occurring within 10 minutes.Constraint: Do not count the first transaction of the sequence.
Edge Cases:
Three transactions in a row (0m, 5m, 10m): Should count as 2 repeats.
Three transactions (0m, 11m, 12m): Should count as 1 repeat (the one at 12m).
Exact 10-minute difference (600 seconds): Included in the count.
Approach
Technologies: Spark SQL (v3.x+).
Functions:
LAG() window function to retrieve the previous row's timestamp within the partition.CAST(... AS LONG) or UNIX_TIMESTAMP() for interval math.CTE (Common Table Expressions) for readability and modularity.Execution Strategy: The query involves a Window operation which triggers an exchange (Shuffle) to colocate data by
credit_card_id, merchant_id, and amount. This is efficient provided the cardinality of these keys is high enough to prevent data skew.Implementation
Wrap Up
Advanced Topics
Data Skew: If one specific
merchant_id or credit_card_id (like a massive retail chain or a generic test card) has millions of transactions, the PARTITION BY will cause a shuffle bottleneck on a single Spark executor. To mitigate this, one could add a "salt" to the partition key, though this makes the windowing logic significantly more complex (requiring overlapping ranges).Indexing & Partitioning: In a Spark/Delta Lake environment, the table should be Z-Ordered by
credit_card_id or merchant_id to colocate related data on disk, reducing the I/O overhead during the shuffle phase.Broadcast Joins: If we were comparing these transactions against a list of "Known Refunded Merchants," we would use a broadcast join to keep the windowing operation performant.
Stateful Streaming: If this were a real-time use case (e.g., fraud detection), using Spark Structured Streaming with
mapGroupsWithState would be more appropriate than a batch SQL query to detect these duplicates as they arrive.