The Question

Identify Retried Payment Transactions

Given a transactions table with columns transaction_id, merchant_id, credit_card_id, amount, and transaction_timestamp, write a query to identify the total number of 'accidental' repeat payments. A repeat payment is defined as a transaction that occurs at the same merchant, with the same credit card, for the exact same amount, within a 10-minute window of a previous transaction. Note: In a sequence of three identical transactions each within 10 minutes of the last, the second and third should be counted as repeats, but the first should not.

Tags: Spark, CTE, Window Function, LAG

Questions & Insights

Clarifying Questions

How should we handle a chain of duplicates? If a card is charged at 10:00, 10:05, and 10:08 for the same amount at the same merchant, are both the 10:05 and 10:08 transactions considered "repeated"?

Assumption: Yes. Any transaction that occurs within 10 minutes of its immediate predecessor (sharing the same merchant, card, and amount) is a "repeated" event. In this example, the count would be 2.

Is the 10-minute window inclusive?

Assumption: "Within 10 minutes" typically means $\Delta t \le 10$ minutes.

Are there unique identifiers we should worry about?

Assumption: transaction_id is the primary key, so no two rows share an ID. The business logic for "repeated" ignores the ID and compares only merchant, card, amount, and timestamp.
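To make the chain and inclusivity assumptions concrete, here is a minimal plain-Python sketch (not Spark code) for a single merchant/card/amount group. The helper name `count_repeats` and the sample date are hypothetical, and the comparison is inclusive, so a gap of exactly 10 minutes counts.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # inclusive threshold, per the assumption above


def count_repeats(timestamps):
    """Count transactions within WINDOW of their immediate predecessor.

    Assumes every timestamp belongs to the same merchant, card, and amount.
    """
    ordered = sorted(timestamps)
    # Pair each transaction with the one right before it.
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev <= WINDOW)


day = datetime(2024, 1, 1)  # arbitrary illustrative date
chain = [day.replace(hour=10, minute=m) for m in (0, 5, 8)]
print(count_repeats(chain))  # 10:05 and 10:08 are repeats; 10:00 is not
```

Because each transaction is compared only with its immediate predecessor, the first transaction in a chain is never counted.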

Schema Assumptions:

transactions: Fact table containing immutable event logs.

transaction_timestamp: Spark TimestampType.

amount: Represented as an integer (likely cents) to avoid floating-point precision issues during comparison.
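A quick illustration of why integer cents are a safer comparison key than floating-point dollars. This is plain Python rather than Spark, but the same IEEE 754 double-precision behavior applies to a Spark DoubleType column; the variable names are illustrative only.

```python
# Floating-point dollars: two "equal" amounts can differ in the last bits.
dollars_a = 0.10 + 0.20
dollars_b = 0.30
print(dollars_a == dollars_b)  # False

# Integer cents: equality is exact.
cents_a = 10 + 20
cents_b = 30
print(cents_a == cents_b)  # True
```

Since amount is part of the partition key, an inexact amount would silently place two matching charges in different partitions.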

Thinking Process

Grouping (Partitioning): To identify duplicates, we must look at transactions belonging to the same merchant_id, credit_card_id, and amount. These form our logical partitions.

Ordering: Within these partitions, we must process transactions chronologically using transaction_timestamp.

Sequential Comparison: The core of the problem is comparing a row to its "neighbor." The LAG() window function is the most efficient way to access the preceding row's timestamp without performing an expensive self-join on a non-equi condition (which would lead to $O(N^2)$ complexity per group).

Time Arithmetic: In Spark SQL, subtracting two timestamps returns an interval. Alternatively, we can convert both to Unix timestamps (seconds) and perform simple integer subtraction. 10 minutes = 600 seconds.

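Filtering & Counting: Once we have the time difference, we filter for rows where the difference is $\le 600$ seconds and count them. The first row of each partition has no predecessor, so its difference is NULL and the filter excludes it.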
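The five steps above can be traced in a small plain-Python sketch. It emulates what the window function would compute rather than using Spark itself; the function name `count_repeat_payments`, the helper `at`, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 600  # 10 minutes


def count_repeat_payments(rows):
    """rows: iterable of (transaction_id, merchant_id, credit_card_id, amount, ts)."""
    # 1. Grouping: bucket epoch-second timestamps by the partition key.
    partitions = defaultdict(list)
    for _txn_id, merchant_id, card_id, amount, ts in rows:
        partitions[(merchant_id, card_id, amount)].append(int(ts.timestamp()))

    repeats = 0
    for seconds in partitions.values():
        # 2. Ordering: chronological within the partition.
        seconds.sort()
        # 3. Sequential comparison: `previous` plays the role of LAG().
        previous = None
        for current in seconds:
            # 4 and 5. Time arithmetic, then filter and count.
            if previous is not None and current - previous <= WINDOW_SECONDS:
                repeats += 1
            previous = current
    return repeats


def at(hour, minute):
    """Illustrative helper: a UTC timestamp on an arbitrary sample date."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


rows = [
    (1, 101, 1, 500, at(10, 0)),
    (2, 101, 1, 500, at(10, 5)),   # repeat of 1
    (3, 101, 1, 500, at(10, 8)),   # repeat of 2
    (4, 101, 2, 500, at(10, 6)),   # different card
    (5, 101, 1, 900, at(10, 7)),   # different amount
    (6, 101, 1, 500, at(10, 30)),  # 22 minutes after 3
]
print(count_repeat_payments(rows))
```

Only transactions 2 and 3 qualify: 4 and 5 fall into other partitions, and 6 is too far from its predecessor.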

Implementation Breakdown

Problem Set

Requirement: Count "repeated" transactions defined as having the same merchant, card, and amount within 10 minutes of a previous occurrence.

Constraint: The first transaction in a sequence must not be counted.

Edge Cases:

Transactions exactly 10 minutes apart (included).

Multiple repeats in a short burst (each subsequent one counts).

Empty table or no duplicates (should return 0).

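Null values in credit_card_id or amount (assumed non-nullable based on schema). If NULLs did occur, Spark would not fail: it groups rows with a NULL key into the same window partition, so those rows would be compared with each other and could produce false repeats unless filtered out first.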

Approach

Technologies: Spark SQL (v3.x preferred).

Window Functions: LAG() to peek at the previous record's timestamp.

Window Specification: PARTITION BY merchant_id, credit_card_id, amount ORDER BY transaction_timestamp.

Time Calculation: Use unix_timestamp() or casting to LONG to calculate the delta in seconds.

Computational Cost: $O(N \log N)$ due to the shuffle and sort required by the window function. This is significantly more performant than a theta join.

Implementation
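One way the approach above could be written is sketched below. The query text is Spark SQL (a CTE plus LAG() over the window specification described earlier). To keep the example runnable without a cluster, it is executed here against Python's sqlite3 module as a stand-in engine, with `unix_timestamp` registered as a shim; that shim is an assumption made only for this demo, since Spark provides the function natively. The sample rows are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Spark SQL text; in a Spark session this string could be passed to spark.sql().
REPEAT_PAYMENTS_SQL = """
WITH ordered AS (
    SELECT
        transaction_id,
        unix_timestamp(transaction_timestamp) AS ts_seconds,
        LAG(unix_timestamp(transaction_timestamp)) OVER (
            PARTITION BY merchant_id, credit_card_id, amount
            ORDER BY transaction_timestamp
        ) AS prev_ts_seconds
    FROM transactions
)
SELECT COUNT(*) AS repeat_payment_count
FROM ordered
WHERE ts_seconds - prev_ts_seconds <= 600
"""


def _unix_timestamp(text):
    # Demo-only stand-in for Spark's unix_timestamp();
    # assumes 'YYYY-MM-DD HH:MM:SS' strings in UTC.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.create_function("unix_timestamp", 1, _unix_timestamp)
conn.execute(
    """CREATE TABLE transactions (
           transaction_id INTEGER PRIMARY KEY,
           merchant_id INTEGER,
           credit_card_id INTEGER,
           amount INTEGER,              -- cents
           transaction_timestamp TEXT)"""
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, 1, 500, "2024-01-01 10:00:00"),
        (2, 101, 1, 500, "2024-01-01 10:05:00"),   # repeat
        (3, 101, 1, 500, "2024-01-01 10:08:00"),   # repeat
        (4, 101, 2, 500, "2024-01-01 10:06:00"),   # different card
        (5, 202, 9, 1500, "2024-01-01 12:00:00"),
        (6, 202, 9, 1500, "2024-01-01 12:10:00"),  # exactly 600 s: repeat
        (7, 202, 9, 1500, "2024-01-01 12:20:01"),  # 601 s: not a repeat
    ],
)
(repeat_payment_count,) = conn.execute(REPEAT_PAYMENTS_SQL).fetchone()
print(repeat_payment_count)
```

The first row of each partition has a NULL prev_ts_seconds, so the WHERE clause drops it without a separate IS NOT NULL check, and COUNT(*) returns 0 when nothing qualifies.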

Wrap Up

Advanced Topics

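Data Skew: The window is partitioned on the composite key (merchant_id, credit_card_id, amount), so skew arises when a single combination has a very large number of rows (e.g., one card charged $1.00 at the same merchant millions of times). All rows for that key are processed by a single task, which can cause an OOM (out of memory) error or severe lag. To mitigate this, one might "salt" the partition key or use a range-based join if the skew is uncontrollable. Note that a random salt would separate neighboring rows and break LAG(), so the salt must preserve time locality (e.g., a time bucket, with pairs that straddle a bucket boundary handled separately).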

Indexing & Partitioning: In a Spark/Delta Lake environment, consider Z-ordering the data by merchant_id or credit_card_id to improve data skipping for queries that filter on those columns.

Stateful Streaming: If this requirement were for real-time detection, we would use Spark Structured Streaming with flatMapGroupsWithState to maintain the "last seen transaction" per merchant/card/amount key, emitting an alert when a new transaction arrives within 10 minutes of the stored one.
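The per-key state logic can be illustrated in plain Python. This is only a sketch of the idea, not the Spark API: the name `detect_repeats` and the sample events are hypothetical, events are assumed to arrive in timestamp order, and a real streaming job would also need to handle late data and state expiry.

```python
WINDOW_SECONDS = 600  # 10 minutes


def detect_repeats(events):
    """Yield transaction_ids that arrive within 10 minutes of the last seen
    transaction for the same (merchant_id, credit_card_id, amount) key.

    Assumes events arrive in timestamp order; ts is epoch seconds.
    """
    last_seen = {}  # per-key state: timestamp of the most recent transaction
    for txn_id, merchant_id, card_id, amount, ts in events:
        key = (merchant_id, card_id, amount)
        previous = last_seen.get(key)
        if previous is not None and ts - previous <= WINDOW_SECONDS:
            yield txn_id  # alert
        last_seen[key] = ts  # always advance the state to the newest event


events = [
    (1, 101, 1, 500, 0),
    (2, 101, 1, 500, 300),   # 300 s after 1: alert
    (3, 101, 1, 500, 480),   # 180 s after 2: alert
    (4, 101, 1, 500, 2000),  # 1520 s after 3: no alert
]
alerts = list(detect_repeats(events))
print(alerts)
```

Keeping only the latest timestamp per key is enough because each transaction is compared with its immediate predecessor, which mirrors the LAG() semantics of the batch query.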