I need to do a vertical check on my dataset in PySpark to flag only the rows that match a condition.
In detail: I only need to flag the rows where a "PURCHASE + Seller" is preceded by a "SALE + Customer" (the rows with flag = 1 in the example output below).
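To make the rule concrete, here is a minimal plain-Python sketch (not Spark) of what I mean. It assumes that "preceded" means an earlier date within the same `id`, that dates are unique per `id`, and that the difference is counted in days from the most recent earlier "SALE + Customer" row. The function name `flag_rows` and the tuple layout are only illustrative.

```python
from datetime import date

def flag_rows(rows):
    """rows: list of (id, order_type, initiative, date) tuples.

    Returns a list of (flag, difference_in_days) aligned with the input order.
    """
    last_sale = {}  # id -> date of the latest SALE + Customer seen so far
    out = {}
    # Walk the rows of each id from the oldest date to the newest.
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][3]))
    for idx in order:
        rid, order_type, initiative, d = rows[idx]
        is_purchase = order_type == "PURCHASE" and initiative == "Seller"
        if is_purchase and rid in last_sale:
            # Assumption: difference = days since the latest earlier sale.
            out[idx] = (1, (d - last_sale[rid]).days)
        else:
            out[idx] = (0, None)
        if order_type == "SALE" and initiative == "Customer":
            last_sale[rid] = d
    return [out[i] for i in range(len(rows))]

sample = [
    (1, "PURCHASE", "Seller", date(2022, 2, 5)),
    (1, "SALE", "Customer", date(2022, 2, 4)),
    (1, "PURCHASE", "Seller", date(2022, 2, 3)),
]
print(flag_rows(sample))  # [(1, 1), (0, None), (0, None)]
```

The oldest purchase gets flag 0 because no sale comes before it, which is the case I want handled carefully.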
Example:

Input:

| id | order_type | Initiative | date |
|---|---|---|---|
| 1 | PURCHASE | Seller | 2022-02-11 |
| 1 | PURCHASE | Seller | 2022-02-10 |
| 1 | PURCHASE | Seller | 2022-02-09 |
| 1 | SALE | Customer | 2022-02-08 |
| 1 | SALE | Customer | 2022-02-07 |
| 1 | SALE | Customer | 2022-02-06 |
| 1 | PURCHASE | Seller | 2022-02-05 |
| 1 | SALE | Customer | 2022-02-04 |
| 1 | PURCHASE | Seller | 2022-02-03 (pay attention to this row) |
| 2 | PURCHASE | Customer | 2022-02-11 |

Output:

| id | order_type | Initiative | date | flag | difference (in days) |
|---|---|---|---|---|---|
| 1 | PURCHASE | Seller | 2022-02-11 | 1 | 3 |
| 1 | PURCHASE | Seller | 2022-02-10 | 1 | 2 |
| 1 | PURCHASE | Seller | 2022-02-09 | 1 | 1 |
| 1 | SALE | Customer | 2022-02-08 | 0 | |
| 1 | SALE | Customer | 2022-02-07 | 0 | |
| 1 | SALE | Customer | 2022-02-06 | 0 | |
| 1 | PURCHASE | Seller | 2022-02-05 | 1 | 1 |
| 1 | SALE | Customer | 2022-02-04 | 0 | |
| 1 | PURCHASE | Seller | 2022-02-03 | 0 (condition is not satisfied) | |
| 2 | PURCHASE | Customer | 2022-02-11 | 0 | |
Here's my implementation:
OUTPUTS:
final output: