Groupby Rolling counts filtered by condition

I have a dataset with a list of transactions in this format:

transaction_ID  card_number  transaction_datetime  amount  store
1               123          2023-06-24 12:20:24   100.0   A
2               456          2023-08-27 23:12:00   250.0   B
3               123          2023-09-02 09:00:03   416.12  A
4               123          2023-09-02 10:30:03   6580.0  C

After some research online, I created a function that counts, for each transaction, how many transactions that card made within a given time window, for example the last hour, the last 3 days, or the last 6 months:

def rolling_count(df, freq):
    return (df.set_index("transaction_datetime")
              .groupby("card_number")["card_number"]
              .rolling(freq, closed="left")
              .count()
              .fillna(0)
              .values)
I then use the function like this:

df["number_transactions_lastday"] = rolling_count(df, "1D")

I now need to create other features that take the store into account as well: instead of counting all past transactions made with that card, I want to count only the ones made at the same store.

I have seen many examples online of adding conditions to this type of operation, but none of the solutions works in my case.

How can I add a new column to my dataframe with rolling counts that only consider transactions made at the same store?

Example:

Input:

df["number_tr_store_last6m"] = rolling_count_store(df, "180D") # so 6 months

Expected output table:

transaction_ID  card_number  transaction_datetime  amount  store  number_tr_store_last6m
1               123          2023-06-24 12:20:24   100.0   A      0
2               456          2023-08-27 23:12:00   250.0   B      0
3               123          2023-09-02 09:00:03   416.12  A      1
4               123          2023-09-02 10:30:03   6580.0  C      0

My dataset is quite large, so the code needs to be as optimized as possible.

There are 2 best solutions below

Karol Oleksy

See the solution below: you can use the pandas.DataFrame.apply function, which lets you pass extra arguments to the calculation.

import pandas as pd
from datetime import timedelta


data = {
    "transaction_ID": [1, 2, 3, 4],
    "card_number": [123, 456, 123, 123],
    "transaction_datetime": [
        "2023-06-24 12:20:24",
        "2023-08-27 23:12:00",
        "2023-09-02 09:00:03",
        "2023-09-02 10:30:03",
    ],
    "amount": [100.0, 250.0, 416.12, 6580.0],
    "store": ["A", "B", "A", "C"],
}

df = pd.DataFrame(data)
df["transaction_datetime"] = pd.to_datetime(
    df["transaction_datetime"], format="%Y/%m/%d %H:%M:%S"
)


def calculate_num_tr_store(row, days, df):
    # Count earlier transactions within the last `days` days that were made
    # with the same card at the same store as the current row.
    start_date = row["transaction_datetime"] - timedelta(days=days)
    return len(
        df[
            (df["transaction_datetime"] > start_date)
            & (df["transaction_datetime"] < row["transaction_datetime"])
            & (df["card_number"] == row["card_number"])
            & (df["store"] == row["store"])
        ].index
    )


df["number_tr_store_last6m"] = df.apply(calculate_num_tr_store, days=180, df=df, axis=1)
print(df)

Output:

transaction_ID  card_number  ... store  number_tr_store_last6m
0               1          123  ...     A                       0
1               2          456  ...     B                       0
2               3          123  ...     A                       1
3               4          123  ...     C                       0

[4 rows x 6 columns]
shadowpaprika

I managed to do it by changing my original function a bit:

def rolling_count(df, freq):
    result = (df.set_index("transaction_datetime")
                .groupby(["card_number", "store"])["store"]
                .rolling(freq, closed="left")
                .count()
                .fillna(0))
    result.index = result.index.droplevel(1)
    result = result.sort_index()
    return result

It should be easy to adapt in case anyone needs to use it for their own case. My data has millions of rows and it takes around 3 minutes to run.
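
In case it helps, here is a minimal usage sketch showing one way to assign the result back to the dataframe (rolling_count is the adapted function above; the sample data matches the question). It assumes the dataframe is sorted by card_number and transaction_datetime, so that the values returned by rolling_count line up positionally with the rows:

import pandas as pd

# Sample data in the same shape as the question.
df = pd.DataFrame(
    {
        "transaction_ID": [1, 2, 3, 4],
        "card_number": [123, 456, 123, 123],
        "transaction_datetime": pd.to_datetime(
            [
                "2023-06-24 12:20:24",
                "2023-08-27 23:12:00",
                "2023-09-02 09:00:03",
                "2023-09-02 10:30:03",
            ]
        ),
        "amount": [100.0, 250.0, 416.12, 6580.0],
        "store": ["A", "B", "A", "C"],
    }
)

# Sort so the row order matches the (card_number, transaction_datetime) order
# of the Series returned by rolling_count, then assign the counts positionally.
df = df.sort_values(["card_number", "transaction_datetime"]).reset_index(drop=True)
df["number_tr_store_last6m"] = rolling_count(df, "180D").values
print(df[["transaction_ID", "card_number", "store", "number_tr_store_last6m"]])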