Groupby Rolling counts filtered by condition

I have a dataset with a list of transactions in this format:

transaction_ID  card_number  transaction_datetime  amount  store
1               123          2023-06-24 12:20:24   100.0   A
2               456          2023-08-27 23:12:00   250.0   B
3               123          2023-09-02 09:00:03   416.12  A
4               123          2023-09-02 10:30:03   6580.0  C

After some research online, I created a function that counts, for each transaction, how many transactions that card made within a given time window, for example the last hour, the last 3 days, or the last 6 months:

def rolling_count(df, freq):
    return (df.set_index("transaction_datetime")
              .groupby("card_number")["card_number"]
              .rolling(freq, closed="left")
              .count()
              .fillna(0)
              .values)
I then use the function like this:

df["number_transactions_lastday"] = rolling_count(df, "1D")

I now need to create other features that take the store into account as well: instead of counting all past transactions made with that card, I want to count only the ones made at the same store.

I have seen many examples online of adding conditions to this type of operation, but none of the solutions works in my case.

How can I add a new column to my dataframe with rolling counts that only consider transactions made at the same store?

Example:

Input:

df["number_tr_store_last6m"] = rolling_count_store(df, "180D") # so 6 months

Expected output table:

transaction_ID  card_number  transaction_datetime  amount  store  number_tr_store_last6m
1               123          2023-06-24 12:20:24   100.0   A      0
2               456          2023-08-27 23:12:00   250.0   B      0
3               123          2023-09-02 09:00:03   416.12  A      1
4               123          2023-09-02 10:30:03   6580.0  C      0

My dataset is quite large, so the code needs to be as optimized as possible.

There are 2 best solutions below

Karol Oleksy

See the solution below: you can use the pandas.DataFrame.apply function, which lets you pass extra arguments to the calculation.

import pandas as pd
from datetime import timedelta


data = {
    "transaction_ID": [1, 2, 3, 4],
    "card_number": [123, 456, 123, 123],
    "transaction_datetime": [
        "2023-06-24 12:20:24",
        "2023-08-27 23:12:00",
        "2023-09-02 09:00:03",
        "2023-09-02 10:30:03",
    ],
    "amount": [100.0, 250.0, 416.12, 6580.0],
    "store": ["A", "B", "A", "C"],
}

df = pd.DataFrame(data)
df["transaction_datetime"] = pd.to_datetime(
    df["transaction_datetime"], format="%Y/%m/%d %H:%M:%S"
)


def calculate_num_tr_store(row, days, df):
    # Count earlier transactions within the last `days` days that were made
    # with the same card at the same store as the current row.
    start_date = row["transaction_datetime"] - timedelta(days=days)
    return len(
        df[
            (df["transaction_datetime"] > start_date)
            & (df["transaction_datetime"] < row["transaction_datetime"])
            & (df["card_number"] == row["card_number"])
            & (df["store"] == row["store"])
        ].index
    )


df["number_tr_store_last6m"] = df.apply(calculate_num_tr_store, days=180, df=df, axis=1)
print(df)

Output:

transaction_ID  card_number  ... store  number_tr_store_last6m
0               1          123  ...     A                       0
1               2          456  ...     B                       0
2               3          123  ...     A                       1
3               4          123  ...     C                       0

[4 rows x 6 columns]
shadowpaprika

I managed to do it by changing my original function a bit:

def rolling_count(df, freq):
    result = (df.set_index("transaction_datetime")
                .groupby(["card_number", "store"])["store"]
                .rolling(freq, closed="left")
                .count()
                .fillna(0))
    result.index = result.index.droplevel(1)
    result = result.sort_index()
    return result

It should be easy to adapt in case anyone needs to use it for their own case. My data has millions of rows and it takes around 3 minutes to run.
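
In case it helps, here is a minimal usage sketch showing one way to assign the result back to the dataframe (rolling_count is the adapted function above; the sample data matches the question). It assumes the dataframe is sorted by card_number and transaction_datetime, so that the values returned by rolling_count line up positionally with the rows:

import pandas as pd

# Sample data in the same shape as the question.
df = pd.DataFrame(
    {
        "transaction_ID": [1, 2, 3, 4],
        "card_number": [123, 456, 123, 123],
        "transaction_datetime": pd.to_datetime(
            [
                "2023-06-24 12:20:24",
                "2023-08-27 23:12:00",
                "2023-09-02 09:00:03",
                "2023-09-02 10:30:03",
            ]
        ),
        "amount": [100.0, 250.0, 416.12, 6580.0],
        "store": ["A", "B", "A", "C"],
    }
)

# Sort so the row order matches the (card_number, transaction_datetime) order
# of the Series returned by rolling_count, then assign the counts positionally.
df = df.sort_values(["card_number", "transaction_datetime"]).reset_index(drop=True)
df["number_tr_store_last6m"] = rolling_count(df, "180D").values
print(df[["transaction_ID", "card_number", "store", "number_tr_store_last6m"]])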