Having trouble using rolling on a dataframe without using apply, which is slow

176 Views Asked by At

I have a dataframe that looks as follows:

ID        Date        Prize     IfWon
1         01-01-20      5         1
2         01-01-20      8         1
1         01-03-20      3         0
1         01-04-20      10        1
1         01-07-20      5         0
2         01-10-20      5         1
3         01-10-20      10        1

I want to add a new column that, for a given ID will include the sum of any prizes they won in the 7 days prior to that date and NOT including that date. The goal is to have a dataframe that looks like this:

ID        Date        Prize     IfWon    PrevWon
1         01-01-20      5         1         0
2         01-01-20      8         1         0
1         01-03-20      3         0         5
1         01-04-20      10        1         5
1         01-07-20      5         0         15
2         01-10-20      5         1         0
3         01-10-20      10        1         0

The code I have to do it is below, and it works, but I'm having two issues with it:

def get_rolling_prize_sum(grp, freq):
    return grp.rolling(freq, on = 'Date', closed = 'right')['CurrentWon'].sum()

processed_data_df['CurrentWon'] = processed_data_df['Prize'] * processed_data_df['IfWon'] # gets deleted later
processed_data_df['PrevWon'] = processed_data_df.groupby('ID', group_keys=False).apply(get_rolling_prize_sum, '7D').astype(float) - processed_data_df['CurrentWon']
  1. Because I don't want to include the prize from the current day, I was trying to close the rolling on the right side, however this isn't working (e.g. taking out closed = 'right' above will do the exact same thing). As a result of this, I end up doing the subtraction in the last line.
  2. The actual DB I'm using is huge and I need to do many of these rolling sums at various points, but it's stupidly slow. I was told that I can do this without .apply, using .rolling directly, but I'm unable to get that to work properly. My attempt is below, with error, and I'll note that the error took several minutes to produce with this being the only significant computation, so it seems as if it's doing part of it and then failing later:
# Not using closed right here, just subtracting
processed_data_df['PrevWon'] = processed_data_df.groupby('ID', group_keys=False).rolling('7D', on = 'Date')['CurrentWon'].sum() - processed_data_df['CurrentWon']

ValueError: cannot join with no overlapping index names

Any ideas?

1

There are 1 best solutions below

2
On BEST ANSWER

Improved previous answer and managed to tackle sorting of groupby

df = pd.read_csv("data.csv")
df["Date"] = pd.to_datetime(df['Date'], format='%m-%d-%y')
df["CurrentWon"] = df["Prize"] * df["IfWon"]

result = df.groupby("ID").rolling("7D", on = 'Date', closed = 'right').CurrentWon.sum().reset_index()
result.rename(columns={"CurrentWon": "PreviousWon"}, inplace=True)
df = df.merge(result, on=["ID", "Date"])
df["PreviousWon"] -= df["CurrentWon"]
print(df)

Outputs:

   ID       Date  Prize  IfWon  CurrentWon  PreviousWon
0   1 2020-01-01      5      1           5          0.0
1   2 2020-01-01      8      1           8          0.0
2   1 2020-01-03      3      0           0          5.0
3   1 2020-01-04     10      1          10          5.0
4   1 2020-01-07      5      0           0         15.0
5   2 2020-01-10      5      1           5          0.0
6   3 2020-01-10     10      1          10          0.0