How does the 'on' parameter in pandas.Rolling work?

62 Views Asked by At

I'm trying to apply an aggregation function to a DataFrame based on a periodic column, for example a date :

df = pd.DataFrame({"values": [3,2,1,1000,1,3,1],"day":[1,2,3,1,2,3,1]})
df
   values  day
0     3    1
1     2    2
2     1    3
3  1000    1
4     1    2
5     3    3
6     1    1

I'd like to find the max of 2 consecutive "day 1", "day 2" and so on. I got the result I wanted using this line, which first sorts the values according to the day column:

df.sort_values("day").rolling(2).sum().sort_index()

I thought this was the behavior of the "on" parameter in te DataFrame.rolling method, since according to the pandas documentation:

on : str, optional

For a DataFrame, a column label or Index level on which to calculate the rolling window, rather than the DataFrame’s index.

df.rolling(2,on="day").max()

However, when I try that, the only effect it has is to ignore the column in the calculation, but there is no re-sorting.

   values  day
0     NaN    1
1     3.0    2
2     2.0    3
3  1000.0    1
4  1000.0    2
5     3.0    3
6     3.0    1

Why does sorting the DataFrame work ? When calculating the rolling window, I thought it would only take into account the value of the index (unchanged after sorting) and not its location in the dataframe.

1

There are 1 best solutions below

0
Oliver Longhurst On BEST ANSWER

To answer your implicit first question: Why was the effect it had to ignore the column in the calculation?

Pandas docs:

on : str, optional ... Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.

I suspect day is an integer column, and thus was ignored.

For your second question: Why does sorting work, is that not how the on parameter is supposed to behave?

the on parameter allows you to specify a column, and Pandas assumes it is already sorted.

What is the on parameter meant for?

Pandas uses the index by default, but if, for example, you have a DataFrame with a datetime column different from the index and you want to calculate a 7-day rolling average based on dates in that column, you can specify this column with the on parameter.

That is the gist