I'm trying to apply an aggregation function to a DataFrame based on a periodic column, for example a date :
df = pd.DataFrame({"values": [3,2,1,1000,1,3,1],"day":[1,2,3,1,2,3,1]})
df
values day
0 3 1
1 2 2
2 1 3
3 1000 1
4 1 2
5 3 3
6 1 1
I'd like to find the max of 2 consecutive "day 1", "day 2" and so on. I got the result I wanted using this line, which first sorts the values according to the day column:
df.sort_values("day").rolling(2).sum().sort_index()
I thought this was the behavior of the "on" parameter in te DataFrame.rolling method, since according to the pandas documentation:
on : str, optional
For a DataFrame, a column label or Index level on which to calculate the rolling window, rather than the DataFrame’s index.
df.rolling(2,on="day").max()
However, when I try that, the only effect it has is to ignore the column in the calculation, but there is no re-sorting.
values day
0 NaN 1
1 3.0 2
2 2.0 3
3 1000.0 1
4 1000.0 2
5 3.0 3
6 3.0 1
Why does sorting the DataFrame work ? When calculating the rolling window, I thought it would only take into account the value of the index (unchanged after sorting) and not its location in the dataframe.
To answer your implicit first question: Why was the effect it had to ignore the column in the calculation?
Pandas docs:
I suspect day is an integer column, and thus was ignored.
For your second question: Why does sorting work, is that not how the on parameter is supposed to behave?
the on parameter allows you to specify a column, and Pandas assumes it is already sorted.
What is the on parameter meant for?
Pandas uses the index by default, but if, for example, you have a DataFrame with a datetime column different from the index and you want to calculate a 7-day rolling average based on dates in that column, you can specify this column with the on parameter.
That is the gist