Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

595 Views Asked by At

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):

                    user  value  total
2020-01-01 09:00:00  fred      1      1
2020-01-01 09:05:00  fred     13      1
2020-01-01 09:15:00  fred     27      3
2020-01-01 09:30:00  fred     40     12
2020-01-01 09:35:00  fred     15     12
2020-01-01 10:00:00  fred     19     16

I want to fill in the missing times using different methods for each column to fill missing data. For user and total, I want to to a forward fill, while for value I want to fill in with zeroes.

One approach I found was to resample, and then fill in the missing data after the fact:

resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)

Which gives correct expected output:

                     user  value  total
2020-01-01 09:00:00  fred    1.0    1.0
2020-01-01 09:05:00  fred   13.0    1.0
2020-01-01 09:10:00  fred    0.0    1.0
2020-01-01 09:15:00  fred   27.0    3.0
2020-01-01 09:20:00  fred    0.0    3.0
2020-01-01 09:25:00  fred    0.0    3.0
2020-01-01 09:30:00  fred   40.0   12.0
2020-01-01 09:35:00  fred   15.0   12.0
2020-01-01 09:40:00  fred    0.0   12.0
2020-01-01 09:45:00  fred    0.0   12.0
2020-01-01 09:50:00  fred    0.0   12.0
2020-01-01 09:55:00  fred    0.0   12.0
2020-01-01 10:00:00  fred   19.0   16.0

I thought one would be able to use agg to specify what to do by column. I try to do the following:

resampled = df.resample('5T').agg({'user':'ffill',
                                   'value':'sum',
                                   'total':'ffill'})

I find this to be more clear and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:

                     user  value  total
2020-01-01 09:00:00  fred      1    1.0
2020-01-01 09:05:00  fred     13    1.0
2020-01-01 09:10:00   NaN      0    NaN
2020-01-01 09:15:00  fred     27    3.0
2020-01-01 09:20:00   NaN      0    NaN
2020-01-01 09:25:00   NaN      0    NaN
2020-01-01 09:30:00  fred     40   12.0
2020-01-01 09:35:00  fred     15   12.0
2020-01-01 09:40:00   NaN      0    NaN
2020-01-01 09:45:00   NaN      0    NaN
2020-01-01 09:50:00   NaN      0    NaN
2020-01-01 09:55:00   NaN      0    NaN
2020-01-01 10:00:00  fred     19   16.0

Can someone explain this output, and if there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, but if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here as it would do so for the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:

resampled = pd.DataFrame()

d = {'user':'ffill',
     'value':'sum',
     'total':'ffill'}

for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)

This works, but feels silly given that it adds extra iteration and uses the dictionary I am trying to pass to agg! I have looked a few posts on agg and apply but can't seem to explain what is happening here:

I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.


Example data:

import pandas as pd

dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)

df = pd.DataFrame({'user':['fred']*len(dates),
                   'value':[1,13,27,40,15,19],
                   'total':[1,1,3,12,12,16]},
                  index=dates)
0

There are 0 best solutions below