Pandas rolling-window statistics on input data with unevenly spaced timestamps


quick background:

  • this relates to pandas rolling, resample, asfreq, and fillna
  • I'm processing time-series data, so I want to pass pandas offset aliases (e.g., '1T', '5min') as inputs to these methods.
  • I also apply a forward-looking window for the rolling method, which is why the code reverses the dataframe.

what I want to achieve:

  • basically, rolling-window statistics with resampling, as shown below: a 5-minute forward-looking window (e.g., the window for 07:00 covers 07:00-07:05) with statistics (min, max, etc.) computed at 1-minute increments.
  • input data shown below
ts                          var1
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN
  • what I want
ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   3.6         0.9
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   2.3         1.3
2022-07-12 07:08:00+00:00   2.3         1.3
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   4.3         4.1
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN
  • so far, I'm getting
ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   NaN         NaN
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   NaN         NaN
2022-07-12 07:08:00+00:00   NaN         NaN
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   NaN         NaN
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN

key issues:

  • as you can see from my current output, I can't correctly fill the rows that come out as NaN
  • those NaN rows should instead be filled by the rolling-window calculation over the raw data
  • I tried resampling (or asfreq) before the rolling-window calculation, but that loses valuable information whenever two values fall within the same 1-minute frame. For example, the input has both 3.6 and 0.9 between 07:04 and 07:05, so resampling first cannot capture both the min (0.9) and the max (3.6) in the end result.
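To make that loss concrete, here is a minimal sketch (using just the two samples from the 07:04 minute) showing that a single-statistic resample collapses the two values into one, while aggregating min and max separately keeps both extremes:

```python
import pandas as pd

# Two raw samples fall in the same 1-minute bucket (07:04): 0.9 and 3.6.
idx = pd.to_datetime(["2022-07-12 07:04:02", "2022-07-12 07:04:59"], utc=True)
s = pd.Series([0.9, 3.6], index=idx)

# A single-statistic resample keeps one value per bucket, so either the
# min or the max is gone before the rolling step even starts.
lost = s.resample("1min").mean()               # 07:04 -> 2.25
# Aggregating each statistic separately preserves both extremes.
kept = s.resample("1min").agg(["min", "max"])  # 07:04 -> 0.9 / 3.6
print(lost.iloc[0], kept["min"].iloc[0], kept["max"].iloc[0])
```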

example code:

#########################################################################
# input data
import numpy as np
import pandas as pd

data_test = [
        [
            "01:00:00",
            "01:01:53",
            "01:03:17",
            "01:04:02",
            "01:04:59",
            "01:05:00",
            "01:06:22",
            "01:09:46",
            "01:10:00",
            "01:11:22",
            "01:13:44",
            "01:14:26",
            "01:15:00"
        ],
        [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]
    ]

data_test = pd.DataFrame(data_test).T
data_test.columns = ['ts', 'var1']
data_test['var1'] = data_test['var1'].astype(float)
data_test['ts'] = pd.to_datetime(data_test.ts)
data_test = data_test.set_index('ts')
    
#########################################################################
# reversing for forward looking window operation
data_test = data_test[::-1]
    
#########################################################################
# adding rolling window statistics: minimum
mins = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).min().add_suffix("_min")
)

# adding rolling window statistics: maximum
maxs = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).max().add_suffix("_max")
)
    
#########################################################################
# resampling with even interval
mins = mins.resample(rule="1min").min()
maxs = maxs.resample(rule="1min").max()

data_test = pd.concat([mins, maxs], axis=1)   
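The gaps can be reproduced in isolation: a minute with no raw sample still has no row after the rolling step, so the subsequent resample has nothing to aggregate there and emits NaN. A minimal sketch (assuming, as the code above does, a pandas version that supports time-based rolling over a reversed datetime index):

```python
import pandas as pd

# No raw sample falls in minute 07:02, so after rolling on the raw
# timestamps there is still no row in that minute; the later resample
# then emits NaN for the 07:02 bucket.
idx = pd.to_datetime(["2022-07-12 07:01:53", "2022-07-12 07:03:17"], utc=True)
s = pd.Series([1.5, 2.2], index=idx)

# forward-looking rolling min via the reverse trick, then resampling
rolled = s[::-1].rolling("5min", min_periods=1).min()[::-1]
per_min = rolled.resample("1min").min()
print(per_min)  # the 07:02 bucket is NaN: it never had a row to aggregate
```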
    

There are 2 answers below.

Answer by BeRT2me:

Given the following, where I've already converted ts to datetime and made it the index:

                           var1
ts
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN

Doing:

df['var1_min'] = df.resample('5min')['var1'].transform('min')
df['var1_max'] = df.resample('5min')['var1'].transform('max')
df = df.asfreq('1min', method='ffill')

Output:

                           var1  var1_min  var1_max
2022-07-12 07:00:00+00:00   NaN       0.9       3.6
2022-07-12 07:01:00+00:00   NaN       0.9       3.6
2022-07-12 07:02:00+00:00   1.5       0.9       3.6
2022-07-12 07:03:00+00:00   1.5       0.9       3.6
2022-07-12 07:04:00+00:00   2.2       0.9       3.6
2022-07-12 07:05:00+00:00   NaN       2.3       3.3
2022-07-12 07:06:00+00:00   NaN       2.3       3.3
2022-07-12 07:07:00+00:00   3.3       2.3       3.3
2022-07-12 07:08:00+00:00   3.3       2.3       3.3
2022-07-12 07:09:00+00:00   3.3       2.3       3.3
2022-07-12 07:10:00+00:00   NaN       1.3       4.3
2022-07-12 07:11:00+00:00   NaN       1.3       4.3
2022-07-12 07:12:00+00:00   1.3       1.3       4.3
2022-07-12 07:13:00+00:00   1.3       1.3       4.3
2022-07-12 07:14:00+00:00   4.3       1.3       4.3
2022-07-12 07:15:00+00:00   NaN       NaN       NaN
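For completeness, here is a self-contained version of this answer (input construction included). Note that resample('5min').transform computes statistics per fixed, tumbling 5-minute bucket and then forward-fills them onto the 1-minute grid, which is why its numbers differ from the sliding-window output the question asks for:

```python
import numpy as np
import pandas as pd

# the question's sample data
ts = pd.to_datetime(
    ["2022-07-12 07:00:00", "2022-07-12 07:01:53", "2022-07-12 07:03:17",
     "2022-07-12 07:04:02", "2022-07-12 07:04:59", "2022-07-12 07:05:00",
     "2022-07-12 07:06:22", "2022-07-12 07:09:46", "2022-07-12 07:10:00",
     "2022-07-12 07:11:22", "2022-07-12 07:13:44", "2022-07-12 07:14:26",
     "2022-07-12 07:15:00"], utc=True)
df = pd.DataFrame({"var1": [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3,
                            2.3, np.nan, 1.3, 4.3, 4.1, np.nan]}, index=ts)

# per-bucket (tumbling) statistics, broadcast back to every raw row ...
df["var1_min"] = df.resample("5min")["var1"].transform("min")
df["var1_max"] = df.resample("5min")["var1"].transform("max")
# ... then snapped onto an even 1-minute grid
df = df.asfreq("1min", method="ffill")
print(df)
```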
Answer by geekshock:

I think I found a solution that generates the output I want, though I'm not sure it's the most elegant way. In short, it (1) uses resample with the agg method to resample the data once per statistic (min, max, sum, count), and (2) applies the rolling method to each resampled statistic to get the rolling-window values. The rolling mean is computed by dividing the rolling sum by the rolling count.

#########################################################################
# resampling with agg method to retain statistics within time increment
data_resample_min = data_test.resample(rule='1min').agg(['min'])
data_resample_max = data_test.resample(rule='1min').agg(['max'])
data_resample_sum = data_test.resample(rule='1min').agg(['sum'])
data_resample_count = data_test.resample(rule='1min').agg(['count'])

#########################################################################
# reversing dataframe for forward-looking rolling window operation
data_resample_min = data_resample_min[::-1]
data_resample_max = data_resample_max[::-1]
data_resample_sum = data_resample_sum[::-1]
data_resample_count = data_resample_count[::-1]

#########################################################################
# merging multiindex column header
data_resample_min.columns = data_resample_min.columns.map('_'.join)
data_resample_max.columns = data_resample_max.columns.map('_'.join)
data_resample_sum.columns = data_resample_sum.columns.map('_'.join)
data_resample_count.columns = data_resample_count.columns.map('_'.join)
    
#########################################################################   
# adding rolling window statistics: minimum
mins = (
    data_resample_min.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).min()
)
# adding rolling window statistics: maximum
maxs = (
    data_resample_max.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).max()
)
# adding rolling window statistics: sum
sums = (
    data_resample_sum.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum()
)
# adding rolling window statistics: count
counts = (
    data_resample_count.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum() # this has to be sum for proper count calculation
)
# adding rolling window statistics: mean
means = sums.copy()
means.columns = means.columns.str.replace("_sum","_mean")
# suppress the invalid-division warning (0/0 for empty windows)
np.seterr(invalid='ignore')
means.loc[:, :] = sums.values/counts.values

#########################################################################
# creating summary dataframe
data_test = pd.concat([mins, maxs, means], axis=1)  
data_test = data_test[::-1]
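The steps above can be collapsed into a single helper. This is only a sketch of the approach in this answer, not a canonical pandas recipe, and it makes the same assumption the code above does: a pandas version that supports time-based rolling over a monotonically decreasing index.

```python
import numpy as np
import pandas as pd

def forward_rolling_stats(df, freq="1min", window="5min"):
    """Forward-looking `window` statistics at `freq` steps for uneven data."""
    # 1) Keep min, max, sum, and count per bucket so nothing is lost when
    #    several raw samples share one minute.
    agg = df.resample(freq).agg(["min", "max", "sum", "count"])
    agg.columns = agg.columns.map("_".join)

    # 2) Reverse so a trailing time window looks forward in original time.
    rev = agg[::-1]
    out = pd.DataFrame(index=rev.index)
    for col in df.columns:
        r = lambda name: rev[f"{col}_{name}"].rolling(
            window, min_periods=1, closed="right")
        out[f"{col}_min"] = r("min").min()
        out[f"{col}_max"] = r("max").max()
        with np.errstate(invalid="ignore"):
            # rolling mean = rolling sum / rolling count (0/0 -> NaN)
            out[f"{col}_mean"] = r("sum").sum() / r("count").sum()

    # 3) Restore chronological order.
    return out[::-1]

# the question's sample data
ts = pd.to_datetime(
    ["2022-07-12 07:00:00", "2022-07-12 07:01:53", "2022-07-12 07:03:17",
     "2022-07-12 07:04:02", "2022-07-12 07:04:59", "2022-07-12 07:05:00",
     "2022-07-12 07:06:22", "2022-07-12 07:09:46", "2022-07-12 07:10:00",
     "2022-07-12 07:11:22", "2022-07-12 07:13:44", "2022-07-12 07:14:26",
     "2022-07-12 07:15:00"], utc=True)
df = pd.DataFrame({"var1": [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3,
                            2.3, np.nan, 1.3, 4.3, 4.1, np.nan]}, index=ts)
res = forward_rolling_stats(df)
print(res[["var1_min", "var1_max"]])
```

This generalizes to any number of value columns and avoids carrying four separate intermediate dataframes through the pipeline.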