dplyr dynamically create lag and ma features


I am trying to create a process that takes in a dataframe and creates additional lagged and rolling window features (e.g. moving average). This is what I have so far.

library(dplyr)
library(zoo)  # provides rollmean()

# dummy dataframe
n <- 20
set.seed(123)
foo <- data.frame(
  date = seq(as.Date('2020-01-01'),length.out = n, by = 'day'),
  var1 = sample.int(n),
  var2 = sample.int(n))

# creates lags and based on (some of) them creates rolling average features
foo %>% 
  mutate_at(vars(starts_with('var')),
            list(lag_1 = ~lag(.), lag_2 = ~lag(., 2))) %>% 
  mutate_at(vars(contains('lag_1')),
            list(ra_3 = ~rollmean(., k = 3, align = 'right', fill = NA)))

The chunk above:

  1. creates lag_1 and lag_2 features for the selected variables
  2. creates rolling average features from a subset of the newly created columns

What I am looking for now is a way to create an arbitrary number of lagged features (e.g. lag3, lag6, lag9 and so on) as well as an arbitrary number of rolling average features with different window lengths (i.e. var1_lag_1_ra_3, var1_lag_1_ra_6, var2_lag_1_ra_3, var2_lag_1_ra_6). At the moment the settings that generate these features are hardcoded. Ideally I would have a couple of vectors to adjust the outcome, like so:

lag_features <- c(3,6,9)
ma_features <- c(12,15)

Lastly, it would be nice if the names of the generated features could be configured dynamically. I've seen the {{}}, !! and := operators, but I can't really tell the difference between them or how to use them.

I have also implemented the above using some readily available functions from the timetk package, but since I am looking for some additional flexibility, I was wondering how I could replicate such behavior myself.

library(timetk)
foo %>% 
  select(date,starts_with('var')) %>%
  tk_augment_lags(.value = starts_with("var"),
                  .lags = 1) %>% 
  tk_augment_slidify(.value   = ends_with("lag1"),
                     .period  = seq(0,24,3)[-1],
                     .f       = mean,
                     .align   = 'right', 
                     .partial = TRUE
  )

Any support would be really appreciated.


There is 1 best solution below


You can use purrr's map_dfc() to build the lagged columns for an arbitrary number of lags, and the .names argument of across() to name the new columns.

library(dplyr)
library(purrr)
library(zoo)

lag_features <- c(3,6,9)
ma_features <- c(12,15)

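# Step 1: one lagged copy of every var* column per entry in lag_features
foo <- bind_cols(foo, map_dfc(lag_features, function(n) foo %>% 
                         transmute(across(starts_with('var'), 
                                          ~ lag(.x, n), .names = '{.col}_lag{n}'))))

# Step 2: run after step 1 so the lag3 columns already exist in foo.
# rollapplyr() replaces rollmeanr() here because zoo's documentation notes
# that rollmean() does not handle inputs containing NA (lags start with NA).
foo <- bind_cols(foo, map_dfc(ma_features, function(k) foo %>%
                        transmute(across(contains('lag3'), 
                             ~ rollapplyr(.x, k, mean, fill = NA), .names = '{.col}_{k}'))))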
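
On the naming operators from the question: := lets the name on the left-hand side of a mutate() argument be computed at run time, !! injects the value of a variable (here a string holding the new column name) instead of treating it as a literal name, and {{ }} is meant for forwarding a bare column name that was passed as a function argument, so it is not needed when the column names are already strings. Below is a minimal sketch of a reusable wrapper built on that idea. The function add_lag_ma_features() and its arguments are hypothetical names made up for this illustration, and the "_lag_" / "_ra_" naming pattern is just one possible choice.

```r
library(dplyr)
library(zoo)

# Hypothetical helper (illustration only): adds one lag column per (column, lag)
# pair, then one right-aligned rolling mean per (lag column, window) pair.
add_lag_ma_features <- function(df, cols, lags, windows) {
  lag_cols <- character(0)

  for (col in cols) {
    for (n in lags) {
      new_name <- paste0(col, "_lag_", n)
      # !!new_name := ... creates a column whose name is the string in new_name
      df <- df %>% mutate(!!new_name := dplyr::lag(.data[[col]], n))
      lag_cols <- c(lag_cols, new_name)
    }
  }

  for (col in lag_cols) {
    for (k in windows) {
      new_name <- paste0(col, "_ra_", k)
      # rollapplyr() returns NA for any window that contains an NA,
      # and fill = NA pads the first k - 1 positions
      df <- df %>% mutate(!!new_name := rollapplyr(.data[[col]], k, mean, fill = NA))
    }
  }

  df
}

# Same dummy data as in the question
set.seed(123)
foo <- data.frame(
  date = seq(as.Date("2020-01-01"), length.out = 20, by = "day"),
  var1 = sample.int(20),
  var2 = sample.int(20))

out <- add_lag_ma_features(foo, cols = c("var1", "var2"),
                           lags = c(1, 2), windows = c(3, 6))
names(out)
```

This sketch computes rolling means for every lag column it creates. To restrict them to some lags only (for example just lag 1, as in the question), filter lag_cols before the second loop. The naming pattern can be changed by editing the two paste0() calls.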