Getting rolling average of multiple column by multiple condition, with dplyr and apply family

954 Views Asked by At

I'm performing an analysis on basketball data. This is how my dataset looks like (a really exemplified version of it):

df<-data.frame(gmID = 1:20, 
               H.Team = c("CLE", "MIA", "LAL", "PHI", "CLE", "DET", "CHI", "DAL", "UTA", "PHO", "POR", "WAS", "ORL", "CHA", "BOS", "CHI", "ATL", "DAL", "CLE", "WAS"),
               A.Team = c("WAS", "BOS", "DAL", "DEN", "IND", "HOU", "SAC", "WAS", "DAL", "CLE", "LAL", "OKC", "DEN", "IND", "MIL", "CLE", "HOU", "MIA", "UTA", "DEN"),
               H.PTS = c(94, 120, 91, 84, 88, 96, 93, 95, 113, 85, 116, 86, 102, 90, 88, 86, 102, 104, 88, 111),
               A.PTS = c(84, 107, 99, 75, 90, 105, 87, 99, 94, 87, 106, 84, 89, 89, 99, 115, 109, 84, 86, 88),
               H.AST = c(22, 25, 24, 18, 18, 21, 21, 26, 24, 16, 19, 27, 24, 18, 22, 23, 23, 27, 23, 25),
               A.AST = c(26, 24, 22, 19, 22, 28, 14, 22, 20, 19, 21, 18, 22, 19, 26, 34, 22, 18, 19, 16))
df
   gmID H.Team A.Team H.PTS A.PTS H.AST A.AST
1     1    CLE    WAS    94    84    22    26
2     2    MIA    BOS   120   107    25    24
3     3    LAL    DAL    91    99    24    22
4     4    PHI    DEN    84    75    18    19
5     5    CLE    IND    88    90    18    22
6     6    DET    HOU    96   105    21    28
7     7    CHI    SAC    93    87    21    14
8     8    DAL    WAS    95    99    26    22
9     9    UTA    DAL   113    94    24    20
10   10    PHO    CLE    85    87    16    19
11   11    POR    LAL   116   106    19    21
12   12    WAS    OKC    86    84    27    18
13   13    ORL    DEN   102    89    24    22
14   14    CHA    IND    90    89    18    19
15   15    BOS    MIL    88    99    22    26
16   16    CHI    CLE    86   115    23    34
17   17    ATL    HOU   102   109    23    22
18   18    DAL    MIA   104    84    27    18
19   19    CLE    UTA    88    86    23    19
20   20    WAS    DEN   111    88    25    16

To simplify the problem, I selected 20 rows from the original dataset and just 2 pair of game stats, points (PTS) and assist (AST) made by home (H.) and away (A.) team (the are 50 more pairs of game stats in my data). Note that rows are sorted by date, so they don't need to get arranged.

The goal is to perform a prediction of the game result (1 if H.Team wins, 0 if A.Team wins), however the form the data is in now is useless, because the stats are reported once the match ended.

So, the idea is to replace each value of the data frame with the rolling mean of the previous n match. I will set n=7 or n=10 in my work, but that doesn't really matter now, so to simplify I will set n=3.

The output should look like this:

   gmID H.Team A.Team   H.PTSav   A.PTSav  H.ASTav  A.ASTav
1     1    CLE    WAS        NA        NA       NA       NA
2     2    MIA    BOS        NA        NA       NA       NA
3     3    LAL    DAL        NA        NA       NA       NA
4     4    PHI    DEN        NA        NA       NA       NA
5     5    CLE    IND  94.00000        NA 22.00000       NA
6     6    DET    HOU        NA        NA       NA       NA
7     7    CHI    SAC        NA        NA       NA       NA
8     8    DAL    WAS  99.00000  84.00000 22.00000 26.00000
9     9    UTA    DAL        NA  97.00000       NA 24.00000
10   10    PHO    CLE        NA  91.00000       NA 20.00000
11   11    POR    LAL        NA  91.00000       NA 24.00000
12   12    WAS    OKC  91.50000        NA 24.00000       NA
13   13    ORL    DEN        NA  75.00000       NA 19.00000
14   14    CHA    IND        NA  90.00000       NA 22.00000
15   15    BOS    MIL 107.00000        NA 24.00000       NA
16   16    CHI    CLE  93.00000  89.66667 21.00000 19.66667
17   17    ATL    HOU        NA 105.00000       NA 28.00000
18   18    DAL    MIA  96.00000 120.00000 22.66667 25.00000
19   19    CLE    UTA  96.66667 113.00000 23.66667 24.00000
20   20    WAS    DEN  89.66667  82.00000 25.00000 20.50000

For example, for team CLE, that played 5 games, the average PTS values are the following:

   gmID    avpts
1    1       NA    ---> NA
2    5 94.00000    ---> 94/1 
3   10 91.00000    ---> (94+88)/2 
4   16 89.66667    ---> (94+88+87)/3 
5   19 96.66667    ---> (88+87+115)/3 

I used dplyr and in particular zoo::rollaply function to get the values displayed above, the code is the following:

library(dplyr)
library(zoo)

sub<- df %>%
  filter(H.Team == "CLE" | A.Team == "CLE") %>%
  mutate(avpts = lag(rollapply(ifelse(H.Team == "CLE", H.PTS,A.PTS), width=3, FUN=mean, align="right", fill=NA, partial=1))) %>%
  select(gmID,avpts)
sub

I did this just for a team and a variable, but I could easily do it for more variables specifying it in mutate(), like this:

 mutate(avpts = lag(rollapply(ifelse(H.Team == "CLE", H.PTS,A.PTS), width=3, FUN=mean, align="right", fill=NA, partial=1)),
     avast = lag(rollapply(ifelse(H.Team == "CLE", H.AST,A.AST), width=3, FUN=mean, align="right", fill=NA, partial=1)))

The problem is that I should do this for other 50 variables, and above all I need to calculate values for all the teams, not just for one. Furthermore I obtained a column with the right values, but I don't know how to replace them in their "correct" position.

My idea to (partially) solve the problem is to wrap the code above in a function and then use another function from the apply family to get values for all the teams, without using a for loop.

I wrote the following function:

avstats<- function(team) {
  sub <- df %>%
    filter(.data$H.Team == !!team | .data$A.Team == !!team) %>%
    mutate(avpts = lag(rollapply(ifelse(H.Team == !!team, .data$H.PTS, .data$A.PTS),3,mean,align="right",fill=NA,partial=1))) %>%
    select(.data$gmID, .data$avpts)
}

Finally i used lapply() through a list of the teams that are in this small dataset.

teams <- c("CLE", "MIA", "LAL", "PHI", "DET", "CHI", "DAL", "UTA", "PHO", "POR", "WAS", "ORL", "CHA", "BOS", "ATL", "DEN", "IND", "HOU", "SAC", "OKC","MIL")

lapply(teams,avstats)

And everything seems working good with the two functions.

But there are still two main question that I'd like to get answered:

  • How can I get the values for all the others variable without writing a row for each stat? (i.e. how to get average assist, turnovers, ecc..)

  • How to "combine" and "put in the right place" the new average values generated in such a way that the original data structure are still the same?

Maybe I should modify my function avstats adding some arguments and as consequence use another apply() function such as mapply(), but i really don't know how to do.

1

There are 1 best solutions below

3
On BEST ANSWER

Do you want this? (mean_run from library(runner) used).

  • You can automate this process for as many variables you want. Just use their names in .cols argument of mutate(across...
  • To change rolling window size just change k in mean_run as per choice.
df %>% pivot_longer(!gmID, names_to = c("H_T", ".value"),
                    names_pattern = "(.+)\\.(.+)") %>%
  group_by(Team) %>%
  mutate(across(.cols = c(PTS, AST), 
                ~ runner::mean_run(x = ., k = 3, lag = 1), 
                .names = '{.col}_av')) %>%
  pivot_wider(id_cols = gmID, 
              names_from = H_T, 
              names_glue = "{H_T}_{.value}", 
              values_from = -c(gmID, H_T))

# A tibble: 20 x 11
    gmID H_Team A_Team H_PTS A_PTS H_AST A_AST H_PTS_av A_PTS_av H_AST_av A_AST_av
   <int> <chr>  <chr>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1     1 CLE    WAS       94    84    22    26     NA       NA       NA       NA  
 2     2 MIA    BOS      120   107    25    24     NA       NA       NA       NA  
 3     3 LAL    DAL       91    99    24    22     NA       NA       NA       NA  
 4     4 PHI    DEN       84    75    18    19     NA       NA       NA       NA  
 5     5 CLE    IND       88    90    18    22     94       NA       22       NA  
 6     6 DET    HOU       96   105    21    28     NA       NA       NA       NA  
 7     7 CHI    SAC       93    87    21    14     NA       NA       NA       NA  
 8     8 DAL    WAS       95    99    26    22     99       84       22       26  
 9     9 UTA    DAL      113    94    24    20     NA       97       NA       24  
10    10 PHO    CLE       85    87    16    19     NA       91       NA       20  
11    11 POR    LAL      116   106    19    21     NA       91       NA       24  
12    12 WAS    OKC       86    84    27    18     91.5     NA       24       NA  
13    13 ORL    DEN      102    89    24    22     NA       75       NA       19  
14    14 CHA    IND       90    89    18    19     NA       90       NA       22  
15    15 BOS    MIL       88    99    22    26    107       NA       24       NA  
16    16 CHI    CLE       86   115    23    34     93       89.7     21       19.7
17    17 ATL    HOU      102   109    23    22     NA      105       NA       28  
18    18 DAL    MIA      104    84    27    18     96      120       22.7     25  
19    19 CLE    UTA       88    86    23    19     96.7    113       23.7     24  
20    20 WAS    DEN      111    88    25    16     89.7     82       25       20.5