I'm working with pitch-by-pitch baseball data and am looking to create a number of rolling averages, a few of which are conditional based on another column. The data is 6 million rows, so ideally the method isn't too inefficient. I understand dplyr is probably not the best choice for efficiency - however would like to utilize it if at all possible. Working with 32 GB of memory so I do have a little wiggle room
Here's an example of the data:
mydata <- data.frame(pitch_type = c("FB", "SI", "CU", "FB", "CH", "FB", "FS", "SL", "FB", "CH"),
velocity = c(99, 97, 83, 97, 85, 101, 82, 84, 100, 83))
mydata
pitch_type velocity
1 FB 99
2 SI 97
3 CU 83
4 FB 97
5 CH 85
6 FB 101
7 FS 82
8 SL 84
9 FB 100
10 CH 83
In my actual code I have data grouped by pitcher, sorted chronologically, etc. but basically I just want to be able to calculate rolling averages for certain pitch types. I also cannot have NA values when pitch type is not equal to what I'm trying to calculate, I'd like for it to just revert to the previous calculation.
Here's an example of what I'm looking for. Here I want to calculate average velocity for last two observations where pitch_type == "FB":
mydata_updated
pitch_type velocity l2_velo_fb
1 FB 99 NA
2 SI 97 NA
3 CU 83 NA
4 FB 97 98.00
5 CH 85 98.00
6 FB 101 99.00
7 FS 82 99.00
8 SL 84 99.00
9 FB 100 100.5
10 CH 83 100.5
Seems relatively straightforward, but for the life of me cannot find similar examples online to what I'm looking for. I have numerous different columns and conditions to work through in creating new columns, so creating a new dataframe for each filter is far from ideal.
I did find this example:
mutate(last1000FBvelo = ifelse(pitch_type %in% c("FB"),
rollapply(release_speed, 1000, mean, fill = NA, align = 'right', na.rm = TRUE), NA),
but instead of NA I need it to revert to whatever the previous value was. Also, just a few of these columns took about 30 minutes to run.
I also realize there are a number of packages that can compute rolling averages (zoo, RccpRoll, slider, runner are ones I've came across). It's hard to tell what is the most efficient for my use case - assuming an of them can achieve the conditional filtering.
Any input is greatly appreciated
Also - not important or even necessary, but would also be awesome if there's a simple option to have a weighted moving average that weights more recent observations a little more.
I'm not certain about
99.25, but here's a method:The
if (...) zoo::rollmeanr(..) else NAis really just a small efficiency: if you don't want to calculate a rolling mean for non-FBdata, then usingif (.)beforerollmeanrkeeps us from calculating and discarding the results. (That's also the reason I group bypitch_type.)I think it's incorrect because a simple 2-wide rolling right-mean is:
The use of
.by=is predicated ondplyr_1.1.0or newer. If you have older, then change to