Vectorize loops when calculating rolling means with variable amounts of data

Question

Vectorize loops when calculating rolling means with variable amounts of data

128 Views Asked by EdwardTabbey At 26 September 2025 at 18:08

I have a dataframe of daily water chemistry values taken from deployed sensors. I’m trying to calculate rolling 7 day averages of daily maximum values. This in in-situ environmental data, the data can be a bit messy.

Here are the rules for calculating the averages and assigning quality levels:

Data is graded and given a quality value (DQL) for the day (dyDQL).
'A' is high quality, 'B' is medium, and 'E' is poor.
7 day average is calculated at the end of a 7 day period.
Dataset needs only 6 complete days to calculate a 7 day average (Can miss 1 day of data)
If there are at least 6 days worth ‘A’ and ‘B’ graded data and 1 day of ‘E’, discard ‘E’ data and calculate the 7-day average using the 6 days of ‘A’ and ‘B’ data

I have the code working using a loop that loops through each result, creates a new dataframe containing the 7 day window, and then calculates the moving average. See minimal example below.

Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:

  daily_data <- tibble::tribble(
    ~Monitoring.Location.ID,        ~date, ~dyMax, ~dyMin, ~dyDQL,
    "River 1", as.Date("2018-07-01"), 24.219, 22.537,    "A",
    "River 1", as.Date("2018-07-02"), 24.557, 20.388,    "A",
    "River 1", as.Date("2018-07-03"), 24.847, 20.126,    "A",
    "River 1", as.Date("2018-07-04"), 25.283, 20.674,    "A",
    "River 1", as.Date("2018-07-05"), 25.501, 20.865,    "A",
    "River 1", as.Date("2018-07-06"),  25.04, 21.008,    "A",
    "River 1", as.Date("2018-07-07"), 24.847, 20.674,    "A",
    "River 1", as.Date("2018-07-08"), 23.424, 20.793,    "B",
    "River 1", as.Date("2018-07-09"), 22.657, 18.866,    "E",
    "River 1", as.Date("2018-07-10"), 22.298,   18.2,    "A",
    "River 1", as.Date("2018-07-12"),  22.92, 19.008,    "A",
    "River 1", as.Date("2018-07-13"), 23.978, 19.532,    "A",
    "River 1", as.Date("2018-07-14"), 24.508, 19.936,    "A",
    "River 1", as.Date("2018-07-15"), 25.137, 20.627,    "A",
    "River 1", as.Date("2018-07-19"), 24.919, 20.674,    "A"
  )
  
for (l in seq_len(nrow(daily_data))){
  
  
  station_7day <- filter(daily_data,
                         dplyr::between(date, daily_data[[l,'date']] - lubridate::days(6), daily_data[l,'date']))
  
  
  daily_data[l,"ma.max7"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7  ~ mean(station_7day$dyMax),
                                              nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ mean(station_7day$dyMax),
                                              max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6  & l >=7 ~ mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
                                              nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~  mean(station_7day$dyMax),
                                              TRUE ~ NA_real_)
  
  daily_data[l, "ma.max7_DQL"] <-  dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7  ~ "A",
                                                    nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ "B",
                                                    max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6  & l >=7 ~ "B",
                                                    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ "E",
                                                    TRUE ~ NA_character_)
  
  
}

The expected results are:

tibble::tribble(
  ~Monitoring.Location.ID,        ~date, ~dyMax, ~dyMin, ~dyDQL,         ~ma.max7, ~ma.max7_DQL,
                "River 1", as.Date("2018-07-01"), 24.219, 22.537,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-02"), 24.557, 20.388,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-03"), 24.847, 20.126,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-04"), 25.283, 20.674,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-05"), 25.501, 20.865,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-06"),  25.04, 21.008,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-07"), 24.847, 20.674,    "A", 24.8991428571429,          "A",
                "River 1", as.Date("2018-07-08"), 23.424, 20.793,    "B", 24.7855714285714,          "B",
                "River 1", as.Date("2018-07-09"), 22.657, 18.866,    "E", 24.5141428571429,          "B",
                "River 1", as.Date("2018-07-10"), 22.298,   18.2,    "A",            24.15,          "B",
                "River 1", as.Date("2018-07-12"),  22.92, 19.008,    "A",           23.531,          "E",
                "River 1", as.Date("2018-07-13"), 23.978, 19.532,    "A",           23.354,          "E",
                "River 1", as.Date("2018-07-14"), 24.508, 19.936,    "A",          23.2975,          "E",
                "River 1", as.Date("2018-07-15"), 25.137, 20.627,    "A",           23.583,          "E",
                "River 1", as.Date("2018-07-19"), 24.919, 20.674,    "A",               NA,           NA
  )

The code works fine, but is very slow when calculating values for multi-year levels of data with multiple different water quality parameters at multiple locations.

Due to the fact that a 7 day value can be calculated from 6 days of data, I don’t think I can use any of the rolling functions from the zoo package. I don’t think I can use the roll_mean function from the roll package, due to the variable nature of discarding 1 days worth of ‘E’ data when there is 6 days of ‘A’ or ‘B’ data.

Is there way to vectorize this, in order to avoid looping through every row of data?

Original Q&A

There are 2 best solutions below

Jon Spring On 16 April 2021 at 23:21

Here's a vectorized approach using slider:slide_index to calculate the high quality and backup quality values, then combine for best available:

library(tidyverse); library(slider)

The following function groups by location, calculates the weekly mean (including everything from date-6 to and including date) and number of observations included, then filters to just having 6+ observations.

get_weekly_by_loc <- function(df) {
  df %>%
    group_by(Monitoring.Location.ID) %>%
    mutate(mean = slide_index_dbl(dyMax, date, mean, .complete = TRUE, .before = lubridate::days(6)),
           count = slide_index_dbl(dyMax, date, ~sum(!is.na(.)), .before = lubridate::days(6))) %>%
    ungroup() %>%
    filter(count >= 6)
}

Then we can run this function on just A/B data and overall:

daily_data_high_quality <- daily_data %>%
  filter(dyDQL %in% c("A", "B")) %>%
  get_weekly_by_loc() %>%
  select(Monitoring.Location.ID, date, high_qual_mean = mean)
  
daily_data_backup <- daily_data %>%
  get_weekly_by_loc() %>%
  select(Monitoring.Location.ID, date, backup_mean = mean)

Then join those and use the high quality if available:

daily_data %>%
  left_join(daily_data_high_quality) %>%
  left_join(daily_data_backup) %>%
  mutate(max7_DQL = coalesce(high_qual_mean, backup_mean)) %>%
  mutate(moar_digits = format(max7_DQL, nsmall = 6))

Result

# A tibble: 15 x 9
   Monitoring.Location.ID date       dyMax dyMin dyDQL high_qual_mean backup_mean max7_DQL moar_digits
   <chr>                  <date>     <dbl> <dbl> <chr>          <dbl>       <dbl>    <dbl> <chr>      
 1 River 1                2018-07-01  24.2  22.5 A               NA          NA       NA   "       NA"
 2 River 1                2018-07-02  24.6  20.4 A               NA          NA       NA   "       NA"
 3 River 1                2018-07-03  24.8  20.1 A               NA          NA       NA   "       NA"
 4 River 1                2018-07-04  25.3  20.7 A               NA          NA       NA   "       NA"
 5 River 1                2018-07-05  25.5  20.9 A               NA          NA       NA   "       NA"
 6 River 1                2018-07-06  25.0  21.0 A               NA          NA       NA   "       NA"
 7 River 1                2018-07-07  24.8  20.7 A               24.9        24.9     24.9 "24.899143"
 8 River 1                2018-07-08  23.4  20.8 B               24.8        24.8     24.8 "24.785571"
 9 River 1                2018-07-09  22.7  18.9 E               NA          24.5     24.5 "24.514143"
10 River 1                2018-07-10  22.3  18.2 A               24.4        24.2     24.4 "24.398833"
11 River 1                2018-07-12  22.9  19.0 A               NA          23.5     23.5 "23.531000"
12 River 1                2018-07-13  24.0  19.5 A               NA          23.4     23.4 "23.354000"
13 River 1                2018-07-14  24.5  19.9 A               NA          23.3     23.3 "23.297500"
14 River 1                2018-07-15  25.1  20.6 A               NA          23.6     23.6 "23.583000"
15 River 1                2018-07-19  24.9  20.7 A               NA          NA       NA   "       NA"

**AnilGoyal** · Accepted Answer

I used tidyverse and runner and have done it like this in a single piped syntax. Syntax explanation-

I first collected seven days (as per logic provided) DQL and MAX values into a list using runner.
Before doing that, I have converted DQL into an ordered factored variable, which will be used in last syntax.
Secondly, i used purrr::map to modify each list according to given conditions,
- Not less than six are to be counted
- If there is exactly one E in 7 values, that has not to be counted.
Finally I unnested the list using unnest_wider

library(runner)
daily_data %>% mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = T),
                      d = runner(x = data.frame(a = dyMax, b= dyDQL),
                                   k = "7 days",
                                   lag = 0,
                                   idx = date,
                                   f = function(x) list(x))) %>%
  mutate(d = map(d, ~ .x %>% group_by(b) %>%
                     mutate(c = n()) %>%
                     ungroup() %>%
                     filter(!n() < 6) %>%
                     filter(!(b == 'E' & c == 1 & n() == 7)) %>%
                     summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)), ma.max7.DQL = max(b))
                   )
         ) %>%
  unnest_wider(d)

# A tibble: 15 x 7
   Monitoring.Location.ID date       dyMax dyMin dyDQL ma.max7 ma.max7.DQL
   <chr>                  <date>     <dbl> <dbl> <ord>   <dbl> <ord>      
 1 River 1                2018-07-01  24.2  22.5 A        NA   NA         
 2 River 1                2018-07-02  24.6  20.4 A        NA   NA         
 3 River 1                2018-07-03  24.8  20.1 A        NA   NA         
 4 River 1                2018-07-04  25.3  20.7 A        NA   NA         
 5 River 1                2018-07-05  25.5  20.9 A        NA   NA         
 6 River 1                2018-07-06  25.0  21.0 A        24.9 A          
 7 River 1                2018-07-07  24.8  20.7 A        24.9 A          
 8 River 1                2018-07-08  23.4  20.8 B        24.8 B          
 9 River 1                2018-07-09  22.7  18.9 E        24.8 B          
10 River 1                2018-07-10  22.3  18.2 A        24.4 B          
11 River 1                2018-07-12  22.9  19.0 A        23.5 E          
12 River 1                2018-07-13  24.0  19.5 A        23.4 E          
13 River 1                2018-07-14  24.5  19.9 A        23.3 E          
14 River 1                2018-07-15  25.1  20.6 A        23.6 E          
15 River 1                2018-07-19  24.9  20.7 A        NA   NA

Vectorize loops when calculating rolling means with variable amounts of data

There are 2 best solutions below

Related Questions in R

Related Questions in APPLY

Related Questions in PURRR

Related Questions in ROLLING-COMPUTATION

Related Questions in RUNNER

Trending Questions

Popular # Hahtags

Popular Questions