I have a dataframe of daily water chemistry values taken from deployed sensors. I’m trying to calculate rolling 7 day averages of daily maximum values. This in in-situ environmental data, the data can be a bit messy.
Here are the rules for calculating the averages and assigning quality levels:
- Data is graded and given a quality value (DQL) for the day (dyDQL).
- 'A' is high quality, 'B' is medium, and 'E' is poor.
- 7 day average is calculated at the end of a 7 day period.
- Dataset needs only 6 complete days to calculate a 7 day average (Can miss 1 day of data)
- If there are at least 6 days worth ‘A’ and ‘B’ graded data and 1 day of ‘E’, discard ‘E’ data and calculate the 7-day average using the 6 days of ‘A’ and ‘B’ data
I have the code working using a loop that loops through each result, creates a new dataframe containing the 7 day window, and then calculates the moving average. See minimal example below.
Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:
daily_data <- tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A",
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A",
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A",
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A",
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A",
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A",
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A"
)
for (l in seq_len(nrow(daily_data))){
station_7day <- filter(daily_data,
dplyr::between(date, daily_data[[l,'date']] - lubridate::days(6), daily_data[l,'date']))
daily_data[l,"ma.max7"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7 ~ mean(station_7day$dyMax),
nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ mean(station_7day$dyMax),
max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >=7 ~ mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ mean(station_7day$dyMax),
TRUE ~ NA_real_)
daily_data[l, "ma.max7_DQL"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7 ~ "A",
nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ "B",
max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >=7 ~ "B",
nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ "E",
TRUE ~ NA_character_)
}
The expected results are:
tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL, ~ma.max7, ~ma.max7_DQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A", NA, NA,
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A", NA, NA,
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A", NA, NA,
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A", NA, NA,
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A", NA, NA,
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A", NA, NA,
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A", 24.8991428571429, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B", 24.7855714285714, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E", 24.5141428571429, "B",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A", 24.15, "B",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A", 23.531, "E",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A", 23.354, "E",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A", 23.2975, "E",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A", 23.583, "E",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A", NA, NA
)
The code works fine, but is very slow when calculating values for multi-year levels of data with multiple different water quality parameters at multiple locations.
Due to the fact that a 7 day value can be calculated from 6 days of data, I don’t think I can use any of the rolling functions from the zoo package. I don’t think I can use the roll_mean function from the roll package, due to the variable nature of discarding 1 days worth of ‘E’ data when there is 6 days of ‘A’ or ‘B’ data.
Is there way to vectorize this, in order to avoid looping through every row of data?
I used
tidyverse
andrunner
and have done it like this in a single piped syntax. Syntax explanation-runner
.purrr::map
to modify each list according to given conditions,E
in 7 values, that has not to be counted.unnest_wider