Excluding variables and those near them- r dplyr

I have a dataset which is the output of multiple data loggers measuring temperature and lux (light intensity) at 1-hour intervals.
There are approx. 250,000 data points. I'm having trouble with temperature readings from 'sun flecks', where a shaft of light hits the logger, heating it up quickly and then giving 'warm' readings for the rest of the day. I can use dplyr to subset these data (i.e. LUX > 32,000), but I would like to remove all readings from a given day if the logger had a LUX > 32,000 reading on that day. For reference, each data logger has name, date and time variables.

Is there a way to do this with dplyr?


I'm first sorting the data frame by time (this may not be necessary if your data is already sorted appropriately), because the cumulative sum below depends on row order. Then, for each logger and date, I'm marking the first LUX > 32000 reading and every reading after it on that day. The flag is computed in a grouped mutate, and the data is ungrouped again before filtering on it.

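library(dplyr)

df %>%
  # cumsum() depends on row order, so sort within logger and day
  arrange(name, date, time) %>% 
  group_by(name, date) %>%
  mutate(
    # TRUE from the first LUX > 32000 reading of the day onward
    fleck = cumsum(LUX > 32000) > 0
  ) %>%
  ungroup() %>%
  filter(!fleck)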

Edit

If you want to remove the entire day, you can change how the fleck variable is defined. For example,

fleck = any(LUX > 32000)
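
Put together, the whole-day version might look like the sketch below. The column names (name, date, time, LUX) follow the question; the toy values and the object names df and cleaned are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): logger A has one sun fleck on 2020-06-01
df <- tibble(
  name = c("A", "A", "A", "A", "B", "B"),
  date = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02", "2020-06-02",
                   "2020-06-01", "2020-06-01")),
  time = c("10:00", "11:00", "10:00", "11:00", "10:00", "11:00"),
  LUX  = c(500, 40000, 600, 700, 800, 900)
)

cleaned <- df %>%
  group_by(name, date) %>%
  # flag every reading of a logger-day that contains at least one fleck
  mutate(fleck = any(LUX > 32000)) %>%
  ungroup() %>%
  filter(!fleck) %>%
  select(-fleck)

cleaned
```

Here both readings from logger A on 2020-06-01 are dropped, and the other four rows are kept. No sorting is needed in this variant, since any() does not depend on row order.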

You can use a somewhat simple function like this:

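# Flag every TRUE in `lgl`, plus the `before` elements preceding it and the `after` elements following it
beforeafter <- function(lgl, before=1L, after=1L, default=FALSE) {
  # shift left by i, so elements up to `before` positions ahead of a TRUE are flagged
  befores <- if (before > 0L) sapply(seq_len(before), function(i) c(tail(lgl, n=-i), rep(default, i))) else c()
  # shift right by i, so elements up to `after` positions behind a TRUE are flagged
  afters <- if (after > 0L) sapply(seq_len(after), function(i) c(rep(default, i), head(lgl, n=-i))) else c()
  # an element is flagged if the original or any shifted copy is TRUE
  apply(cbind(befores, lgl, afters), 1, any)
}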

vec <- (1:10 == 5)
vec
#  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
beforeafter(vec)
#  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
beforeafter(vec, before=2, after=0)
#  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

As an example:

library(dplyr)
library(tibble)  # for rownames_to_column()

rownames_to_column(mtcars) %>%
  select(rowname, cyl, gear) %>%
  filter(cyl == 4L, gear == 3L)
#         rowname cyl gear
# 1 Toyota Corona   4    3

rownames_to_column(mtcars) %>%
  select(rowname, cyl, gear) %>%
  filter(beforeafter(cyl == 4L & gear == 3L))
#            rowname cyl gear
# 1   Toyota Corolla   4    4
# 2    Toyota Corona   4    3
# 3 Dodge Challenger   8    3
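
For the logger data, the neighbours should probably be found within each logger rather than across the whole table. A minimal sketch of that, for the one-sample-either-side case only, using dplyr's lead() and lag() in place of beforeafter(); the toy values and the names readings, bad, near_bad and cleaned are made up for illustration:

```r
library(dplyr)

# Toy data (hypothetical values): one high-LUX reading per logger
readings <- tibble(
  name = rep(c("A", "B"), each = 5),
  LUX  = c(10, 40000, 20, 30, 40,
           10, 20, 30, 40, 50000)
)

cleaned <- readings %>%
  group_by(name) %>%
  mutate(
    bad = LUX > 32000,
    # flag the bad reading plus one reading before and one after it,
    # without reaching across into the neighbouring logger
    near_bad = lag(bad, default = FALSE) | bad | lead(bad, default = FALSE)
  ) %>%
  ungroup() %>%
  filter(!near_bad)

cleaned
```

This assumes the rows are already in time order within each logger. For wider windows, beforeafter() can be called inside the grouped mutate instead.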

This works well if your data is sampled at a constant frequency and you want to remove all observations within a fixed number of samples of a known problem. It does not work as well when you want "within some time" from variable-frequency data. For that, I think you'll need to apply dist iteratively to all "known bad" points.
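
One possible sketch for the variable-frequency case compares each timestamp directly against the timestamps of the flagged readings, rather than using dist. The helper name near_bad, its window argument (in seconds) and the toy data are all assumptions for illustration, and the approach has not been tuned for 250,000 rows.

```r
# Hypothetical helper: TRUE for every reading taken within `window` seconds
# of any reading flagged in `bad`
near_bad <- function(time, bad, window) {
  bad_times <- as.numeric(time[bad])
  if (length(bad_times) == 0L) return(rep(FALSE, length(time)))
  vapply(
    as.numeric(time),
    function(t) any(abs(t - bad_times) <= window),
    logical(1)
  )
}

# Irregularly spaced toy readings (made-up values)
times <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + c(0, 600, 4000, 4300, 9000)
lux   <- c(100, 200, 40000, 300, 150)

flagged <- near_bad(times, lux > 32000, window = 3600)
flagged
# [1] FALSE  TRUE  TRUE  TRUE FALSE
```

The result can be used in filter() in the same way as beforeafter(), for example filter(!near_bad(time, LUX > 32000, 3600)) on a data frame with a date-time column.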