Use of "cumsum", "difftime" and "lag"

My question concerns three columns in my dataset.

The first column, buffer, shows whether a GPS point of an animal is inside the buffer zone (yes/no values). The second, DateTime, is the timestamp of the GPS point, and the third, dt1, is the time spent inside the buffer zone.

If a "yes" row in the buffer column sits between two "no" rows, I want to calculate the time difference between this GPS point and the previous one and write it to dt1. I managed that part.

The problem comes with multiple consecutive "yes" rows, which mean that the animal stayed in the buffer zone for longer and several consecutive GPS points fall inside the buffer. There I want the time difference between the first and the last "yes" row.

Here is my code. It returns NA for every "yes" row that follows another "yes" row, i.e. any "yes" row that is not isolated. What I am trying to get is the overall time difference on the final "yes" row of each series of "yes" rows.

trips_with_buffer_2016_df <- trips_with_buffer_2016_df %>%
  group_by(tripID) %>%
  mutate(
    dt1 = ifelse(buffer == 'yes', 
                 ifelse(lag(buffer, default = 'no') == 'no', 
                        difftime(DateTime, lag(DateTime), units = "mins"),
                        cumsum(as.numeric(difftime(DateTime, lag(DateTime,), units = "mins")))
                 ), 
                 NA_real_)
  )

The "tripID" column groups the GPS points by previously identified trips.

I know that the problem is in the cumsum line, but I cannot get it to work. Thanks a lot in advance!

There are 3 best solutions below

You could simplify indexing the in-buffer and out-of-buffer stretches with rle. Here 0 = out and 1 = in, but the values could just as well be 'yes' and 'no'.

buffer = c(1,0,0,0,1,1,1,1,0,0,0,0,1,0,0,1,1,1,1,1,0,0)
rle(buffer)
#> Run Length Encoding
#>   lengths: int [1:8] 1 3 4 4 1 2 5 2
#>   values : num [1:8] 1 0 1 0 1 0 1 0

# last row of every run
end = cumsum(rle(buffer)$lengths)
end
#> [1]  1  4  8 12 13 15 20 22

# first row of every run
start = end - rle(buffer)$lengths + 1
start
#> [1]  1  2  5  9 13 14 16 21

# which runs are inside the buffer
inbuf = which(rle(buffer)$values == 1)
inbuf
#> [1] 1 3 5 7

start[inbuf]
#> [1]  1  5 13 16

end[inbuf]
#> [1]  1  8 13 20

# example timestamps, one every 5 units
time = seq(5, 110, 5)

# time spent in the second in-buffer run (rows 5 to 8)
sum(diff(time[start[inbuf][2]:end[inbuf][2]]))
#> [1] 15

This is another way to think about it, and possibly one that makes it easier to see what is happening later on, when the details are all but forgotten and an error pops up.
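To tie this back to the columns in the question, here is a minimal sketch of the same rle indexing applied to real timestamps. The toy buffer and DateTime vectors and the names from and stay_mins are made up for illustration. It assumes the clock should start at the fix just before each run, as dt1 does for an isolated "yes" row, and it handles a single trip only, so with several trips the same steps would have to be run per tripID.

```r
# Toy data (hypothetical): one fix every 5 minutes, buffer coded yes/no.
buffer   <- c("no", "yes", "no", "yes", "yes", "yes", "no")
DateTime <- as.POSIXct("2016-06-01 12:00:00", tz = "UTC") + 300 * (0:6)

runs  <- rle(buffer)
end   <- cumsum(runs$lengths)        # last row of every run
start <- end - runs$lengths + 1      # first row of every run
inbuf <- which(runs$values == "yes") # runs inside the buffer

# Assumption: measure from the fix just before the run. A run that opens
# the series has no previous fix, so fall back to its own first row.
from <- pmax(start[inbuf] - 1, 1)

stay_mins <- as.numeric(
  difftime(DateTime[end[inbuf]], DateTime[from], units = "mins")
)

# Write each total on the last "yes" row of its run, NA everywhere else.
dt1 <- rep(NA_real_, length(buffer))
dt1[end[inbuf]] <- stay_mins
dt1
```

If the stay should instead run only from the first to the last "yes" fix, replacing from with start[inbuf] should do it, at the cost of isolated "yes" rows getting 0.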

In the end I split my dataframe further into groups. Each "no" in the buffer column starts a new group, which keeps all the consecutive "yes" rows together. Then I grouped my dataframe by "tripID" and this new "id" and summed all the dt1 values within each group, writing the result to a dt2 column. Lastly I created a dt3 column that keeps only one value from the dt2 column for each group. I attach the code here. I know some parts might be repetitive, but it worked. Have a good day!

trips_with_buffer_2016_df <- trips_with_buffer_2016_df %>%
  group_by(tripID) %>%
  mutate(
    dt1 = ifelse(buffer == 'yes', 
                 ifelse(lag(buffer, default = 'no') == 'no', 
                        difftime(DateTime, lag(DateTime), units = "mins"),
                        difftime(DateTime, lag(DateTime), units = "mins")
                 ), 
                 NA_real_)
  )


trips_with_buffer_2016_df <- trips_with_buffer_2016_df %>%
  group_by(tripID, cumsum(buffer == 'no')) %>%
  mutate(
    dt2 = ifelse(buffer == 'yes', 
                 ifelse(any(buffer == 'yes'), 
                        sum(as.numeric(dt1), na.rm = TRUE),
                        NA_real_
                 ), 
                 NA_real_)
  ) %>%
  ungroup()


trips_with_buffer_2016_df <- trips_with_buffer_2016_df %>%
  group_by(tripID, cumsum(buffer == 'no')) %>%
  mutate(
    dt3 = ifelse(buffer == 'yes' & row_number() == which(buffer == 'yes')[1], 
                 ifelse(any(buffer == 'yes'), 
                        sum(as.numeric(dt1), na.rm = TRUE),
                        NA_real_
                 ), 
                 NA_real_)
  ) %>%
  ungroup()
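The three steps above could probably be folded into one pipeline. The following is only a sketch on made-up data: the tibble df and the helper columns step and id are hypothetical names, and it assumes the total should land on the last "yes" row of each series, as the question asks, rather than on the first one as dt3 does.

```r
library(dplyr)

# Toy data (hypothetical): two trips, one fix every 5 minutes.
df <- tibble(
  tripID   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  buffer   = c("no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"),
  DateTime = as.POSIXct("2016-06-01 12:00:00", tz = "UTC") + 300 * (0:8)
)

out <- df %>%
  group_by(tripID) %>%
  mutate(
    # minutes since the previous fix, kept only for "yes" rows
    step = if_else(buffer == "yes",
                   as.numeric(difftime(DateTime, lag(DateTime), units = "mins")),
                   NA_real_),
    # every "no" row opens a new block, so consecutive "yes" rows share an id
    id = cumsum(buffer == "no")
  ) %>%
  group_by(tripID, id) %>%
  mutate(
    # block total, written only on the last row of the block if it is a "yes"
    dt1 = if_else(buffer == "yes" & row_number() == n(),
                  sum(step, na.rm = TRUE),
                  NA_real_)
  ) %>%
  ungroup()

out$dt1
```

Because a block can only end on a "yes" row or consist of a single "no" row, checking row_number() == n() is enough to find the final "yes" of each series.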

I don't think you need nested ifelse if you have unique trip IDs. You should be able to just summarise, get the min/max times per trip, and calculate difftime from those.

trips_with_buffer_2016_df <- trips_with_buffer_2016_df %>%
  group_by(tripID, buffer) %>%
  summarise(MIN_TIME=min(DateTime), MAX_TIME=max(DateTime)) %>%
  mutate(dt1 = difftime(MAX_TIME, MIN_TIME, units = "mins"))
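One thing to watch: grouping by tripID and buffer pools every "yes" row of a trip into a single group, so separate visits to the buffer would be merged. A possible variation, sketched here on made-up data, adds a run counter so each visit gets its own row. The tibble df and the columns run and visits are hypothetical names, and the duration is measured from the first to the last "yes" fix, which gives 0 for an isolated "yes" row and so differs from the dt1 definition in the question.

```r
library(dplyr)

# Toy data (hypothetical): one trip, one fix every 5 minutes.
df <- tibble(
  tripID   = c(1, 1, 1, 1, 1, 1),
  buffer   = c("no", "yes", "no", "yes", "yes", "yes"),
  DateTime = as.POSIXct("2016-06-01 12:00:00", tz = "UTC") + 300 * (0:5)
)

visits <- df %>%
  group_by(tripID) %>%
  # the counter goes up every time buffer switches value within a trip
  mutate(run = cumsum(buffer != lag(buffer, default = first(buffer)))) %>%
  filter(buffer == "yes") %>%
  group_by(tripID, run) %>%
  summarise(MIN_TIME = min(DateTime), MAX_TIME = max(DateTime),
            .groups = "drop") %>%
  mutate(dt1 = as.numeric(difftime(MAX_TIME, MIN_TIME, units = "mins")))

visits
```

This returns one row per visit instead of one value per GPS point, so it would need a join back onto the original data if dt1 has to sit next to each fix.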