Removing overlapping events from data table with intervals

80 Views Asked by At

Given a large (> 100MB) data frame of events with location and timestamps, how can I remove events synchronously occurring in all locations (i.e. putative noise) in R, MATLAB or Python (with reasonable performance)?

A minimal specification of the problem in R would be:

pixel <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
start <- c(1, 3, 6, 8, 1, 3, 5, 7, 8, 1, 4, 7)
end   <- c(2, 4, 7, 9, 2, 4, 6, 8, 9, 3, 5, 9)

events <- data.frame(cbind(pixel, start, end))

# there was an event between 1 and 2s detected everywhere;
# this event would therefore be removed in the desired output:
#
#  pixel start end
#      1     3   4
#      1     6   7
#      1     8   9
#      2     3   4
#      2     5   6
#      2     7   8
#      2     8   9
#      3     4   5
#      3     7   9

I had tried to solve the problem with loops, but the solution is slow. (Experts sometimes recommend to "vectorize" calculations, but I found no way to get rid of the loops.)

Additionally, I had found a related post for the problem in Python on Pandas Data Frame - Remove Overlapping Intervals.

It seems to me that this type of problem should be a common one and is probably already solved by a package, but I couldn't find it.

1

There are 1 best solutions below

1
r2evans On

I think your expected output is incomplete and has more rows than it should. Namely, all three pixels have an event between 1, 2 and 8, 9, so we should be removing two rows from each pixel.

Here's a data.table solution. Note that since we want the comparison to be right-side-open (i.e., 1, 3 does not overlap with 3, 4), I'll momentarily decrease end by an iota, set the keys (required for foverlaps), check for overlaps, then return the iota I subtract.

library(data.table)
events <- data.table(pixel, start, end)

# subtract an iota from `end`, needed for right-side-open
iota <- 1e-9
events[, end := end - iota]
setkey(events, start, end)
events[, overlaps := foverlaps(.SD, events)[, uniqueN(pixel), by = c("i.start", "i.end")]$V1, by = .(pixel)]
# Key: <start, end>
#     pixel start   end overlaps
#     <num> <num> <num>    <int>
#  1:     1     1     2        3
#  2:     2     1     2        3
#  3:     3     1     3        3
#  4:     1     3     4        2
#  5:     2     3     4        2
#  6:     3     4     5        1
#  7:     2     5     6        1
#  8:     1     6     7        1
#  9:     2     7     8        2
# 10:     3     7     9        3
# 11:     1     8     9        3
# 12:     2     8     9        3

The overlaps column now represents a count for how many total unique pixel values are found in the set of overlapping time ranges, including "self". When this number is the same as the number of total unique pixel values, then we have rows that overlap with all other groups.

out <- events[uniqueN(pixel) > overlaps, ][, end := end + iota]
setorder(out, pixel, start, end)
out
#    pixel start   end overlaps
#    <num> <num> <num>    <int>
# 1:     1     3     4        2
# 2:     1     6     7        1
# 3:     2     3     4        2
# 4:     2     5     6        1
# 5:     2     7     8        2
# 6:     3     4     5        1

Follow-up proof, by-row:

    pixel start   end
    <num> <num> <num>
 1:     1     1     2  # overlaps row 5, 10      ALL3
 2:     1     3     4  # overlaps row 6
 3:     1     6     7  # no overlaps
 4:     1     8     9  # overlaps row 9, 12      ALL3
 5:     2     1     2  # overlaps row 1, 10      ALL3
 6:     2     3     4  # overlaps row 2
 7:     2     5     6  # no overlaps
 8:     2     7     8  # no overlaps
 9:     2     8     9  # overlaps row 4, 12      ALL3
10:     3     1     3  # overlaps row 1, 5
11:     3     4     5  # no overlaps
12:     3     7     9  # overlaps rot 4, 12      ALL3

ALL3 rows should be removed (by my interpretation of your logic).