How to efficiently compute a convolution with missing data rows, without expanding the missing rows?


Imagine a tank of some volatile fluid that evaporates at a predictable rate. The tank receives some amount of the fluid every now and then, and we want to model how much fluid is in the tank each time new fluid is added.

(Note: this is not my actual problem; it is an analogy.)

First, generate some sample data

library(tidyverse)

rm(list = ls())

set.seed(123)

original_data <-
  tibble(
    t = sample(50, 10, replace = FALSE),
    amount_incoming = runif(n = 10)
  )

Take a look at the data

original_data |>
  ggplot() +
  aes(x = t, y = amount_incoming) +
  geom_col(width = 0.7)


The data contains no rows for the time steps between additions of new fluid:

# A tibble: 10 × 2
       t amount_incoming
   <int>           <dbl>
 1    31          0.900 
 2    15          0.246 
 3    14          0.0421
 4     3          0.328 
 5    42          0.955 
 6    43          0.890 
 7    37          0.693 
 8    48          0.641 
 9    25          0.994 
10    26          0.656 

In order to compute this amount I must 'complete' the data set by adding the missing rows

completed_data <-
  original_data |>
  complete(t = full_seq(t, period = 1)) |>
  mutate(amount_incoming = replace_na(amount_incoming, 0))

# A tibble: 46 × 2
       t amount_incoming
   <dbl>           <dbl>
 1     3           0.328
 2     4           0    
 3     5           0    
 4     6           0    
 5     7           0    
 6     8           0    
 7     9           0    
 8    10           0    
 9    11           0    
10    12           0    
# ℹ 36 more rows

I use a slightly modified version of the convolve function to get the behaviour I want:

my_convolve = function(x, f){
  n_x = length(x)
  convolve(x, f, type = "o") |> head(n_x) |> zapsmall()
}
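As a small check of what my_convolve does, here is a sketch on a toy input. The three-weight kernel and the impulse vector are made up for illustration and are not part of the data above. The filter is passed with the largest lag first, the same orientation as dexp(100:0, rate = 1).

```r
# Same helper as above: convolution truncated to length(x)
my_convolve <- function(x, f) {
  n_x <- length(x)
  convolve(x, f, type = "o") |> head(n_x) |> zapsmall()
}

# Hypothetical toy kernel: weights applied at lags 0, 1 and 2
kernel <- c(0.5, 0.25, 0.125)

# One unit of fluid arriving at the first time step, nothing afterwards
impulse <- c(1, 0, 0, 0, 0)

# Pass the kernel reversed (largest lag first), as with dexp(100:0, rate = 1)
response <- my_convolve(impulse, rev(kernel))
print(response)
```

The response should reproduce the kernel followed by zeros, which is the behaviour wanted here: each addition contributes from its own time step onwards and never to earlier ones.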

decay_rate_filter <- dexp(100:0, rate = 1)

convolved_data <-
  completed_data |>
  mutate(
    amount_held = my_convolve(amount_incoming, decay_rate_filter)
  )

convolved_data |>
  ggplot() +
  aes(x = t, y = amount_held) +
  geom_col(width = 0.7)


But I have no use for the extra rows, so I filter them out:

final_data <-
  convolved_data |>
  filter(amount_incoming != 0)

My actual problem involves a lot of data, and using this method is very slow. The slow part is the complete function from the tidyr package. Is there an efficient way to get to my end state without using complete?

My last resort would be to write my own custom convolve function using Rcpp, but I am wondering whether there is some pre-existing solution, since I imagine I am not the first to have this requirement.


There is 1 best solution below

CPB On

A data.table version which (ab)uses the key to create an equivalent of the tidyr complete function as called in the question. It appears to scale well in my tests, with t spanning a range of 500 million values and 10 million incoming amounts.

library(data.table)

set.seed(123)

original_data <- data.table(
  t = sample(50, 10, replace = FALSE),
  amount_incoming = runif(n = 10)
)

# This is equivalent to complete(original_data, t = full_seq(t, period = 1))
setkey(original_data, t)
dt_complete <- original_data[.(seq(min(t), max(t), 1))]

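This assumes that the t values in original_data all fall on the expanded sequence seq(min(t), max(t), 1), and it may give unexpected downstream results if the t values in original_data are not unique. The rows added by the join have amount_incoming set to NA, so replace those with 0 before convolving, as the replace_na step in the question does.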
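If the kernel really is exponential, as with the dexp filter in the question, the expansion can in principle be skipped altogether: the amount held at one arrival is the amount held at the previous arrival, decayed over the gap, plus the new contribution. The following is a minimal base R sketch of that idea, not the method of the answer above. The function name sparse_exp_convolve is hypothetical, it assumes unique t values, and it ignores the truncation of the filter at lag 100, which is negligible for rate = 1.

```r
# Sketch: amount held at each arrival time, using only the sparse rows.
# Assumes the kernel is rate * exp(-rate * lag), i.e. dexp(lag, rate),
# and that the arrival times in t are unique.
sparse_exp_convolve <- function(t, amount, rate = 1) {
  ord <- order(t)
  t_sorted <- t[ord]
  a_sorted <- amount[ord]
  n <- length(t_sorted)
  held <- numeric(n)
  carry <- 0
  prev_t <- t_sorted[1]
  for (i in seq_len(n)) {
    # Decay what was already held over the gap, then add the new arrival
    carry <- carry * exp(-rate * (t_sorted[i] - prev_t)) + rate * a_sorted[i]
    held[i] <- carry
    prev_t <- t_sorted[i]
  }
  # Return results in the original row order
  out <- numeric(n)
  out[ord] <- held
  out
}

set.seed(123)
t <- sample(50, 10, replace = FALSE)
amount_incoming <- runif(n = 10)

amount_held <- sparse_exp_convolve(t, amount_incoming, rate = 1)

# Cross-check against the dense approach, building the zero-filled
# vector by direct indexing rather than a join or complete()
dense <- numeric(max(t) - min(t) + 1)
dense[t - min(t) + 1] <- amount_incoming
dense_held <- convolve(dense, dexp(100:0, rate = 1), type = "o")[seq_along(dense)]
print(all.equal(amount_held, dense_held[t - min(t) + 1]))
```

The loop is written in plain R for clarity and touches each arrival once; for very large inputs the same recurrence could be moved to Rcpp. For a kernel that is not exponential this shortcut does not apply, and the zero-filled vector built by indexing in the cross-check may still be a cheaper way to complete the data than a join.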