Imagine a tank of a volatile fluid that evaporates at a predictable rate. The tank receives some amount of the fluid every now and then, and we want to model how much fluid is in the tank each time new fluid is added.
(Note: this is not my actual problem; it is an analogy.)
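In symbols, with the exponential decay assumed by the dexp() filter used below: if an amount a_i arrives at time t_i and the fluid evaporates at rate λ, the amount held at time t is

    amount_held(t) = Σ_{t_i ≤ t} a_i · exp(−λ · (t − t_i))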
First, generate some sample data:
library(tidyverse)
rm(list = ls())
set.seed(123)

original_data <-
  tibble(
    t = sample(50, 10, replace = FALSE),  # 10 distinct integer times in 1..50
    amount_incoming = runif(n = 10)       # amount delivered at each time
  )
Take a look at the data:
original_data |>
  ggplot() +
  aes(x = t, y = amount_incoming) +
  geom_col(width = 0.7)
The data contains no rows for the times in between additions of new fluid:
# A tibble: 10 × 2
       t amount_incoming
   <int>           <dbl>
 1    31          0.900
 2    15          0.246
 3    14          0.0421
 4     3          0.328
 5    42          0.955
 6    43          0.890
 7    37          0.693
 8    48          0.641
 9    25          0.994
10    26          0.656
To compute this amount, I must 'complete' the data set by adding the missing rows:
completed_data <-
  original_data |>
  complete(t = full_seq(t, period = 1)) |>                   # every integer t from min to max
  mutate(amount_incoming = replace_na(amount_incoming, 0))   # nothing arrives at the new rows
# A tibble: 46 × 2
       t amount_incoming
   <dbl>           <dbl>
 1     3           0.328
 2     4           0
 3     5           0
 4     6           0
 5     7           0
 6     8           0
 7     9           0
 8    10           0
 9    11           0
10    12           0
# ℹ 36 more rows
I use a slightly modified version of the convolve function to get the behaviour I want:
my_convolve <- function(x, f) {
  n_x <- length(x)
  # open ("o") convolution returns length(x) + length(f) - 1 values;
  # keep only the first length(x) and round away floating-point noise
  convolve(x, f, type = "o") |> head(n_x) |> zapsmall()
}

# the exp(-lag) kernel is supplied in reverse so that, with convolve(type = "o"),
# lag 0 gets weight exp(0), lag 1 gets exp(-1), ..., out to lag 100
decay_rate_filter <- dexp(100:0, rate = 1)
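As a quick sanity check (hypothetical values, not part of the original data), a single unit of fluid arriving at t = 1 should decay as exp(−lag):
my_convolve(c(1, 0, 0, 0), decay_rate_filter)
#> [1] 1.0000000 0.3678794 0.1353353 0.0497871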
convolved_data <-
  completed_data |>
  mutate(
    amount_held = my_convolve(amount_incoming, decay_rate_filter)
  )
convolved_data |>
  ggplot() +
  aes(x = t, y = amount_held) +
  geom_col(width = 0.7)
But I have no use for the extra rows, so I filter them out (safe here, since the simulated amounts are never exactly zero):
final_data <-
  convolved_data |>
  filter(amount_incoming != 0)
My actual problem involves a lot of data, and this method is very slow. The slow part is the complete() function from the tidyr package.
Is there an efficient way to get to my end state without using complete?
My last resort would be to write my own custom convolve function using Rcpp, but I am wondering whether there is some pre-existing solution, since I imagine I am not the first to have this requirement.


A data.table version which (ab)uses the key to create an equivalent of the tidyr complete() function as called here. It appears to scale well in my tests with a 500-million range of t and 10 million incoming amounts. This assumes that your original_data t values fall within the expanded sequence of complete values seq(min(t), max(t), 1), and it may give unexpected downstream results if the t values in original_data are not unique.
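The answer's code itself is not reproduced above, so the following is only a minimal sketch of the approach as described, assuming integer t values and reusing my_convolve and decay_rate_filter from the question:

library(data.table)

dt <- as.data.table(original_data)
setkey(dt, t)  # sort and index on t

# A keyed join against the full integer sequence plays the role of
# tidyr::complete(): times absent from dt come back as NA rows
completed <- dt[.(seq(min(dt$t), max(dt$t), by = 1L))]
setnafill(completed, fill = 0, cols = "amount_incoming")

completed[, amount_held := my_convolve(amount_incoming, decay_rate_filter)]
final <- completed[amount_incoming != 0]

The keyed join is presumably what makes this scale: data.table binary-searches the sorted key rather than building and matching the full grid the way the tidyr pipeline does.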