I have a table full of dates and reply-to times for technical support tickets, and I'd like to calculate a running rate to find the average reply time over the previous n number of days. The data is in the following format.
| Dates | Reply Time | Ticket ID |
|---|---|---|
| 2024-01-02 | 341 | 1 |
| 2024-01-02 | 31 | 2 |
| 2024-01-03 | 321 | 3 |
| 2024-01-05 | 412 | 4 |
| 2024-01-07 | 93 | 5 |
| 2024-01-07 | 169 | 6 |
I can solve this problem by calculating the average reply time each day and then calculating the average reply time over the previous n number of days from that, but this doesn't take into account the number of observations each day, which will skew the results if there are outliers for certain days. I want to account for the number of observations when I'm averaging to prevent outliers from throwing off the data.
Here I'm using the package runner to get the average reply time each day and calculating a moving average from that.
daily_reply_time <- df_replies %>%
filter(!is.na(reply_time) & !is.na(dates)) %>%
group_by(dates) %>%
reframe(avg_reply_time = mean(reply_time, na.rm = TRUE)) %>%
mutate(
x = "x",
dates = lubridate::ymd(dates)
) %>%
filter(!is.na(dates)) %>%
complete(
nesting(x),
dates = seq(min(dates), max(dates), by = "day")
) %>%
group_by(x) %>%
arrange(dates) %>%
mutate(
dates= lubridate::ymd(dates),
avg_reply_time = ifelse(is.na(avg_reply_time), 0, as.numeric(avg_reply_time )),
running_reply_time_30_days = runner::mean_run(x = avg_reply_time, k = 30, idx = dates)
) %>%
select(-x)
I create a dummy variable x so that nesting works properly; I assume there's a way to skip over that I'm unaware of. Anyways, this would give me averages of 186, 321, 0, 412, 0, and 131 for each day, so when I use runner, I get a moving average of 175 on 2024-01-08, instead of the 227.83 that is expected when you just sum up all the numbers and divide by the number of observations.
If I skip grouping by each date and instead use the complete function, I get an error saying that "'from' must be a finite number." And not using complete and trying to use the runner package does not throw an error, but averages over the previous n rows in the dataset instead of the number of dates.
daily_reply_time <- df_replies %>%
filter(created_at > '2023-12-31') %>%
mutate(
created_at = substr(created_at, 1, 10),
first_reply_time_in_minutes = first_reply_time_in_minutes / 60
) %>%
filter(!is.na(created_at) & !is.na(first_reply_time_in_minutes)) %>%
mutate(x = "x") %>%
complete(
nesting(x),
created_at = seq(min(created_at), max(created_at), by = "day")
)
Is there a way to account for the number of observations when calculating running amounts using runner, or through some other package?
EDIT: The expected output would contain one row for each date, and a moving average reply time over the previous n days, meaning the input and output would have different lengths (amounts are not supposed to match up with the table above, just an example of the expected output given a particular dataframe).
| Dates | Moving Avg. Reply Time |
|---|---|
| 2024-01-02 | 125 |
| 2024-01-03 | 108.3 |
| 2024-01-04 | 108.3 |
| 2024-01-05 | 137 |
| 2024-01-06 | 67 |
| 2024-01-07 | 251 |
To do it I would use the
roll...functions from the{zoo}package.First let’s generate some data. each day in january will be present at least once.
Now let’s calculate the
total_reply_timeand thenumber_of_ticketsby day.The last step is to get the weighted rolling average
Created on 2024-03-26 with reprex v2.0.2