I have a data frame of customer digital visit over time in the form:
|cust_id|datetime|
|1|2020-08-15 15:20|
|1|2020-08-15 16:20|
|1|2020-08-17 12:20|
|1|2020-08-19 14:20|
|1|2020-08-23 09:20|
|2|2020-08-24 08:00|
I'd like to pick out strong signals, as in : customers who visit at least 3 times in 5 days.
My initial thought is that we have to compute ALL sliding window for each customer.
In this example, let's take cust1 :
5-day window starting 2020-08-15, ending 2020-08-19, total visit is 4
5-day window starting 2020-08-16, ending 2020-08-20, total visit is 2
5-day window starting 2020-08-17, ending 2020-08-21, total visit is 2
etc.
Max count of all sliding window is 4. Therefore cust1 fits the criteria "having visited at least 3 times in 5 days"
This seems to be a costly operation.
How would you implement this efficiently ? Any other idea is welcome.
You can convert the
datetime
column tolong
and pass in the number of seconds equivalent to 5 days in the rangeBetween() function.To get the maximum number of 5-day visits for each customer, you can do: