How to count consecutive days an event happens?

360 Views Asked by At

I need to calculate the number of consecutive days from today (2022-01-04) backwards a client logged in my application. I need to use pyspark due to the size of my database

Input

Name       Date         
    John    2022-01-01  
    John    2022-01-01
    Mary    2022-01-01
    Steve   2022-01-03
    Mary    2022-01-03
    John    2022-01-02
    John    2022-01-03
    Mary    2022-01-04
    John    2022-01-04

Output

Name       consecutive_days          
    John         4
    Mary         2 
2

There are 2 best solutions below

2
On

Maybe you could do something like this,

from datetime import date

d1 = date(2022, 2, 6)
d2 = date(2022, 2, 16)
X = d2 - d1
print("Consecutive Days = ",X.days)
0
On

You can calculate the number of days between to dates using datediff and lag. Then count the dates that have the difference as 1. Add a 1 to that to get your number of consecutive days.

data_sdf. \
    select('name', 'dt'). \
    dropDuplicates(). \
    withColumn('gap', 
               func.datediff('dt', func.lag('dt').over(wd.partitionBy('name').orderBy('dt')))
               ). \
    filter(func.col('gap') == 1). \
    groupBy('name'). \
    agg((func.sum('gap') + 1).alias('consecutive_days')). \
    show()

# +----+----------------+
# |name|consecutive_days|
# +----+----------------+
# |Mary|               2|
# |John|               4|
# +----+----------------+