Find Longest Continuous Streak In Spark

240 Views Asked by At

I have a single column dataframe like this

------------
   date     
------------
01/01/2020       
02/01/2020  
04/01/2020    
05/01/2020    
06/01/2020 

I have to get the longest continuous period the start date and the end date . So in the above example I have a output like this

-----------------------------------------------
start       |   end           |  period_length |
-----------------------------------------------
04/01/2020    06/01/2020             3         

My approach: Sort the data and find the lag with the previous row and whenever there is a lag > 1 , reset the period length But I am unable to figure out a way to reset the period on a particular condition. I am using spark 2.3

1

There are 1 best solutions below

0
On

Note: My column name is "eventTime" like "2020-12-14 13:49:32"

  sc.sql(
  """
    |
    |   select
    |     min(eventTime), max(eventTime) ,  count(1)  as counts
    |   from
    |   (
    |       select
    |           eventTime , date_sub(eventTime , rn) as dis
    |       from
    |       (
    |           select
    |               eventTime , row_number() over(partition by 1 order by eventTime) rn
    |           from (select distinct substring(eventTime,0,10) as eventTime from ST_INOUT_RECORD)
    |       ) t1
    |   ) t2
    |   group by dis  having counts > 2
    |
    |""".stripMargin).show()

Result

|min(eventTime)|max(eventTime)|counts|
+--------------+--------------+------+
|    2020-09-12|    2020-12-14|    94|
|    2020-01-01|    2020-09-10|   254|
+--------------+--------------+------+