Group consecutive rows using Spark Scala with rows repeating

+--------+--------+---------+-------------------------+-------------------------+
|space_id|template|frequency|day                      |timestamp                |
+--------+--------+---------+-------------------------+-------------------------+
|321d8   |temp    |15       |2023-02-22T00:00:00+05:30|2023-02-22T09:00:00+05:30|
|321d8   |temp    |15       |2023-02-22T00:00:00+05:30|2023-02-22T09:15:00+05:30|
|321d8   |temp    |15       |2023-02-22T00:00:00+05:30|2023-02-22T09:30:00+05:30|
|321d8   |temp    |15       |2023-02-22T00:00:00+05:30|2023-02-22T09:45:00+05:30|
|321d8   |temp    |15       |2023-02-22T00:00:00+05:30|2023-02-22T10:00:00+05:30|
+--------+--------+---------+-------------------------+-------------------------+

Here I have a unique id (space_id), a template column (which may be temperature, humidity, or CO2), a frequency column that tells the frequency at which I receive data from the sensor, a day column, and finally a timestamp column. I need to group the data into 30-minute batches according to the timestamp.

I am able to form 30-minute batches such as 09:00:00, 09:15:00, 09:30:00 in one batch, then 09:30:00, 09:45:00, 10:00:00 in the next, and so on. But what I need are overlapping slots, one starting at every timestamp: 09:00:00, 09:15:00, 09:30:00; then 09:15:00, 09:30:00, 09:45:00; then 09:30:00, 09:45:00, 10:00:00; and so on, i.e. a 30-minute slot for each timestamp value. In simple words, from the table above I need groups of rows (1,2,3), rows (2,3,4), rows (3,4,5), and so on.
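
For reference, below is a minimal Scala sketch of the kind of fixed 30-minute batching I can already produce, using Spark's built-in window function; the DataFrame name df and the output column batch_timestamps are illustrative only, and this does not give the overlapping per-row slots I need.

import org.apache.spark.sql.functions.{col, collect_list, window}

// Sketch only: fixed 30-minute batches per space_id using the built-in
// window() function; df and batch_timestamps are illustrative names.
val fixedBatches = df
  .groupBy(col("space_id"), window(col("timestamp"), "30 minutes"))
  .agg(collect_list(col("timestamp")).as("batch_timestamps"))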

There is 1 solution below.

The window setting you're looking for is:

from pyspark.sql import Window

# Frame covering the current row plus the next two rows, ordered by timestamp
# within each space_id, i.e. one sliding 30-minute slot per row at a
# 15-minute frequency.
w = Window.partitionBy('space_id').orderBy('timestamp').rowsBetween(Window.currentRow, Window.currentRow + 2)
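
Since the question asks for Scala, here is a hedged Scala sketch of the same window, plus one possible way to use it: collecting the current row's timestamp and the next two into a per-row slot. The DataFrame name df and the column name slot_timestamps are assumptions for illustration.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list}

// Frame covering the current row and the next two rows, ordered by timestamp
// within each space_id.
val w = Window
  .partitionBy("space_id")
  .orderBy("timestamp")
  .rowsBetween(Window.currentRow, Window.currentRow + 2)

// Each row now carries its own 30-minute slot: rows (1,2,3), (2,3,4), (3,4,5), ...
val slots = df.withColumn("slot_timestamps", collect_list(col("timestamp")).over(w))

Note that the last two rows in each partition get slots with fewer than three timestamps, since there are not enough following rows to fill the frame; filter on the list size if only complete slots are wanted.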