I have a dataset with group_ids. I want to attach unique ids to groups that are consistently present in my dataset, meaning they disappear for at most 5 seconds. If a group disappears for more than 5 seconds, it should get a new cumulative id; if it stays present or only disappears for up to 5 seconds, it should keep the same cumulative id.

Here is my dataset:

    group_id    dt2         expect_id
0   nan 2023-11-28 17:43:09.900628  1
1   1   2023-11-28 17:43:11.322793  2
2   1   2023-11-28 17:43:12.660818  2
3   1   2023-11-28 17:43:14.119043  2
4   1   2023-11-28 17:43:15.550513  2
5   2   2023-11-28 17:43:15.550513  3
6   3   2023-11-28 17:43:15.550513  4
7   4   2023-11-28 17:43:15.550513  5
8   1   2023-11-28 17:43:16.973557  2
9   2   2023-11-28 17:43:16.973557  3
10  3   2023-11-28 17:43:16.973557  4
11  4   2023-11-28 17:43:16.973557  5
12  1   2023-11-28 17:43:18.335619  2
13  2   2023-11-28 17:43:18.335619  3
14  3   2023-11-28 17:43:18.335619  4
15  4   2023-11-28 17:43:18.335619  5
16  1   2023-11-28 17:43:19.738230  2
17  2   2023-11-28 17:43:19.738230  3
18  3   2023-11-28 17:43:19.738230  4
19  4   2023-11-28 17:43:19.738230  5
20  1   2023-11-28 17:43:21.110693  2
21  2   2023-11-28 17:43:21.110693  3
22  1   2023-11-28 17:43:22.571257  2
23  2   2023-11-28 17:43:22.571257  3
24  1   2023-11-28 17:43:24.000589  2
25  1   2023-11-28 17:43:25.429940  2
26  2   2023-11-28 17:43:25.429940  3
27  1   2023-11-28 17:43:26.851142  2
28  2   2023-11-28 17:43:26.851142  3
29  1   2023-11-28 17:43:28.256274  2
30  nan 2023-11-28 17:43:29.617541  4
31  nan 2023-11-28 17:43:30.974490  4
32  nan 2023-11-28 17:43:32.360739  4
33  1   2023-11-28 17:43:33.730457  5
34  2   2023-11-28 17:43:33.730457  6
35  1   2023-11-28 17:43:35.270380  5
36  2   2023-11-28 17:43:35.270380  6
37  1   2023-11-28 17:43:35.270380  5
38  2   2023-11-28 17:43:36.730457  6
39  1   2023-11-28 17:43:36.270380  5

I tried this cumsum() method to create my "unique_id" column:

df['group_id'] = (df['group_id'].eq(0) | (df['group_id'] != df['group_id'].shift())).cumsum()

This is the right way forward, but it attaches a new cumulative value to my groups even when they are still present based on the datetime column. Is there a way to implement logic that only attaches a new cumulative value as unique_id if the group_id has been absent for more than 5 seconds?

Answer:

For demonstration, 10 seconds are added to the timestamps in rows 12 and 20 so that real gaps exist. The created column 'delta' holds the row-to-row time difference in seconds.

import pandas as pd

# parse the timestamp column
df['dt2'] = pd.to_datetime(df['dt2'], errors='raise')

# for demonstration, insert artificial 10-second gaps at rows 12 and 20
df.loc[[12, 20], 'dt2'] += pd.to_timedelta(10, unit='s')

# row-to-row difference in seconds (avoids the need to import numpy)
df['delta'] = df['dt2'].diff().dt.total_seconds()

# every gap of 5 seconds or more starts a new block
df['test'] = (df['delta'] >= 5).cumsum()

print(df)
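Since the snippet above assumes `df` already exists, here is a self-contained sketch of the same gap-detection idea on hypothetical sample data (the timestamps and the 5-second threshold are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: four timestamps with one gap longer than 5 seconds
df = pd.DataFrame({
    'dt2': pd.to_datetime([
        '2023-11-28 17:43:11', '2023-11-28 17:43:13',
        '2023-11-28 17:43:20', '2023-11-28 17:43:22',
    ])
})

# Row-to-row difference in seconds
df['delta'] = df['dt2'].diff().dt.total_seconds()

# Each gap of 5 seconds or more starts a new block
df['test'] = (df['delta'] >= 5).cumsum()
print(df)
```

The 7-second jump between the second and third rows flips the comparison to True once, so the cumulative sum steps from 0 to 1 there and stays at 1 afterwards.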

Update: 01.01.2023

This is closer to what you want. My telepathic abilities are not that strong, though: it is still unclear what should happen on the rows where group_id is NaN.

# new block whenever group_id resets to 1 after a higher value
df['test'] = ((df['group_id'] == 1) & (df['group_id'] < df['group_id'].shift())).cumsum()
# True whenever the group_id differs from the previous row
df['boolean'] = df['group_id'].eq(0) | (df['group_id'] != df['group_id'].shift())
# running count of changes within each block
df['cs'] = df.groupby('test')['boolean'].transform('cumsum')
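Putting both ideas together for the original requirement, a per-group sketch might look like the following. The sample data is hypothetical and the 5-second threshold comes from the question; the key step is computing the time gap per group_id rather than per row:

```python
import pandas as pd

# Hypothetical sample: group 1 appears, both groups vanish for 8 s, then return
df = pd.DataFrame({
    'group_id': [1, 1, 2, 1, 2],
    'dt2': pd.to_datetime([
        '2023-11-28 17:43:11', '2023-11-28 17:43:13',
        '2023-11-28 17:43:13', '2023-11-28 17:43:21',
        '2023-11-28 17:43:21',
    ]),
})

# Seconds since the same group_id was last seen
gap = df.groupby('group_id')['dt2'].diff().dt.total_seconds()

# A new id starts when a group_id is seen for the first time
# or reappears after more than 5 seconds
new_block = gap.isna() | (gap > 5)
df['unique_id'] = new_block.cumsum()
print(df)
```

Here group 1 keeps its id across the 2-second gap but gets a fresh cumulative id after the 8-second absence, and the same applies to group 2, so the ids come out as 1, 1, 2, 3, 4.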