I have a dataframe df:
col1 col2 col3
2020-01-02 08:50:00 360.0 -131.0 -943.0
2020-01-02 08:52:01 342.0 -130.0 -1006.0
2020-01-02 08:55:04 321.0 -130.0 -997.0
... ... ... ...
2022-01-03 14:44:56 1375.0 -91.0 -728.0
2022-01-03 14:50:57 1381.0 -118.0 -692.0
2022-01-03 14:50:58 1382.0 -115.0 -697.0
2022-01-03 14:50:59 1390.0 -111.0 -684.0
2022-01-03 14:55:58 1442.0 -106.0 -691.0
I want a function that returns the indices that are NOT within a specific time (e.g., 5 minutes) of each other.
For example:
masked_df = time_mask(df.index, pd.Timedelta(minutes=5))
masked_df:
col1 col2 col3
2020-01-02 08:50:00 360.0 -131.0 -943.0
2020-01-02 08:55:04 321.0 -130.0 -997.0
... ... ... ...
2022-01-03 14:44:56 1375.0 -91.0 -728.0
2022-01-03 14:50:57 1381.0 -118.0 -692.0
2022-01-03 14:55:58 1442.0 -106.0 -691.0
The function time_mask should obtain the first index that is not within 5 minutes of the previously added index. Below is my iterative attempt to solve this problem:
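import pandas as pd

def get_clean_ix_from_rolling(idx, time_delt):
    # Always keep the first timestamp, then keep each later timestamp that is
    # at least time_delt after the most recently kept one.
    clean_ix = []
    prev_ix = idx[0]
    clean_ix.append(prev_ix)
    for x in idx:
        if (x - prev_ix) >= time_delt:
            clean_ix.append(x)
            prev_ix = x
    ix = pd.to_datetime(clean_ix)
    return ix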
How can I speed up my code above?
Shift your index by one row using .shift() and take the difference between it and the next value using .sub(). Get the difference in minutes using astype and check whether it equals your time_delta using .eq(). Finally, mask the index to get the result.
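As a rough sketch of the shift-and-subtract idea described above (not the answer's original code): the helper name consecutive_gap_mask is hypothetical, and two details are assumptions swapped in for illustration. The gap is compared directly against a pd.Timedelta with .ge() instead of converting to minutes with astype and testing equality with .eq(), and the first row is always kept.

```python
import pandas as pd

def consecutive_gap_mask(idx, time_delt):
    """Return a boolean array: True where the gap to the previous row is >= time_delt.

    consecutive_gap_mask is a hypothetical helper name, not part of pandas.
    Assumes idx is a non-empty DatetimeIndex.
    """
    s = idx.to_series()
    # Difference between each timestamp and the one before it (NaT for the first row).
    gaps = s.sub(s.shift())
    # Comparisons against NaT are False, so the first row is switched on explicitly.
    mask = gaps.ge(time_delt)
    mask.iloc[0] = True
    return mask.to_numpy()

# The last five timestamps from the question's dataframe.
idx = pd.to_datetime([
    "2022-01-03 14:44:56",
    "2022-01-03 14:50:57",
    "2022-01-03 14:50:58",
    "2022-01-03 14:50:59",
    "2022-01-03 14:55:58",
])
mask = consecutive_gap_mask(idx, pd.Timedelta(minutes=5))
kept = idx[mask]
```

Note that this compares each row only with the row directly before it. On the sample above it keeps 14:44:56 and 14:50:57 but drops 14:55:58, because that row is only 4 minutes 59 seconds after 14:50:59, whereas the loop in the question keeps it because it is more than 5 minutes after the last kept row (14:50:57).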
Edit:
I doubt that you can do what you described in the comments without an explicit for loop, in which case you'd better use itertools or itertuples(). They will be faster, but I don't recommend using them either. Consider a change in approach instead.
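One possible change in approach, offered only as a sketch and not necessarily what this answer had in mind: keep an explicit loop, but let numpy.searchsorted jump straight to the next timestamp that is far enough away, so the loop runs once per kept row rather than once per row. This assumes the index is sorted in ascending order; the function name time_mask_searchsorted is hypothetical.

```python
import numpy as np
import pandas as pd

def time_mask_searchsorted(idx, time_delt):
    """Keep the first timestamp, then each timestamp >= time_delt after the last kept one.

    Hypothetical helper; assumes idx is a DatetimeIndex sorted in ascending order.
    """
    if pd.Timedelta(time_delt) <= pd.Timedelta(0):
        raise ValueError("time_delt must be positive")
    values = idx.to_numpy()
    step = pd.Timedelta(time_delt).to_timedelta64()
    keep = []
    pos = 0
    while pos < len(values):
        keep.append(pos)
        # Binary search for the first position whose timestamp is >= last kept + step.
        pos = int(np.searchsorted(values, values[pos] + step, side="left"))
    return idx[np.asarray(keep, dtype=np.intp)]

# The timestamps shown in the question's dataframe.
idx = pd.to_datetime([
    "2020-01-02 08:50:00",
    "2020-01-02 08:52:01",
    "2020-01-02 08:55:04",
    "2022-01-03 14:44:56",
    "2022-01-03 14:50:57",
    "2022-01-03 14:50:58",
    "2022-01-03 14:50:59",
    "2022-01-03 14:55:58",
])
clean_ix = time_mask_searchsorted(idx, pd.Timedelta(minutes=5))
```

On these rows the result matches the masked_df index shown in the question, and the rows themselves can then be selected with df.loc[clean_ix]. The gain depends on how many rows are skipped between kept timestamps; if nearly every row is kept, it will not be much faster than the plain loop.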