Efficiently Iterate Through Datetime Index to Set Time Boundaries

Question

Efficiently Iterate Through Datetime Index to Set Time Boundaries

56 Views Asked by Vanillihoot At 02 January 2023 at 20:44

I have a dataframe df:

                       col1     col2    col3
2020-01-02 08:50:00    360.0    -131.0  -943.0
2020-01-02 08:52:01    342.0    -130.0  -1006.0
2020-01-02 08:55:04    321.0    -130.0  -997.0
... ... ... ...
2022-01-03 14:44:56    1375.0   -91.0   -728.0
2022-01-03 14:50:57    1381.0   -118.0  -692.0
2022-01-03 14:50:58    1382.0   -115.0  -697.0
2022-01-03 14:50:59    1390.0   -111.0  -684.0
2022-01-03 14:55:58    1442.0   -106.0  -691.0

I want a function that obtains the indices that:

Are NOT within a specific time (e.g., 5 minutes) of each other.

For example:

masked_df = time_mask(df.index, pd.Timedelta(minutes=5))

masked_df:

                       col1     col2    col3
2020-01-02 08:50:00    360.0    -131.0  -943.0
2020-01-02 08:55:04    321.0    -130.0  -997.0
... ... ... ...
2022-01-03 14:44:56    1375.0   -91.0   -728.0
2022-01-03 14:50:57    1381.0   -118.0  -692.0
2022-01-03 14:55:58    1442.0   -106.0  -691.0

The function time_mask should obtain the first index that is not within 5 minutes of the previously added index. Below is my iterative attempt to solve this problem:

def get_clean_ix_from_rolling(idx, time_delt):
    
    clean_ix = []
    prev_ix = idx[0]
    clean_ix.append(prev_ix)
    for i, x in enumerate(idx):
        if((x-prev_ix) >= time_delt):
            clean_ix.append(x)
            prev_ix = x

    ix = pd.to_datetime(clean_ix)
    return ix

How can I speed up my code above?

Original Q&A

There are 1 best solutions below

**ali bakhtiari** · Answer 1 · 2023-01-02T23:00:23.943000

Shift your index by one row using .shift() and subtract it by the value of next value using .sub(). Get the minute difference using astype and check if it is equal to your time_delta using .eq(). Finally mask the index and get the results:

import pandas as pd

# Create some sample data
df = pd.DataFrame(index=["2020-01-02 08:50:00",
                           "2020-01-02 08:55:04",
                           "2022-01-03 14:44:56",
                           "2022-01-03 14:50:57",
                           "2022-01-03 14:55:57",
                           "2022-01-03 14:56:57",
                           "2022-01-03 14:57:57",
                           "2022-01-03 14:58:57",
                           "2022-01-03 14:59:57"])
                                                      
df.index = pd.to_datetime(df.index)
df

Output:

2020-01-02 08:50:00
2020-01-02 08:55:04
2022-01-03 14:44:56
2022-01-03 14:50:57
2022-01-03 14:55:57
2022-01-03 14:56:57
2022-01-03 14:57:57
2022-01-03 14:58:57
2022-01-03 14:59:57

time_delta = 5
idx = df.index.to_series()
s = idx.shift(-1).sub(idx).astype('timedelta64[m]').eq(time_delta)
df[~s]

Output:

2020-01-02 08:55:04
2022-01-03 14:44:56
2022-01-03 14:55:57
2022-01-03 14:56:57
2022-01-03 14:57:57
2022-01-03 14:58:57
2022-01-03 14:59:57

Edit:

I doubt that you can do what you described in the comments without an explicit for loop, in which case you'd better use itertools() or itertuple(). They will be faster, but I don't recommend using them either. Consider a change in approach like the following:

import pandas as pd
import numpy as np
# Create some sample data
dates = ["2022-01-02 08:50:00",
                           "2022-01-02 08:55:04",
                           "2022-01-03 14:44:56",
                           "2022-01-03 14:50:57",
                           "2022-01-03 14:55:57",
                           "2022-01-03 15:00:57",
                           "2022-01-03 15:02:57",
                           "2022-01-03 15:05:57",
                           "2022-01-03 15:08:57",
                           "2022-01-03 15:10:57"]
df = pd.DataFrame(index=pd.to_datetime(dates),
                  data=np.random.randint(low=-100, high=100, size=(10, 3))).add_prefix('col')
 df

Output:

                 col0 col1 col2
2022-01-02 08:50:00 1   43  71
2022-01-02 08:55:04 42  86  20
2022-01-03 14:44:56 -47 -97 64
2022-01-03 14:50:57 -5  76  59
2022-01-03 14:55:57 67  98  -36
2022-01-03 15:00:57 -44 85  33
2022-01-03 15:02:57 -8  14  67
2022-01-03 15:05:57 85  -84 -83
2022-01-03 15:08:57 -86 46  81
2022-01-03 15:10:57 -77 29  -63

df.resample('5min').agg('first').dropna(axis=0)

Output:

                    col0    col1    col2
2022-01-02 08:50:00 1.0     43.0    71.0
2022-01-02 08:55:00 42.0    86.0    20.0
2022-01-03 14:40:00 -47.0   -97.0   64.0
2022-01-03 14:50:00 -5.0    76.0    59.0
2022-01-03 14:55:00 67.0    98.0    -36.0
2022-01-03 15:00:00 -44.0   85.0    33.0
2022-01-03 15:05:00 85.0    -84.0   -83.0
2022-01-03 15:10:00 -77.0   29.0    -63.0

Efficiently Iterate Through Datetime Index to Set Time Boundaries

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in VECTORIZATION

Related Questions in TIMEDELTA

Related Questions in DATETIMEINDEX

Trending Questions

Popular # Hahtags

Popular Questions