Assign a value to a Pandas DataFrame column, based on multiple conditions, fastest ways?

52 Views Asked by At

I am aware there are some other similar questions but I haven't found a timeit test with a fairly large dataset. I have a ~1.6M rows DF , and I want to use the fastest way to assign a column ['stag'] a 0-1 value depending on the index weekday and hour.

Is there a faster approach I am missing ? suggestions ?

import timeit
import statistics

...
...

df_backtest['weekday'] = df_backtest.index.weekday

def approach_1():
    df_backtest["stag"]=1
    df_backtest.loc[(df_backtest.index.hour>=2) & (df_backtest.index.hour<=10),"stag" ]=0
    df_backtest.loc[df_backtest.index.strftime('%w')==6,"stag" ]=0
    df_backtest.loc[df_backtest.index.strftime('%w')==0,"stag" ]=0


def approach_2():
    df_backtest['stag'] = 1
    df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
    df_backtest.loc[df_backtest.index.strftime('%w').isin(['6', '0']), 'stag'] = 0
    

def approach_3():
    df_backtest['stag'] = 1
    df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
    df_backtest.loc[df_backtest['weekday'].isin(['6', '0']), 'stag'] = 0


def approach_4():
    df_backtest['stag'] = 1
    df_backtest['stag'] = df_backtest['stag'].where((df_backtest.index.hour < 2) | (df_backtest.index.hour > 10))
    df_backtest['stag'] = df_backtest['stag'].where(~df_backtest['weekday'].isin(['6', '0']))

num_repeats = 10
num_loops = 5

print(f'Approach 1: {statistics.mean(timeit.repeat(approach_1, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 2: {statistics.mean(timeit.repeat(approach_2, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 3: {statistics.mean(timeit.repeat(approach_3, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 4: {statistics.mean(timeit.repeat(approach_4, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print('Shape of DF:',df_backtest.shape)

output:

Approach 1: 4.617 seconds per loop
Approach 2: 2.256 seconds per loop
Approach 3: 0.087 seconds per loop
Approach 4: 0.106 seconds per loop
Shape of DF: (1605144, 7)

Hence, it seems approach_3 is the fastest, I assume because it does not use the overhead of 'where' nor managing strings. Any other approach is welcome. Thanks

0

There are 0 best solutions below