I have a huge DataFrame (~4 million rows) and I need to search it for a row with specific column values, about a million times. Based on the conditions governing my problem, there is exactly one true answer (one row) for each query, so as soon as the search finds the first result there is no need to continue. But as we know, df.loc[df['column']==value] has to read and evaluate all the data every time: even if the first row satisfies the search condition, the other 4 million rows are still read and evaluated, which creates a huge overhead. Is there a way to get the first row satisfying the search conditions without reading and evaluating the rest of the rows?
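For concreteness, a minimal sketch of what I am doing now (the column name and values are illustrative):

import pandas as pd

df = pd.DataFrame({'key': range(4_000_000), 'value': range(4_000_000)})
queries = [12345, 67890]  # in reality about a million of these

for q in queries:
    # Scans all ~4 million rows on every query,
    # even though exactly one row can match.
    match = df.loc[df['key'] == q]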


First, set that column as the index (since, as you said, its values are not duplicated). Then convert the DataFrame to a dictionary and look your values up there: a dict lookup is O(1) per query, so no query has to scan the whole frame.
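A minimal sketch of that recipe on a toy frame (the frame and column names are just for illustration):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [4, 5, 6]})

# One O(n) pass to build the lookup dict
# (equivalently: df.set_index('key').to_dict('index')).
lookup = df.set_index('key').T.to_dict()

lookup['b']  # O(1) per query -> {'x': 2, 'y': 5}

Building the dict costs one full pass, but after that every query is a hash lookup instead of a full-column scan. Here is the same idea benchmarked on a small and a big frame: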

In [1]: import numpy as np, pandas as pd
   ...: 
   ...: np.random.seed(4)
   ...: h = 100
   ...: small_df = pd.DataFrame(np.random.randint(1,1000000,h).reshape(h//4,4))
   ...: small_df = small_df.set_index(3)
   ...: small_df.index = small_df.index.astype(str)
   ...: small_df = small_df[~small_df.index.duplicated(keep='first')]
   ...: small_df = small_df.T.to_dict()
   ...: 
   ...: 
   ...: np.random.seed(4)
   ...: h = h*100000
   ...: big_df = pd.DataFrame(np.random.randint(1,1000000000,h).reshape(h//4,4))
   ...: big_df = big_df.set_index(3)
   ...: big_df.index = big_df.index.astype(str)
   ...: big_df = big_df.T.to_dict()
/home/amir/.local/bin/ipython3:17: UserWarning: DataFrame columns are not unique, some columns will be omitted.
In [2]: len(small_df)
Out[2]: 25

In [3]: len(big_df)
Out[3]: 2496856

In [6]: %time small_df['890932']
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
Out[6]: {0: 962341, 1: 751580, 2: 181308}

In [7]: %time big_df['115865608']
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.39 µs
Out[7]: {0: 448609773, 1: 372731489, 2: 452798904}
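Note the UserWarning above: big_df's index was not deduplicated before the transpose, which is why len(big_df) comes out at 2496856 rather than 2500000 (rows with repeated keys become duplicate columns after .T, and pandas omits them). To keep only the first row per key before converting, the same idiom used for small_df works:

big_df = big_df[~big_df.index.duplicated(keep='first')]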