I have a huge DataFrame (~4 million rows) and I need to search it for a row with specific column values, about a million times. Based on the conditions governing my problem, there is exactly one true answer (one row) for each query, so as soon as the search finds the first result there is no need to continue. But as we know, df.loc[df['column']==value] has to read and evaluate all the data every time: even if the first row satisfies the search condition, the other 4 million rows are still read and evaluated, which creates a huge overhead. Is there a way to get the first row satisfying the search conditions without reading and evaluating the rest of the rows?
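For concreteness, a minimal sketch of what I am doing now (the column name and values are illustrative):

import pandas as pd

df = pd.DataFrame({'key': range(4_000_000), 'value': range(4_000_000)})
queries = [12345, 67890]  # in reality about a million of these

for q in queries:
    # Scans all ~4 million rows on every query,
    # even though exactly one row can match.
    match = df.loc[df['key'] == q]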


First, set that column as the index (since, as you said, its values are not duplicated). Then convert the DataFrame to a dictionary and look your values up there: a dict lookup is O(1) per query, so no query has to scan the whole frame.
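A minimal sketch of that recipe on a toy frame (the frame and column names are just for illustration):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [4, 5, 6]})

# One O(n) pass to build the lookup dict
# (equivalently: df.set_index('key').to_dict('index')).
lookup = df.set_index('key').T.to_dict()

lookup['b']  # O(1) per query -> {'x': 2, 'y': 5}

Building the dict costs one full pass, but after that every query is a hash lookup instead of a full-column scan. Here is the same idea benchmarked on a small and a big frame: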

In [1]: import numpy as np, pandas as pd
   ...: 
   ...: np.random.seed(4)
   ...: h = 100
   ...: small_df = pd.DataFrame(np.random.randint(1,1000000,h).reshape(h//4,4))
   ...: small_df = small_df.set_index(3)
   ...: small_df.index = small_df.index.astype(str)
   ...: small_df = small_df[~small_df.index.duplicated(keep='first')]
   ...: small_df = small_df.T.to_dict()
   ...: 
   ...: 
   ...: np.random.seed(4)
   ...: h = h*100000
   ...: big_df = pd.DataFrame(np.random.randint(1,1000000000,h).reshape(h//4,4))
   ...: big_df = big_df.set_index(3)
   ...: big_df.index = big_df.index.astype(str)
   ...: big_df = big_df.T.to_dict()
/home/amir/.local/bin/ipython3:17: UserWarning: DataFrame columns are not unique, some columns will be omitted.
In [2]: len(small_df)
Out[2]: 25

In [3]: len(big_df)
Out[3]: 2496856

In [6]: %time small_df['890932']
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
Out[6]: {0: 962341, 1: 751580, 2: 181308}

In [7]: %time big_df['115865608']
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.39 µs
Out[7]: {0: 448609773, 1: 372731489, 2: 452798904}
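Note the UserWarning above: big_df's index was not deduplicated before the transpose, which is why len(big_df) comes out at 2496856 rather than 2500000 (rows with repeated keys become duplicate columns after .T, and pandas omits them). To keep only the first row per key before converting, the same idiom used for small_df works:

big_df = big_df[~big_df.index.duplicated(keep='first')]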