Question:
Hi,
When searching for methods to make a selection of a dataframe (being relatively unexperienced with Pandas), I had the following question:
What is faster for large datasets - .isin() or .query()?
Query is somewhat more intuitive to read, so my preferred approach due to my line of work. However, testing it on a very small example dataset, query seems to be much slower.
Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Thanks a lot in advance for your help!
Code:
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
'owner': ['Canyon', 'Endurace', 'Bike']},
index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")
%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The best test in real data, here fast comparison for 3k, 300k,3M rows with this sample data:
If check docs:
Conclusion - The best test in real data, because depends of number of rows, number of matched values and also by length of list
selection
.