Slow WHERE ... IN SQL query using pandas (Python)


I am trying to speed up a sqlite3 query that is currently quite slow.

I have a table of ~22 million rows with two columns (uid, info).

In Python, I then build a list of ~10,000 uid values that correspond to uids in the table above (randomly generated numbers for this post).

import numpy as np
import pandas as pd
import sqlite3

conn = sqlite3.connect('mydb.db')

# ~10,000 random integer uids in the range 0..2,000,000 (stand-in for the real list)
uids = np.random.random(10000)
uids = list(map(int, map(round, uids * 2000000)))

# fetch every row whose uid appears in the list
sql_query = 'SELECT * FROM mytable WHERE uid IN (' + ','.join(map(str, uids)) + ')'
sqlpd = pd.read_sql_query(sql_query, conn)

The query works, but it is quite slow (~15 seconds). How can I speed this up? I am guessing the WHERE ... IN approach is causing the issue.
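One way to check that guess is to ask SQLite for the query plan; a plain SCAN of the table (rather than a SEARCH using an index) means every row is being compared against the IN list. A quick diagnostic sketch, reusing conn and sql_query from the snippet above:

# 'SCAN mytable' in the output indicates a full-table scan,
# while 'SEARCH mytable USING ... INDEX' indicates an index is being used.
for row in conn.execute('EXPLAIN QUERY PLAN ' + sql_query):
    print(row)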


2 Answers

BEST ANSWER

Not sure why, but I dropped the table and rebuilt it with the same code/data. Now it is running in 0.1 seconds.

Problem solved, not sure why :/
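If the slowdown ever comes back, one thing worth checking is whether uid is indexed; an index on uid is what normally makes a WHERE ... IN lookup over ~22 million rows fast. A minimal sketch (the index name idx_mytable_uid is just a placeholder):

import sqlite3

conn = sqlite3.connect('mydb.db')

# List existing indexes on the table; an empty list means IN lookups must scan all rows.
print(conn.execute("PRAGMA index_list('mytable')").fetchall())

# Create an index on uid if one is missing (index name is a placeholder).
conn.execute('CREATE INDEX IF NOT EXISTS idx_mytable_uid ON mytable (uid)')
conn.commit()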

ANSWER

Implementing @Aprillion's idea #1:

# write the uid list to a temporary lookup table in the same database
pd.DataFrame({'uid': uids}).to_sql('tmp', conn, index=False, if_exists='replace')

Now we can do the lookup with a join instead of a huge IN list:

sql_query = 'SELECT a.* FROM mytable a JOIN tmp b ON a.uid = b.uid'
sqlpd = pd.read_sql_query(sql_query, conn)
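For very large uid lists, indexing tmp before running the join may also help, and the helper table can be dropped once the result has been read into the DataFrame. A sketch of those optional extra steps, continuing from the code above (the index name is a placeholder):

# Optional: index the lookup table before the join so SQLite can use it for matching.
conn.execute('CREATE INDEX IF NOT EXISTS idx_tmp_uid ON tmp (uid)')

# Optional: drop the helper table once the result has been read.
conn.execute('DROP TABLE IF EXISTS tmp')
conn.commit()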