Is there any way to improve to_sql performance in Python?


I have a loop that goes through each file (xlsx and csv) in a directory, reads each one, and appends them to a list, which is then concatenated into a single DataFrame and sent to a database. It works, but when it comes to bigger files it can take a lot of time. I figure this is the part where I need to improve my code:

        xls.to_sql(table, con=engine, if_exists='append', index=False)
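
For context, here is a simplified sketch of what the loop does (the directory, file handling, and connection details are placeholders, not the exact code):

    import glob

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("...")   # placeholder connection string
    table = "my_table"              # placeholder table name

    frames = []
    for path in glob.glob("my_directory/*"):
        # Each file is read into a DataFrame and collected in a list
        if path.endswith(".xlsx"):
            frames.append(pd.read_excel(path))
        elif path.endswith(".csv"):
            frames.append(pd.read_csv(path))

    # The list is concatenated and pushed to the database in one call
    xls = pd.concat(frames, ignore_index=True)
    xls.to_sql(table, con=engine, if_exists='append', index=False)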

I've been testing different chunksizes, but how can I choose the optimal chunksize when the files come in all kinds of sizes?

I've also been studying parallelization and was wondering if anyone could help me figure out how to improve this.
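
For example, something like this is what I had in mind for the reading step (just a sketch; the directory and worker count are placeholders), though it doesn't touch the to_sql call itself:

    from concurrent.futures import ProcessPoolExecutor
    import glob

    import pandas as pd

    def read_file(path):
        # Pick the reader based on the file extension
        if path.endswith(".xlsx"):
            return pd.read_excel(path)
        return pd.read_csv(path)

    if __name__ == "__main__":
        paths = glob.glob("my_directory/*")   # placeholder directory

        # Read the files in parallel processes, then concatenate as before
        with ProcessPoolExecutor(max_workers=4) as pool:
            frames = list(pool.map(read_file, paths))

        xls = pd.concat(frames, ignore_index=True)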

EDIT: By the way, I've also followed this blog post to improve my database connection, and I noticed a lot of performance improvements. But to_sql is still the slow part.

https://towardsdatascience.com/dramatically-improve-your-database-inserts-with-a-simple-upgrade-6dfa672f1424
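
For anyone finding this later: as far as I understand it, the main connection-level change from that post is enabling fast_executemany on the engine (this assumes SQL Server through pyodbc; the connection string below is a placeholder):

    from sqlalchemy import create_engine

    # fast_executemany=True tells the mssql+pyodbc dialect to send batched
    # parameterized inserts instead of one round trip per row
    engine = create_engine(
        "mssql+pyodbc://user:password@my_dsn",   # placeholder connection string
        fast_executemany=True,
    )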

1 Answer

NYC Coder:

This is what we did in our project:

chunk_size = 2100 // len(df.columns) - 1

And we used method = 'multi' in to_sql as well. This has worked out pretty well for us: method = 'multi' speeds things up quite a bit because it inserts multiple rows per INSERT statement instead of one statement per row.
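
Putting it together, the call looks roughly like this (the 2100 figure is SQL Server's per-statement parameter limit, so adjust it for other databases; the table name and engine are placeholders):

    # Each row consumes len(df.columns) parameters when method='multi'
    # builds the multi-row INSERT, so cap the chunk size accordingly
    chunk_size = 2100 // len(df.columns) - 1

    df.to_sql(
        'my_table',              # placeholder table name
        con=engine,
        if_exists='append',
        index=False,
        chunksize=chunk_size,
        method='multi',          # multiple rows per INSERT statement
    )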