I have a very large pandas DataFrame (~9 million rows, 56 columns) that I'm trying to load into an MSSQL table using DataFrame.to_sql(). Importing the whole DataFrame in a single statement often fails with memory-related errors.
To work around this, I'm looping through the DataFrame in batches of 100k rows and importing one batch at a time. This way I no longer get any errors, but the code slows down dramatically after about 5.8 million records. The code I'm using:
import datetime

# df, engine and dtypes are defined earlier
maxrow = df.shape[0]
stepsize = 100000

for i in range(0, maxrow, stepsize):
    batchstart = datetime.datetime.now()
    # Replace the table on the first batch, append on all subsequent ones
    if i == 0:
        if_exists = 'replace'
    else:
        if_exists = 'append'
    df_import = df.iloc[i:i + stepsize]
    df_import.to_sql('tablename',
                     engine,
                     schema='tmp',
                     if_exists=if_exists,
                     index=False,
                     dtype=dtypes
                     )
    print(i, datetime.datetime.now() - batchstart)  # per-batch timing
I've timed the batches, and there is a clear breaking point in speed. The results are basically the same for batch sizes of 50k, 100k and 200k rows: it takes about 40 minutes to upload the first 6 million records, and another 2 hours and 20 minutes to upload the next 3 million.
My thinking was that it was either the growing size of the MSSQL table, or something being cached/saved after each upload. To test that, I've tried pushing the DataFrame into two different tables, and I've also tried something like expunge_all() on the SQLAlchemy session after each upload. Neither had any effect.
Manually stopping the import after 5 million records and restarting from row 5 million with a new engine object also hasn't helped.
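For reference, those two attempts looked roughly like the sketch below; the connection string and the session setup are placeholders rather than the actual code:

import sqlalchemy
from sqlalchemy.orm import sessionmaker

# Placeholder connection string -- the real one isn't shown here
conn_str = 'mssql+pyodbc://user:password@my_dsn'

# Clearing the ORM session after each batch (to_sql only uses the engine,
# so this was never expected to do much)
Session = sessionmaker(bind=engine)
session = Session()
session.expunge_all()

# Throwing away the old engine and its connection pool before resuming
engine.dispose()
engine = sqlalchemy.create_engine(conn_str)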
I'm all out of ideas about what might be causing the process to slow down so drastically, and I would really appreciate any help.
UPDATE
As a last resort I've reversed the loop, uploading batches of the DataFrame starting at the highest index and working down.
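A minimal sketch of the reversed loop, assuming the same df, engine, dtypes, maxrow and stepsize as above (the replace/append switch now keys off the highest starting index, which is the first batch processed):

first_batch_start = list(range(0, maxrow, stepsize))[-1]  # highest starting index

# Same batch boundaries as before, processed from the end of the DataFrame down
for i in reversed(range(0, maxrow, stepsize)):
    if_exists = 'replace' if i == first_batch_start else 'append'
    df_import = df.iloc[i:i + stepsize]
    df_import.to_sql('tablename',
                     engine,
                     schema='tmp',
                     if_exists=if_exists,
                     index=False,
                     dtype=dtypes)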
Reversing the loop basically reversed the per-batch times, so it seems it is the data itself that is different/bigger further down the DataFrame, not the connection being overloaded or the SQL table getting too large.
Thanks to everyone trying to help, but it seems I need to go through the data to see what causes this.
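One way to start that investigation, assuming the slowdown really does track the data, is to compare the in-memory size of each batch with DataFrame.memory_usage(deep=True), which counts the actual bytes held by object (string) columns:

# Rough per-batch size check: much wider string values further down the
# DataFrame should show up as noticeably larger batches here
for i in range(0, maxrow, stepsize):
    batch = df.iloc[i:i + stepsize]
    size_mb = batch.memory_usage(deep=True).sum() / 1024 ** 2
    print(f'rows {i}-{i + stepsize}: {size_mb:.0f} MB')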
