Insert large Excel file into database

I'm using Spout to read an Excel file with over 500,000 records (7 columns each, so not a lot of data per row).

The problem is that my script times out. I've tried raising the limits and it gets better, but so far I haven't managed a complete insert, only partial ones of around 50,000 rows.

That's not an option for me. Is there any way to split this Excel file, but in the code? What I see is that just manipulating the file, even without inserting into the database, is already slow and times out.

So... any advice?

Thanks!

There are 3 answers below.

Reading a file with 3,500,000 cells is not going to be fast, no matter what. Even on powerful hardware, and even if the Excel file uses inline strings, it will take at least a minute.

So here are the options you have:

  1. If you control the creation of the Excel file you're reading, make sure it uses inline strings (that's Spout's default behavior when writing). This speeds up reading dramatically. The slowness you mentioned even when reading only the first 2 lines is due to this: when inline strings are not used, Spout first has to pre-process the shared strings file containing the cell values, because cells only reference them. With inline strings, Spout can skip that expensive step and do true streaming.
  2. Increase the time limit to give your script more time to finish its processing (set_time_limit).
  3. Batch your DB inserts: instead of inserting rows one by one, insert them 1,000 at a time (or more). Each round trip to the DB takes some time, so limiting them is a good idea. A sketch combining options 2 and 3 follows this list.
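
Here is that sketch. It assumes Spout 3's ReaderEntityFactory API and a PDO connection; the table and column names (my_table, col1..col7) and the connection details are placeholders for your own.

  <?php
  // Minimal sketch: stream the XLSX with Spout and insert in batches of 1,000 rows.
  use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

  require 'vendor/autoload.php';

  set_time_limit(0); // option 2: lift the script time limit (if your host allows it)

  $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass', [
      PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
  ]);

  // Option 3: flush one batch as a single multi-row INSERT.
  function flushBatch(PDO $pdo, array $batch): void
  {
      if (empty($batch)) {
          return;
      }
      $rowPlaceholder = '(?, ?, ?, ?, ?, ?, ?)';
      $sql = 'INSERT INTO my_table (col1, col2, col3, col4, col5, col6, col7) VALUES '
           . implode(', ', array_fill(0, count($batch), $rowPlaceholder));
      $pdo->prepare($sql)->execute(array_merge(...$batch)); // flatten rows into one parameter list
  }

  $batchSize = 1000;
  $batch     = [];

  $reader = ReaderEntityFactory::createXLSXReader();
  $reader->open('big-file.xlsx');

  foreach ($reader->getSheetIterator() as $sheet) {
      foreach ($sheet->getRowIterator() as $row) {
          // Normalise each row to the 7 expected columns.
          $batch[] = array_pad(array_slice($row->toArray(), 0, 7), 7, null);

          if (count($batch) >= $batchSize) {
              flushBatch($pdo, $batch);
              $batch = [];
          }
      }
  }

  flushBatch($pdo, $batch); // insert the last partial batch
  $reader->close();

Each batch is a single round trip and a single commit, which is where most of the time usually goes.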

Splitting the file may work, but it needs to be done ahead of time (not in the same script, otherwise it will just add to the total processing time...).

Hope that helps!

The best way is to perform this job in the background, in these steps (a worker sketch follows the list):

  1. Upload the Excel file to the server and add a row to an import-job table with status 0: Waiting.
  2. Set up a cron job that checks this table and runs the import whenever it finds a row with status 0. Update that row's status to 1: Processing, then run the import service (a batch import is a good solution here).
  3. Update the status to 2: Completed when the import finishes successfully.
  4. If an error occurs, update the status to a failure value (e.g. 3: Failed) so the job can be inspected or retried.
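
A minimal sketch of the cron-side worker, assuming a hypothetical import_jobs table with columns (id, file_path, status) and the statuses above; importSpreadsheet() stands in for your actual batch-import service:

  <?php
  // Minimal worker sketch, run by cron, e.g.: * * * * * php /path/to/import_worker.php
  // Assumes a hypothetical import_jobs(id, file_path, status) table;
  // statuses: 0 = waiting, 1 = processing, 2 = completed, 3 = failed.
  require 'vendor/autoload.php';

  $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass', [
      PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
  ]);

  // Claim the oldest waiting job, if any.
  $job = $pdo->query('SELECT id, file_path FROM import_jobs WHERE status = 0 ORDER BY id LIMIT 1')
             ->fetch(PDO::FETCH_ASSOC);

  if ($job === false) {
      exit; // nothing to import on this run
  }

  $pdo->prepare('UPDATE import_jobs SET status = 1 WHERE id = ?')->execute([$job['id']]);

  try {
      importSpreadsheet($pdo, $job['file_path']); // your batch-import service (hypothetical)
      $pdo->prepare('UPDATE import_jobs SET status = 2 WHERE id = ?')->execute([$job['id']]);
  } catch (Throwable $e) {
      $pdo->prepare('UPDATE import_jobs SET status = 3 WHERE id = ?')->execute([$job['id']]);
  }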

You can try calling set_time_limit() repeatedly, for example after every row you insert. It resets the time limit each time you call it. However, if your server administrator has set up a global hard limit, this won't let you exceed it.
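
For example (a sketch; insertRow() is just a stand-in for whatever insert logic you use):

  foreach ($rows as $row) {
      set_time_limit(30);    // restart the 30-second budget on every iteration
      insertRow($pdo, $row); // your insert logic (hypothetical helper)
  }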

But inserting half a million rows one by one into an InnoDB table in MySQL is inherently slow because it needs to do an autocommit after every row.

If you do the inserts in batches you'll gain a lot of speed. For example, you are probably doing something like this now:

  INSERT INTO my_table (col1, col2, col3) VALUES (1, 'baker', 'charlie');
  INSERT INTO my_table (col1, col2, col3) VALUES (2, 'delta', 'echo');
  INSERT INTO my_table (col1, col2, col3) VALUES (3, 'foxtrot', 'golf');
  INSERT INTO my_table (col1, col2, col3) VALUES (4, 'hotel', 'india');
  INSERT INTO my_table (col1, col2, col3) VALUES (5, 'lima', 'mike');

Instead do this:

  INSERT INTO my_table (col1, col2, col3) VALUES
     (1, 'baker', 'charlie'),
     (2, 'delta', 'echo'),
     (3, 'foxtrot', 'golf'),
     (4, 'hotel', 'india'),
     (5, 'lima', 'mike');

That way you'll incur MySQL's commit overhead once for every five rows rather than once per row. Note that you can put many rows into a single INSERT, not just five. MySQL's only limit is on the total query length, which you can check with SHOW VARIABLES LIKE 'max_allowed_packet';.
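
If you want to size your batches against that limit, you can also read it from PHP (a sketch, assuming a PDO connection in $pdo):

  // Read MySQL's max_allowed_packet (in bytes) so each multi-row INSERT stays under it.
  $row = $pdo->query("SHOW VARIABLES LIKE 'max_allowed_packet'")->fetch(PDO::FETCH_ASSOC);
  $maxPacketBytes = (int) $row['Value'];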

Of course this is a little more complex to program, but it's much faster.