Optimizing MySQL InnoDB insert through PHP


I have a cron job script, written in PHP, with the following requirements:

  • Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here)
  • Step 2 (Application server): Perform some calculation
  • Step 3 (DB server 2): After the calculation, insert that data into another MySQL database/InnoDB table for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. This is a different physical DB server and has only one DB.

The script worked fine during development, but in production Step 1 returned approx 50 million records. As a result, obviously, the script ran for around 4 days and then failed. (By rough estimation, at the current rate it would have taken approx 171 days to finish.)

Just to note, we are using prepared statements and Step 1 fetches data in batches of 1,000 records at a time.

What we did till now

Optimization Step 1: Multiple values in insert & drop all indexes

Some tests showed that the insert (Step 3 above) is taking most of the time (more than 95%). To optimize it, after some googling, we dropped all indexes from the table and, instead of one insert query per row, we now issue one insert query per 100 rows. This gave us somewhat faster inserts, but as per a rough estimate it will still take 90 days to run the cron once, and we need to run it once every month as new data becomes available every month.
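For illustration, here is a minimal sketch of what that multi-row insert looks like with PDO. The `report_rates` table, its three columns, and `$batchOf100Rows` are placeholders; our real table has 97 columns, but the pattern is identical.

```php
<?php
// Minimal sketch of a multi-row INSERT built on prepared statements.
// Assumes a PDO connection ($pdo) and a hypothetical `report_rates` table
// with three columns; the real table has 97 columns, same pattern.

function insertBatch(PDO $pdo, array $rows): void
{
    if (empty($rows)) {
        return;
    }

    // One placeholder group "(?, ?, ?)" per row, all in a single statement.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $sql = "INSERT INTO report_rates (rate_a, rate_b, rate_c) VALUES $placeholders";

    // Flatten the row values into a single flat parameter list.
    $params = [];
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }

    $pdo->prepare($sql)->execute($params);
}

// Usage: flush every 100 calculated rows ($batchOf100Rows) in one query.
// Wrapping several batches in a transaction reduces fsync overhead further.
$pdo->beginTransaction();
insertBatch($pdo, $batchOf100Rows);
$pdo->commit();
```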

Optimization step 2: instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.

This step does not seem to be working. Writing 30,000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We use a single file handle for all write operations.
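For reference, the approach in this step is roughly the sketch below: buffer rows through the single file handle with fputcsv(), then bulk-load the file. The table, column, and file names are placeholders, not the actual repo code; the import could equally be done with mysqlimport from the command line.

```php
<?php
// Rough sketch of optimization step 2: write calculated rows through a single
// file handle with fputcsv(), then bulk-load the file. Table, column, and file
// names are placeholders.

$csvPath = '/tmp/report_rates.csv';
$fh = fopen($csvPath, 'w');

foreach ($calculatedBatches as $batch) {   // e.g. 5,000 calculated rows per batch
    foreach ($batch as $row) {
        fputcsv($fh, $row);                // one CSV line per record
    }
}
fclose($fh);

// The import can be done with mysqlimport on the command line, or directly
// with LOAD DATA LOCAL INFILE as shown here (requires local_infile enabled
// on both client and server).
$pdo = new PDO('mysql:host=db2;dbname=reporting', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);
$pdo->exec(
    "LOAD DATA LOCAL INFILE " . $pdo->quote($csvPath) . "
     INTO TABLE report_rates
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'"
);
```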

Current state

It seems I'm now clueless about what else can be done. Some key requirements:

  • The script needs to insert approx 50,000,000 records (this will increase with time)
  • There are 97 columns for each record; we can skip some, but 85 columns at the minimum.
  • Based on the input, we can break the script into three different cron jobs to run on three different servers, but the inserts have to be done on one DB server (master), so I'm not sure if that will help.

However:

  • We are open to changing the database/storage engine (including NoSQL)
  • In production, we could have multiple database servers, but inserts have to be done on the master only. All read operations can be directed to slaves; these are minimal and occasional (just to generate reports).

Question

I don't need a descriptive answer, but can someone briefly suggest possible solutions? I just need some optimization hints and I'll do the remaining R&D.

We are open to everything: changing the database/storage engine, server optimization/multiple servers (both DB and application), changing the programming language, or whatever configuration is best for the above requirements.

Final expectation: the cron job must finish in at most 24 hours.

Edit in optimization step 2

To further understand why generating the CSV is taking so long, I've created a replica of my code with only the necessary parts. That code is on git: https://github.com/kapilsharma/xz

The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt

If you check the above file, I'm inserting 500,000 records and fetching 5,000 records from the database at a time, making the loop run 100 times. The first loop took 0.25982284545898 seconds, but the 100th loop took 3.9140808582306 seconds. I assume it's because of system resources and/or the file size of the CSV file. In that case, it becomes more of a programming question than a DB optimization one. Still, can someone suggest why it takes more time in the later loops?

If needed, the whole code is committed except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.


There are 3 best solutions below

Answer 1

This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.

If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.

If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.

The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
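For example (all table and column names here are made up), the audit approach could be sketched like this:

```php
<?php
// Rough sketch of the delta/audit idea; all names (source_rates,
// source_rates_audit, ...) are made up for illustration.

// One-time setup: an audit table plus a trigger that logs changed ids.
// (A similar AFTER INSERT trigger would be needed for brand-new rows.)
$sourceDb->exec("
    CREATE TABLE source_rates_audit (
        id BIGINT NOT NULL,
        changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        KEY idx_changed_at (changed_at)
    )
");
$sourceDb->exec("
    CREATE TRIGGER trg_source_rates_audit
    AFTER UPDATE ON source_rates
    FOR EACH ROW
    INSERT INTO source_rates_audit (id) VALUES (NEW.id)
");

// Each cron run: process only rows changed since the previous run.
$stmt = $sourceDb->prepare("
    SELECT DISTINCT s.*
    FROM source_rates_audit a
    JOIN source_rates s ON s.id = a.id
    WHERE a.changed_at >= :last_run
");
$stmt->execute([':last_run' => $lastRunTimestamp]);
```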

Answer 2

I had a mailer cron job in CakePHP which failed on merely fetching 600 rows and sending emails to the registered users. It couldn't even perform the job in batch operations. We finally opted for Mandrill and since then it has all gone well.

I'd suggest (considering it a bad idea to touch the legacy system in production):

  • Schedule a micro solution in Golang or Node.js, considering performance benchmarks; as database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate).
  • Reporting from NoSQL will be challenging, so you should try out available services like Google BigQuery. Have the cron job store data in Google BigQuery and you should get a huge performance improvement, even in generating reports.

or

  • With each row inserted into your original DB server 1, set up a messaging mechanism which performs the operations of the cron job every time an insert is made (a sort of trigger) and stores the result in your reporting server; see the sketch below. Possible services you can use are Google Pub/Sub or Pusher. I think the per-insert time consumption will be pretty low. (You can also use an async service setup which does the task of storing into the reporting database.)
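A very rough sketch of that messaging idea, with a hypothetical MessagePublisher interface standing in for whichever service you end up choosing (the real client library calls will differ):

```php
<?php
// Very rough sketch of the per-insert messaging idea. MessagePublisher is an
// imaginary interface standing in for Google Pub/Sub, Pusher, or any queue.

interface MessagePublisher
{
    public function publish(string $topic, array $payload): void;
}

// On DB server 1's write path: publish each new row instead of waiting for a
// monthly cron to pick it up.
function onSourceRowInserted(MessagePublisher $publisher, array $row): void
{
    $publisher->publish('rates-to-report', $row);
}

// On the worker side: run the "Step 2" calculation and store the result in
// the reporting database (or BigQuery). calculateStep2() is a placeholder for
// the existing calculation; the keys of its result are assumed to be trusted
// column names.
function handleMessage(PDO $reportingDb, array $row): void
{
    $calculated = calculateStep2($row);
    $cols  = implode(', ', array_keys($calculated));
    $marks = implode(', ', array_fill(0, count($calculated), '?'));
    $reportingDb->prepare("INSERT INTO report_rates ($cols) VALUES ($marks)")
                ->execute(array_values($calculated));
}
```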

Hope this helps.

Answer 3

Using OFFSET and LIMIT to walk through a table is O(N*N), that is much slower than you want or expected.

Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for such. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
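A minimal sketch of that chunking pattern, assuming PDO and an AUTO_INCREMENT id (the table name and the processing callback are placeholders):

```php
<?php
// Keyset ("remember where you left off") chunking instead of OFFSET/LIMIT.
// Assumes an AUTO_INCREMENT primary key `id`; source_rates and processChunk()
// are placeholders.

$lastId = 0;
$chunk  = 5000;

$stmt = $sourceDb->prepare(
    "SELECT * FROM source_rates WHERE id > :last_id ORDER BY id LIMIT " . (int) $chunk
);

while (true) {
    $stmt->bindValue(':last_id', $lastId, PDO::PARAM_INT);
    $stmt->execute();
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break;                      // no more rows
    }

    processChunk($rows);            // calculate + write, as before
    $lastId = end($rows)['id'];     // resume after the last id seen
}
```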

It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.