I have a cron job script, written in PHP, with the following requirements:
- Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here).
- Step 2 (Application server): Perform some calculations.
- Step 3 (DB server 2): After the calculation, insert the data into another MySQL database/InnoDB table for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. It sits on a different physical DB server that hosts only this one database.
The script worked fine during development, but in production Step 1 returned approximately 50 million records. As a result, the script ran for around 4 days and then failed. (Rough estimate: at the current rate, it would have taken approximately 171 days to finish.)
Just to note, we were using prepared statements, and Step 1 fetches data in batches of 1,000 records at a time.
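For illustration, here is a minimal sketch of how the Step 1 fetch loop looks conceptually. The table/column names (source_rates, id) and the keyset-style pagination are placeholders and assumptions for this example, not my actual schema or code:

```php
<?php
// Sketch of Step 1: fetch source rows in batches of 1,000 with a prepared statement.
// 'source_rates' and 'id' are placeholder names, not the real schema.
$pdo = new PDO('mysql:host=db-server-1;dbname=source', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$batchSize = 1000;
$lastId    = 0;

// Keyset pagination: remember the last primary key seen instead of using OFFSET,
// so every batch is a cheap range scan no matter how deep into the data we are.
$stmt = $pdo->prepare(
    'SELECT * FROM source_rates WHERE id > :lastId ORDER BY id LIMIT ' . $batchSize
);

do {
    $stmt->execute([':lastId' => $lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // Step 2: perform the calculation, then hand the result to Step 3.
        $lastId = $row['id'];
    }
} while (count($rows) === $batchSize);
```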
What we have done so far
Optimization step 1: multiple values per INSERT & drop all indexes
Some tests showed that the insert (Step 3 above) was taking most of the time (more than 95%). To optimize it, after some googling, we dropped all indexes from the table and, instead of one INSERT query per row, we now issue one INSERT query per 100 rows. This gave us somewhat faster inserts, but by a rough estimate it would still take 90 days to run the cron once, and we need to run it once every month because new data becomes available every month.
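To make that concrete, here is a sketch of one way to build such a multi-row INSERT with PDO; the table name (report_rates) is a placeholder, and the column handling stands in for the real 97 columns:

```php
<?php
// Sketch of a multi-row INSERT: 100 rows per statement instead of one row each.
// 'report_rates' is a placeholder table name; $rows is an array of associative
// arrays that all share the same keys (the column names).
function insertBatch(PDO $pdo, array $rows): void
{
    if ($rows === []) {
        return;
    }

    $columns      = array_keys($rows[0]);
    $placeholders = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
    $sql          = sprintf(
        'INSERT INTO report_rates (%s) VALUES %s',
        implode(',', $columns),
        implode(',', array_fill(0, count($rows), $placeholders))
    );

    // Flatten the rows into one ordered list of bound values.
    $values = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $values[] = $row[$column];
        }
    }

    $pdo->prepare($sql)->execute($values);
}
```

Wrapping a number of these batches in a single transaction (beginTransaction()/commit()) usually helps further on InnoDB, because every auto-committed statement otherwise forces its own log flush.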
Optimization step 2: instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to be working. Writing 30,000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We use a single file handle for all write operations.
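For reference, a minimal sketch of the CSV route, assuming LOAD DATA LOCAL INFILE is allowed (PDO::MYSQL_ATTR_LOCAL_INFILE on the client and local_infile enabled on the server); the table name and file path are illustrative:

```php
<?php
// Sketch: write computed rows to CSV through a single file handle, then bulk-load
// the file in one statement. 'report_rates' and the path are placeholders.
$pdo = new PDO('mysql:host=db-server-2;dbname=reporting', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$csvPath = '/tmp/report_rates.csv';
$fh      = fopen($csvPath, 'w');

foreach ($calculatedRows as $row) {   // $calculatedRows: the output of Step 2
    fputcsv($fh, $row);
}
fclose($fh);

// One bulk load instead of millions of INSERT statements.
$quotedPath = $pdo->quote($csvPath);
$sql = <<<SQL
LOAD DATA LOCAL INFILE $quotedPath
INTO TABLE report_rates
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
SQL;
$pdo->exec($sql);
```

LOAD DATA is generally much faster than row-by-row INSERTs, and mysqlimport is essentially a command-line wrapper around the same statement, so the "import via a Linux command" step collapses into this one call.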
Current state
It seems I'm now out of ideas about what else can be done. Some key requirements:
- The script needs to insert approximately 50,000,000 records (and this will increase over time).
- There are 97 columns per record; we can skip some, but 85 columns is the minimum.
- Based on the input, we could split the script into three different cron jobs running on three different servers, but the insert has to be done on one DB server (the master), so I'm not sure that will help.
However:
- We are open to changing the database/storage engine (including NoSQL).
- In production, we could have multiple database servers, but inserts have to go to the master only. All read operations can be directed to the slaves; these are minimal and occasional (just to generate reports).
Question
I don't need a long, descriptive answer, but can someone briefly suggest a possible solution? I just need some optimization hints and I'll do the remaining R&D.
We are open to anything: changing the database/storage engine, server optimization, multiple servers (both DB and application), changing the programming language, or whatever is the best configuration for the above requirements.
Final expectation: the cron job must finish in 24 hours at most.
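To put that target in numbers: 50,000,000 rows in 24 hours means the whole pipeline has to sustain roughly 50,000,000 / 86,400 ≈ 580 rows per second, end to end.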
Edit on optimization step 2
To better understand why generating the CSV takes so long, I've created a replica of my code containing only the necessary parts. The code is on GitHub: https://github.com/kapilsharma/xz
The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the above file, you'll see I'm inserting 500,000 records, fetching 5,000 records from the database at a time, so the loop runs 100 times. The first loop took 0.25982284545898 seconds, but the 100th took 3.9140808582306. I assume it's because of system resources and/or the growing size of the CSV file. In that case, it becomes more of a programming question than a DB optimization one. Still, can someone suggest why each subsequent loop takes more time?
If needed, the whole code is committed, except the CSV files and the SQL file used to create the dummy DB, as these files are very big. However, they can easily be regenerated with the code.
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to redesign this is to find a way to handle only the deltas within the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out only the data that was modified since the last batch run, as sketched below.
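A rough sketch of that idea, assuming a change_log table (populated by the application or a trigger whenever a source record changes) plus a small bookkeeping table; all names here are illustrative, not your actual schema:

```php
<?php
// Sketch of the audit-table / delta approach: only rows recorded in change_log
// since the last successful run are recalculated and re-inserted.
// 'change_log', 'cron_state' and 'source_rates' are illustrative names.
$pdo = new PDO('mysql:host=db-server-1;dbname=source', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Capture the window (lastProcessedId, maxChangeId] up front, so changes made
// while this run is in progress are left for the next run.
$lastProcessedId = (int) $pdo->query('SELECT COALESCE(MAX(last_change_id), 0) FROM cron_state')->fetchColumn();
$maxChangeId     = (int) $pdo->query('SELECT COALESCE(MAX(id), 0) FROM change_log')->fetchColumn();

$stmt = $pdo->prepare(
    'SELECT s.*
       FROM change_log c
       JOIN source_rates s ON s.id = c.record_id
      WHERE c.id > :from AND c.id <= :to'
);
$stmt->execute([':from' => $lastProcessedId, ':to' => $maxChangeId]);

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // Recalculate and upsert only this changed record in the reporting table.
}

// Remember how far we got, so the next monthly run only sees new changes.
$pdo->prepare('UPDATE cron_state SET last_change_id = :to')
    ->execute([':to' => $maxChangeId]);
```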