Background
I need to process close to 60 million records from a MySQL database using JDBC. The only interaction with the database is a SELECT based on the primary key, id; the fields fetched are id, name and contents. Separate asynchronous threads process the fetched data and do not touch the database.
The program takes a range of ids as arguments (e.g. 1 to 100, 101 to 200, etc.), and multiple instances run on separate machines, so each instance processes a different range of ids.
This is a one-time job to create backfill data, but it hits the production DB, albeit only for SELECTs.
Observation
mysql> SHOW STATUS WHERE `variable_name` = 'Threads_connected';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 12    |
+-------------------+-------+
When I used the range 25 to 95, which covers only 20 records, I still observed 12 connections, and they stayed open for the entire duration of processing.
Code
public void processRange(String fromId, String toId) {
    if (fromId.isEmpty() || toId.isEmpty()) {
        return;
    }
    // Bind the range as parameters instead of concatenating it into the SQL,
    // which avoids SQL injection and lets the driver handle quoting.
    String sql = "SELECT id, name, contents FROM sometable WHERE id >= ? AND id <= ?";
    // try-with-resources closes the result set, statement and connection
    // automatically, even when an exception is thrown.
    // (Class.forName("com.mysql.jdbc.Driver") is unnecessary with JDBC 4 drivers.)
    try (Connection connection = DriverManager.getConnection(this.url, this.user, this.password);
         PreparedStatement statement = connection.prepareStatement(sql)) {
        statement.setLong(1, Long.parseLong(fromId));
        statement.setLong(2, Long.parseLong(toId));
        try (ResultSet results = statement.executeQuery()) {
            while (results.next()) {
                String name = results.getString("name");
                String contents = results.getString("contents");
                if (name != null && !name.isEmpty() && contents != null && !contents.isEmpty()) {
                    // PROCESS using async futures
                }
            }
        }
    } catch (SQLException | NumberFormatException e) {
        e.printStackTrace();
    }
}
Problem
This is pretty simple JDBC code. If I run multiple instances of this process as described in the Background, how do I make sure the DB does not go down with a "Too many connections" error? Is there a way to just fetch the data, close the connection, and then process the data asynchronously "offline", so that connections are released as soon as possible?
On the other hand, I think looping through the range and fetching the data one id at a time would be the worst way to solve this, compared to fetching the whole range in one batch. Please comment if anybody thinks otherwise.
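One possible shape for the "fetch, close, then process offline" idea: copy the rows out of the ResultSet into a plain list while the connection is open, close the connection, and only then hand the list to the async workers, so the connection is held only for the duration of the SELECT. The sketch below simulates the fetched rows with a plain list; `Row`, `processRow` and the pool size are hypothetical stand-ins, not from the actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchThenProcess {

    // Hypothetical row holder; in the real code this would be filled
    // from the ResultSet inside the try-with-resources block.
    static class Row {
        final long id;
        final String name;
        final String contents;
        Row(long id, String name, String contents) {
            this.id = id;
            this.name = name;
            this.contents = contents;
        }
    }

    static final AtomicInteger processed = new AtomicInteger();

    // Stand-in for the real async work done per row.
    static void processRow(Row row) {
        processed.incrementAndGet();
    }

    // By the time this runs, the JDBC connection can already be closed:
    // the rows are buffered in memory and no DB access happens here.
    static void processOffline(List<Row> rows) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (Row row : rows) {
                futures.add(CompletableFuture.runAsync(() -> processRow(row), pool));
            }
            // Wait for all async work to finish.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Simulated result of the range SELECT (ids 25..44, i.e. 20 rows);
        // in production this list would be returned from the fetch method
        // after its connection has been closed.
        List<Row> rows = new ArrayList<>();
        for (long id = 25; id <= 44; id++) {
            rows.add(new Row(id, "name" + id, "contents" + id));
        }
        processOffline(rows);
        System.out.println("processed " + processed.get() + " rows");
    }
}
```

The point of the pattern is that connection lifetime is decoupled from processing time: even slow downstream work no longer keeps a DB connection occupied.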
UPDATE
At the current speed it will take months to finish the task! Is there any other parallelism approach to make the whole job faster?
I am also trying to use MapReduce to achieve the same thing, though I am not sure whether that is a valid use case.
The above observation was based on a local machine. When I tested on a relatively powerful box, the number of connections was very small compared to the maximum. I am running 5 instances of this process on the same machine, each working on a range of 10,000 ids (50K in total) processed via the result set. Even so, there are only 17 connections against a maximum of 500, but the job still takes a long time to finish and CPU usage shoots to 99%. So it needs some tweaking to run faster, or to be distributed across other machines.
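One way to get this parallelism out of a single JVM, instead of launching 5 separate processes by hand, is to partition the id range and hand each sub-range to a worker thread that opens its own connection. This is only a sketch; `RangePartitioner`, the chunk size and the worker body are my own names and choices, and the worker just prints where the real `processRange` call would go:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RangePartitioner {

    // A closed sub-range of ids, e.g. [1, 10000].
    static class Range {
        final long from;
        final long to;
        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    // Split [from, to] into consecutive chunks of at most chunkSize ids each.
    static List<Range> partition(long from, long to, long chunkSize) {
        List<Range> ranges = new ArrayList<>();
        for (long start = from; start <= to; start += chunkSize) {
            ranges.add(new Range(start, Math.min(start + chunkSize - 1, to)));
        }
        return ranges;
    }

    public static void main(String[] args) throws Exception {
        // 50K ids in chunks of 10K -> 5 sub-ranges, one worker thread each,
        // so at most 5 DB connections: well under the max of 500.
        List<Range> ranges = partition(1, 50_000, 10_000);
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        for (Range r : ranges) {
            pool.submit(() -> {
                // Each worker would open its own connection here and call
                // processRange(String.valueOf(r.from), String.valueOf(r.to)).
                System.out.println("worker for ids " + r.from + ".." + r.to);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

With this shape, scaling out to other machines is just a matter of giving each machine a disjoint top-level range and letting it partition that range internally.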
It turned out the problem was not with the DB at all. The bottleneck was FTP: images were being created from the info fetched from the DB and then FTPed to another host. Instead of putting each image on the FTP server individually, multiple images were zipped together and uploaded as a single archive.
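The zipping step can be sketched with the standard java.util.zip classes; the class and method names here are mine, not from the actual code, and the byte arrays stand in for real image data:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ImageBatcher {

    // Bundle several in-memory images into one zip archive, so a single
    // file is uploaded over FTP instead of one transfer per image.
    static byte[] zipImages(Map<String, byte[]> images) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> image : images.entrySet()) {
                zip.putNextEntry(new ZipEntry(image.getKey()));
                zip.write(image.getValue());
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> images = new TreeMap<>();
        images.put("1.png", new byte[] {1, 2, 3});
        images.put("2.png", new byte[] {4, 5, 6});
        byte[] archive = zipImages(images);
        // The single archive is what gets FTPed, not the individual files.
        System.out.println("archive of " + images.size() + " images, "
                + archive.length + " bytes");
    }
}
```

Batching like this cuts the per-transfer FTP overhead (connection setup, per-file round trips), which is usually what dominates when uploading many small files.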