Load balancing strategy for data reading and writing cluster


I am working on a Java application cluster that can migrate data from one data source to another (e.g. database to file, file to database, Oracle to MSSQL, and vice versa).

I am using JGroups for task distribution and cluster management. An example configuration would be 4 servers, each running an instance of this identical application (presumably all servers with the same hardware configuration).

Before asking my question, I should describe the nature of the tasks I want to distribute:

For example, one of the nodes might be asked to copy an Oracle table to another database. In this case, I resolve ROWID ranges of the table so that the data can be queried in parallel over different connections (DIY parallelism). Each node then gets 1/4 of the source data and writes it to the target database in parallel.

                   Buffer On Node    
                  ---> Node 1 --->
                  ---> Node 2 --->
 Source Table                        Target Table
                  ---> Node 3 --->
                  ---> Node 4 --->

If this were the only case, I would simply distribute each branch (ROWID range) of a single table copy to one of the nodes.
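The chunk-to-node assignment could be sketched like this. This is only an illustration with hypothetical names: the ROWID range chunks themselves are assumed to have been computed elsewhere (e.g. Oracle's DBMS_PARALLEL_EXECUTE package can create chunks by ROWID), and they are represented as plain strings here:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Round-robin assignment of pre-computed ROWID range chunks to cluster nodes.
// With 4 nodes, each node ends up with roughly 1/4 of the chunks.
public class ChunkDistributor {

    public static Map<String, List<String>> assign(List<String> chunks, List<String> nodes) {
        Map<String, List<String>> plan = new HashMap<>();
        for (String node : nodes) {
            plan.put(node, new ArrayList<>());
        }
        for (int i = 0; i < chunks.size(); i++) {
            // chunk i goes to node (i mod nodeCount)
            plan.get(nodes.get(i % nodes.size())).add(chunks.get(i));
        }
        return plan;
    }
}
```

Each node would then open its own source connection, query only its assigned ranges, and write them to the target in parallel with the other nodes.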

But another case confuses me: there may be complex queries, or views, requested to be copied to a target table. I can't distribute this data, since no data blocks or ROWIDs are involved, so there will be requests I cannot split across nodes. For these I read the query results over a single connection on a single node and write them to the target table in parallel (say, with a parallel degree of 4; I am aware that writing in parallel only helps if reading is faster than writing, or so I think). The same applies when, say, an Excel file is to be copied into a database table.

                                                 --->     
                                                 --->
Source( Queried data, file, etc. )  ---> Node 1          Target
                                                 --->
                                                 --->
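The single-reader / parallel-writer arrangement above could be sketched with one producer thread feeding a bounded buffer and N consumer threads draining it. This is a minimal, self-contained sketch: records are plain strings and the "target table" is an in-memory queue, whereas the real code would do batched INSERTs over per-writer connections:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// One reader thread feeds a bounded buffer; writerCount threads drain it.
public class SingleReaderParallelWriter {

    private static final String POISON = "__EOF__"; // end-of-stream marker

    public static Queue<String> copy(List<String> source, int writerCount) throws InterruptedException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1000); // 1000-record buffer
        Queue<String> target = new ConcurrentLinkedQueue<>();          // stands in for the target table

        ExecutorService writers = Executors.newFixedThreadPool(writerCount);
        for (int i = 0; i < writerCount; i++) {
            writers.submit(() -> {
                try {
                    for (String rec = buffer.take(); !rec.equals(POISON); rec = buffer.take()) {
                        target.add(rec); // real code: batched INSERT on this writer's own connection
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (String rec : source) { // single-connection read
            buffer.put(rec);        // blocks when the buffer is full, throttling the reader
        }
        for (int i = 0; i < writerCount; i++) {
            buffer.put(POISON);     // one marker per writer so each terminates
        }
        writers.shutdown();
        writers.awaitTermination(1, TimeUnit.MINUTES);
        return target;
    }
}
```

The bounded buffer also makes the "writing only helps if reading is faster" point concrete: if the writers keep up, the reader never blocks and the extra writers are idle capacity.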

So I am not quite sure how to distribute the tasks when, say, two queries and three tables are requested to be copied.

One solution I thought of is to estimate the average record size of each data request and from that calculate the maximum data load on a server at any instant.

If a table is to be copied in parallel with an average record size of 100 bytes and a buffer of 1000 records, each node would hold at most 100 * 1000 = 100,000 bytes for this task until the copy is finished.

On the other hand, if there is a file request with, say, an average record size of 200 bytes, it would be copied on a single node and the load there would be 200 * 1000 = 200,000 bytes.

So I would be able to place a new task judging by the current loads on the nodes.
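The estimate and the placement decision are simple enough to sketch directly. This assumes (my label, not a given) that each node reports its current buffered-bytes total; the coordinator then assigns a new task to the node with the smallest total:

```java
import java.util.Map;

// Estimated per-task load = bytes a task keeps buffered on a node at once.
public class LoadBalancer {

    public static long estimateLoad(long avgRecordSizeBytes, long bufferRecords) {
        return avgRecordSizeBytes * bufferRecords;
    }

    // Pick the node with the smallest current estimated load.
    public static String leastLoadedNode(Map<String, Long> currentLoads) {
        String best = null;
        for (Map.Entry<String, Long> e : currentLoads.entrySet()) {
            if (best == null || e.getValue() < currentLoads.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

For the examples above: the parallel table copy adds 100 * 1000 = 100,000 bytes to each of the four nodes, while the single-node file copy adds 200 * 1000 = 200,000 bytes to whichever node `leastLoadedNode` returns.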

This solution is, I admit, highly hypothetical :) So my questions are:

  • Would it be effective to estimate load based on average record size?
  • What difficulties would I face in calculating the average record size?
  • Is there any other solution you can think of? What is the best way to balance the load for this kind of I/O-bound task distribution?

I hope this did not sound too complicated.
