Pattern for batching and bulk processing

I have data queried from the database that looks like the following:

Job  Site   File        List
-------------------------------
1    SiteA  file2.txt   2
2    SiteB  file2.txt   2
3    SiteA  file23.txt  23
4    SiteC  file2.txt   2
5    SiteB  file12.txt  12
6    SiteA  file29.txt  29
7    SiteB  file28.txt  28

I need to create an instance for each site (sites A, B and C) and then process that site's files. For example, for SiteA, I would work on file2.txt, file23.txt and file29.txt. This processing can happen in any order, but it has to be one file after the other (not simultaneous).

So my first task is to collate the sites and create an instance for each. How do I do this?

PS: I figured that for the processing part I should use some sort of iterator pattern. I prefer solutions in any modern compiled language, such as C#, VB, C++, etc.

There are 3 best solutions below


I'm not sure if I have understood what you want to achieve, but I think that what you need to do is:

  1. Get the distinct values of Site from your data
  2. For each of the values obtained in the previous step
    2.1. Instantiate the site
    2.2. Get all the associated values of File and process each one

In C# using LINQ, it would be something like this:

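// Requires: using System.Linq;
// Assumes data is a collection of rows exposing Site and File properties,
// and that Site is your own class with a ProcessFile method.
var siteNames = data.Select(d => d.Site).Distinct();
foreach (var siteName in siteNames) {
    var site = new Site(siteName); // Or use a factory method, a sites list, etc.
    // Filter on the site name (a string), not on the Site instance
    var files = data.Where(d => d.Site == siteName).Select(d => d.File);
    foreach (var file in files) {
        site.ProcessFile(file); // Files are handled one after the other
    }
}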

I would have a visitor object that holds a hash table of sites and iterates over each entry.

For each entry:

  - if the entry's site didn't exist in the hash table yet, the visitor object would instantiate a site for it and store it there;
  - the visitor object would then get the file and process it.

This would have the advantage that you would do the grouping and processing in one pass.
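
The single-pass idea above can be sketched roughly as follows, in C++ since the question lists it as acceptable. This is only an illustration under assumptions: `JobRow`, `Site`, `processFile` and `processAll` are hypothetical names, and `processFile` merely records the file name instead of doing real work.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type mirroring the columns shown in the question.
struct JobRow {
    int job;
    std::string site;
    std::string file;
};

// Hypothetical site worker; processFile only records the file name here.
class Site {
public:
    explicit Site(std::string name) : name_(std::move(name)) {}
    void processFile(const std::string& file) { processed_.push_back(file); }
    const std::string& name() const { return name_; }
    const std::vector<std::string>& processed() const { return processed_; }
private:
    std::string name_;
    std::vector<std::string> processed_;
};

// Visits every row once: the hash table gains a Site the first time a site
// name is seen, and the row's file is then processed on that Site.
std::unordered_map<std::string, Site> processAll(const std::vector<JobRow>& rows) {
    std::unordered_map<std::string, Site> sites;
    for (const JobRow& row : rows) {
        // emplace leaves the table unchanged if the site is already present
        // and returns an iterator to the existing or newly added entry.
        auto it = sites.emplace(row.site, row.site).first;
        it->second.processFile(row.file);  // sequential, one file at a time
    }
    return sites;
}
```

Because the loop handles rows strictly in sequence, files are never processed simultaneously, which matches the constraint in the question. Files for different sites are interleaved in row order rather than grouped per site.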


I endorse the idea of a worker for each site.

One additional problem - application state.

When your app starts, does it care what happened before? Consider an application crash after a few of the files have been processed. On restart, the app will presumably process those files again. Is that a problem?
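
If reprocessing is a problem, one possible approach (not the only one) is to record each completed job and skip recorded jobs on restart. The C++ sketch below assumes the Job column is a unique identifier; `ProgressLog`, `processPending` and the log file format are hypothetical names and choices made for illustration.

```cpp
#include <fstream>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type mirroring the columns shown in the question.
struct JobRow {
    int job;
    std::string site;
    std::string file;
};

// Hypothetical progress log: one completed job ID per line in a text file.
class ProgressLog {
public:
    explicit ProgressLog(std::string path) : path_(std::move(path)) {
        std::ifstream in(path_);
        int id;
        while (in >> id) done_.insert(id);  // reload jobs finished before a restart
    }
    bool isDone(int job) const { return done_.count(job) != 0; }
    void markDone(int job) {
        std::ofstream out(path_, std::ios::app);  // append, keep earlier entries
        out << job << '\n';
        out.flush();
        done_.insert(job);
    }
private:
    std::string path_;
    std::set<int> done_;
};

// Processes only the rows not yet recorded as done and returns how many
// rows were processed in this run.
int processPending(const std::vector<JobRow>& rows, ProgressLog& log,
                   const std::function<void(const JobRow&)>& process) {
    int count = 0;
    for (const JobRow& row : rows) {
        if (log.isDone(row.job)) continue;  // finished in an earlier run
        process(row);
        log.markDone(row.job);  // recorded only after processing returns
        ++count;
    }
    return count;
}
```

A crash between `process` and `markDone` would still cause that one job to run again after restart, so this sketch gives at-least-once behaviour. Whether that is acceptable depends on whether processing a file twice is harmless.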