Architecture Best Practices for a Data Pipeline


Given:

Some kind of data import from an external source. The data can be read in chunks of a defined size, for example 10 items at once (e.g. emails).

Each chunk then has to pass through several steps that transform the data, filter items out, and so on.

There is no relation between the chunks or between the items within a chunk, and the order of processing isn't important.

Question

I'm wondering what kind of structure would be right if I implement this with Akka, to get the best parallelization and performance.

1.) Would I rather create all actors as a chain of children, so that the ImportActor has a child that is the first step, the first step has the second step as a child, and so on?
Or would I rather have one ImportActor that contains all the steps and calls them one after the other?

2.) An actor can only process one message at a time. To parallelize the import process I'm thinking about using the PipeTo mechanism. Is this a good idea? Are there better options?

3.) Would I create an actor for each chunk, like "Import_Chunk1_Actor", or would I push all messages to a single "ImportActor"?


There is 1 best solution below.

BEST ANSWER

If you asked a question like this anywhere else on SO you would get hammered. It's a bit vague and easy to be opinionated about, so I will try to be objective.

I'd say just try it several ways without spending time on the code that does the actual work. It's really quick to do scaffolding-style work.

1) From what you have described, you would have an input and then a number of actors representing the "10 items at once"; these would probably just sit behind a router. While developing, you wouldn't worry about there being 10 of them: build it for one, and later scale it up with config and a tiny tweak. As you suggest, you may only need one actor doing all the work if you use tasks. Within each of these I would have the processing steps at the same level. Much depends on whether you hold any state at this point: you could use the become semantics to lock the actor into a specific workflow, or you could handle messages that kick off the next stage in a task which then tells the actor to do the following stage. I think your suggestion of a chain of child actors is the least appealing.
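As a sketch of the "develop for one, scale via config" idea, assuming Akka's classic deployment configuration and a hypothetical worker path of `/import/workers` (both names are placeholders, not from the question):

```
akka.actor.deployment {
  /import/workers {
    # Distribute chunks across identical, stateless workers.
    router = round-robin-pool
    # Develop with 1; scaling to 10 later is only a config change.
    nr-of-instances = 1
  }
}
```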

2) An actor only processes one message at a time, so if you want high throughput you reduce the time it spends processing each message. You do this either via a task or by passing work off to another actor, e.g. a worker or an aggregator. PipeTo is useful in scenarios where the task producing the message already produces a message of the correct type to send to another actor and you don't need to do anything else with it; it's just a continuation. There's nothing wrong with that: the parts of the actor system where you eventually do some real work you are probably going to wrap in a task anyway, and if you can, use it. Some form of continuation is better than a blocking actor, but if the actor will only ever be doing one thing at a time, does it matter? A blocked thread is a blocked thread. The thing to bear in mind with tasks is that you probably started using something like Akka because of the pitfalls of task/concurrency-based programming, and you can easily buy all of that back.
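A minimal sketch of the PipeTo-as-continuation pattern in Akka's classic Scala API; the message types (`Chunk`, `TransformedChunk`) and the upper-casing "transform" are illustrative assumptions, not part of the question:

```scala
import akka.actor.{Actor, ActorRef}
import akka.pattern.pipe
import scala.concurrent.Future

// Hypothetical messages for one pipeline stage.
case class Chunk(items: Seq[String])
case class TransformedChunk(items: Seq[String])

class TransformActor(next: ActorRef) extends Actor {
  import context.dispatcher // ExecutionContext for the Future and pipe

  def receive: Receive = {
    case Chunk(items) =>
      // Do the work off the actor's message loop, then pipe the result
      // straight to the next stage -- a pure continuation, with no
      // further handling in this actor.
      Future(TransformedChunk(items.map(_.toUpperCase))).pipeTo(next)
  }
}
```

The actor stays free to pick up the next chunk while the task runs, which is exactly the "reduce the time spent processing" point above.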

3) When you get to this point it will be obvious. You would probably use a router if you had multiple actors; or, if you start firing up lots of tasks, you can probably do most of this with 1-2 actors and several chained messages. As to which is better: I know three people who use actor systems and could argue all day about the relative merits. You could just use one actor that handles messages and fires off tasks, if you eliminate all state in the actor. You could have 2-3 layers; you might have something to aggregate the tasks and/or the 10 workers. The world is your oyster.
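If you strip the actors away entirely, the "one actor firing off tasks" variant boils down to running each chunk through the stages as an independent task. A stdlib-only Scala sketch, where the transform and filter steps are placeholders for the real logic:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Placeholder pipeline steps -- stand-ins for the real transform/filter work.
def transform(item: String): String = item.toUpperCase
def keep(item: String): Boolean = item.nonEmpty

// Chunks are independent and unordered, so each one can be its own task.
val chunks = List(List("a", "b"), List("", "c"))
val processed = chunks.map(chunk => Future(chunk.filter(keep).map(transform)))
val results = Await.result(Future.sequence(processed), 10.seconds)
// results == List(List("A", "B"), List("C"))
```

Each chunk runs in parallel on the execution context's thread pool; `Future.sequence` only gathers the results at the end, so no ordering between chunks is imposed during processing.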

The point is that it all depends on requirements that are not stated.