I have a problem where a lot of data comes into a system; as it arrives it gets stored on disk and then goes through a pipeline where it is transformed.
Each step in the pipeline may or may not be CPU-bound, but my overall constraint is memory, which means I have to cap how much data is in flight at once.
After trying an initial implementation with BoundedChannels, it was easy to see that memory consumption grew rapidly beyond what would be available (my development machine has a lot more memory, so here it was OK, but it would crash in production)...
The overall concept was:
Multiple Producers -> BoundedChannel -> Multiple Transform Consumers -> BoundedChannel ... etc..
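To make the structure concrete, here is a minimal sketch of that producer → bounded queue → transform → bounded queue → sink shape. It is in Python asyncio rather than .NET (my original code uses System.Threading.Channels); the queue sizes, stage names, and `None` sentinel convention are just illustrative, but the back-pressure behaviour is the same: `put()` blocks when a stage's queue is full.

```python
import asyncio

async def producer(out_q, items):
    # Push raw items; put() blocks (back-pressure) when the queue is full.
    for item in items:
        await out_q.put(item)
    await out_q.put(None)  # sentinel: no more items

async def transform(in_q, out_q, fn):
    # A transform stage: pull from one bounded queue, push to the next.
    while (item := await in_q.get()) is not None:
        await out_q.put(fn(item))
    await out_q.put(None)  # forward the sentinel downstream

async def main():
    # Two bounded queues = two independently capped stages (cap 100 each).
    q1 = asyncio.Queue(maxsize=100)
    q2 = asyncio.Queue(maxsize=100)
    results = []

    async def sink(in_q):
        while (item := await in_q.get()) is not None:
            results.append(item)

    await asyncio.gather(
        producer(q1, range(10)),
        transform(q1, q2, lambda x: x * 2),  # stand-in for real work
        sink(q2),
    )
    return results

print(asyncio.run(main()))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Each queue enforces its own cap independently, which is exactly what leads to the problem below.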
The problem is that if each channel allows, say, 100,000 elements, everything is fine while only the first channel fills, but the limit for the pipeline as a whole grows with each step: with 3 steps the effective limit is suddenly 100,000 * 3 = 300,000...
Of course we could just set each channel's limit to ~33,000 so the whole pipeline stays at roughly 100,000, but I would rather have a single limit balanced across all channels, so we don't have to re-tune it whenever we add a step or shift capacity to wherever the work is most intensive...
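The kind of behaviour I am after could be sketched like this (again in Python asyncio as a language-neutral illustration; `run_pipeline`, the token-semaphore approach, and the limit of 8 are my own hypothetical names and numbers, not something from an existing framework): one semaphore acts as a pipeline-wide budget, acquired when an item enters the first stage and released only when it leaves the last, so adding stages does not raise the total.

```python
import asyncio

async def run_pipeline(items, total_limit=8):
    # One semaphore caps items in flight across ALL stages combined,
    # so the pipeline-wide ceiling stays total_limit however many
    # stages exist. The queues themselves can then be unbounded,
    # because the semaphore gates entry into the pipeline.
    budget = asyncio.Semaphore(total_limit)
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    results = []

    async def producer():
        for item in items:
            await budget.acquire()   # take a token as the item enters
            await q1.put(item)
        await q1.put(None)           # sentinel: no more items

    async def transform():
        while (item := await q1.get()) is not None:
            await q2.put(item * 2)   # stand-in for real work
        await q2.put(None)

    async def sink():
        while (item := await q2.get()) is not None:
            results.append(item)
            budget.release()         # token returned only when the item leaves

    await asyncio.gather(producer(), transform(), sink())
    return results

print(asyncio.run(run_pipeline(range(10))))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

With this shape the per-stage queues no longer need individual tuning; the single `total_limit` is the knob, which is the behaviour I would like out of the box.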
I can't quite figure out whether the TPL Dataflow framework would allow this sort of pipeline?
(Sorry for this being quite abstract; an alternative may very well be that I have to design this from the ground up.)