I understand that in general Java streams do not split. However, we have an involved and lengthy pipeline, at the end of which we have two different types of processing that share the first part of the pipeline.
Due to the size of the data, storing the intermediate stream product is not a viable solution. Neither is running the pipeline twice.
Basically, what we are looking for is a solution that is an operation on a stream that yields two (or more) streams that are lazily filled and able to be consumed in parallel. By that, I mean that if stream A is split into streams B and C, when streams B and C consume 10 elements, stream A consumes and provides those 10 elements, but if stream B then tries to consume more elements, it blocks until stream C also consumes them.
Is there any pre-made solution for this problem or any library we can look at? If not, where would we start to look if we want to implement this ourselves? Or is there a compelling reason not to implement it at all?
You can implement a custom Spliterator in order to achieve such behavior. We will split your streams into the common "source" and the different "consumers". The custom spliterator then forwards the elements from the source to each consumer. For this purpose, we will use a BlockingQueue (see this question).
Note that the difficult part here is not the spliterator/stream, but the syncing of the consumers around the queue, as the comments on your question already indicate. Still, however you implement the syncing, Spliterator helps to use streams with it.
With the approach used, the consumers work on each element in parallel, but wait for each other before starting on the next element.
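To make the idea concrete, here is a minimal sketch of a queue-backed spliterator. It is not a definitive implementation. The names StreamFork, fork, QueueSpliterator and END are hypothetical and chosen only for illustration. The sketch assumes non-null elements and gives each consumer its own queue of capacity 1, so a fast consumer can get at most a couple of elements ahead of a slow one rather than running in strict lock-step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class StreamFork {
    // Hypothetical end-of-stream marker ("poison pill"), compared by identity.
    private static final Object END = new Object();

    /** Spliterator that pulls elements from a queue until END arrives. */
    private static final class QueueSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
        private final BlockingQueue<Object> queue;
        private boolean done;

        QueueSpliterator(BlockingQueue<Object> queue) {
            super(Long.MAX_VALUE, Spliterator.ORDERED); // size unknown, order preserved
            this.queue = queue;
        }

        @Override
        @SuppressWarnings("unchecked")
        public boolean tryAdvance(Consumer<? super T> action) {
            if (done) return false;
            try {
                Object next = queue.take(); // blocks until the source delivers
                if (next == END) { done = true; return false; }
                action.accept((T) next);
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                done = true;
                return false;
            }
        }
    }

    /** Splits source into n lazily filled streams; each must be consumed on its own thread. */
    public static <T> List<Stream<T>> fork(Stream<T> source, int n) {
        List<BlockingQueue<Object>> queues = new ArrayList<>();
        List<Stream<T>> branches = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            BlockingQueue<Object> q = new ArrayBlockingQueue<>(1); // capacity 1 keeps branches close together
            queues.add(q);
            branches.add(StreamSupport.stream(new QueueSpliterator<T>(q), false));
        }
        // The pump thread runs the shared pipeline once and hands every element to every queue.
        Thread pump = new Thread(() -> {
            try {
                for (Iterator<T> it = source.iterator(); it.hasNext(); ) {
                    T item = it.next();
                    for (BlockingQueue<Object> q : queues) q.put(item); // blocks while a branch lags
                }
                for (BlockingQueue<Object> q : queues) q.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pump.setDaemon(true);
        pump.start();
        return branches;
    }

    public static void main(String[] args) {
        List<Stream<Integer>> parts = fork(IntStream.range(0, 1000).boxed(), 2);
        // One thread per branch: running both on a single thread would deadlock.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletableFuture<Integer> sum = CompletableFuture.supplyAsync(
                () -> parts.get(0).mapToInt(Integer::intValue).sum(), pool);
        CompletableFuture<Long> evens = CompletableFuture.supplyAsync(
                () -> parts.get(1).filter(x -> x % 2 == 0).count(), pool);
        System.out.println(sum.join() + " " + evens.join()); // prints "499500 500"
        pool.shutdown();
    }
}
```

The source pipeline runs exactly once, on the pump thread, and nothing beyond one element per queue is buffered. The sketch does nothing special when a branch stops reading early: the pump then blocks on that branch's full queue and the remaining branches stop receiving elements.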
Known issue
If one of the consumers is "shorter" than the others (e.g. because it calls limit()), it will also stop the other consumers and leave the threads hanging.
Example