Why are AtomicInteger-based Stream solutions not recommended?


Say I have this list of fruits:-

List<String> f = Arrays.asList("Banana", "Apple", "Grape", "Orange", "Kiwi");

I need to prepend a serial number to each fruit and print it. The order of fruit or serial number does not matter. So this is a valid output:-

4. Kiwi
3. Orange
1. Grape
2. Apple
5. Banana

Solution #1

AtomicInteger number = new AtomicInteger(0);

String result = f.parallelStream()
        .map(i -> String.format("%d. %s", number.incrementAndGet(), i))
        .collect(Collectors.joining("\n"));

Solution #2

String result = IntStream.rangeClosed(1, f.size())
        .parallel()
        .mapToObj(i -> String.format("%d. %s", i, f.get(i - 1)))
        .collect(Collectors.joining("\n"));

Question

Why is solution #1 a bad practice? I have seen in a lot of places that AtomicInteger-based solutions are bad (like in this answer), especially in parallel stream processing (that's the reason I used parallel streams above, to try to run into issues).

I looked at these questions/answers:-
In which cases Stream operations should be stateful?
Is use of AtomicInteger for indexing in Stream a legit way?
Java 8: Preferred way to count iterations of a lambda?

They just mention (unless I missed something) "unexpected results can occur". Like what? Can it happen in this example? If not, can you provide me an example where it can happen?

As for "no guarantees are made as to the order in which the mapper function is applied", well, that's the nature of parallel processing, so I accept it, and also, the order doesn't matter in this particular example.

AtomicInteger is thread safe, so it shouldn't be a problem in parallel processing.

Can someone provide examples in which cases there will be issues while using such a state-based solution?


BEST ANSWER

Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; if you do not synchronize access to that state, you have a data race and therefore your code is broken, but if you do synchronize access to that state, you risk having contention undermine the parallelism you are seeking to benefit from. The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.

Package java.util.stream, Stateless behaviors

From the perspective of thread-safety and correctness, there is nothing wrong with solution 1. Performance (as an advantage of parallel processing) might suffer, though.


Why is solution #1 a bad practice?

I wouldn't say it's a bad practice or something unacceptable. It's simply not recommended for the sake of performance.

They just mention (unless I missed something) "unexpected results can occur". Like what?

"Unexpected results" is a very broad term, and usually refers to improper synchronisation, "What's the hell just happened?"-like behaviour.

Can it happen in this example?

Not in this case. You are unlikely to run into issues here.

If not, can you provide me an example where it can happen?

Change the AtomicInteger to an int*, replace number.incrementAndGet() with ++number, and you will have one.


*a boxed int (e.g. wrapper-based, array-based) so you can work with it within a lambda
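To make that concrete, here is a minimal sketch (the class name and the `plain` int-array box are illustrative, not from the question) contrasting an unsynchronized counter with an AtomicInteger under a parallel stream:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class LostUpdates {
    public static void main(String[] args) {
        int[] plain = {0};                       // mutable box; ++ is NOT atomic
        AtomicInteger atomic = new AtomicInteger(0);

        IntStream.range(0, 1_000_000)
                .parallel()
                .forEach(i -> {
                    plain[0]++;                  // read-modify-write race: updates can be lost
                    atomic.incrementAndGet();    // atomic: never loses an update
                });

        // atomic.get() is always 1_000_000; plain[0] is frequently less,
        // because concurrent ++ operations overwrite each other's results.
        System.out.println("plain  = " + plain[0]);
        System.out.println("atomic = " + atomic.get());
    }
}
```

Run it a few times: the atomic counter is exact every time, while the plain counter usually comes up short. That lost-update behaviour is the "unexpected result" the linked answers warn about.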

ANSWER

Well, look at the answer from Stuart Marks here – he is using a stateful predicate.

There are a couple of potential problems, but if you don't care about them, or really understand them, you should be fine.

First is ordering, which the current implementation exhibits under parallel processing; but if you don't care about order, as in your example, you are OK.

Second is speed: an AtomicInteger will be several times slower to increment than a simple int. As said, this only matters if you care about that.

Third is more subtle. Sometimes there is no guarantee that map will be executed at all; for example, since java-9:

 someStream.map(i -> /* do something with i and numbers */)
           .count();

The point here is that since you are only counting, there is no need to do the mapping, so it's skipped. In general, the elements that hit some intermediate operation are not guaranteed to reach the terminal one. Imagine a map.filter.map situation: the first map might "see" more elements than the second one, because some elements get filtered out in between. So it's not recommended to rely on side effects in intermediate operations, unless you can reason exactly about what is going on.
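A quick way to observe this (a sketch; the `calls` counter is added here purely to detect whether the mapper runs):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SkippedMap {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger(0);

        long n = List.of("Banana", "Apple", "Grape").stream()
                .map(s -> { calls.incrementAndGet(); return s.toUpperCase(); })
                .count();

        // n is always 3, but since Java 9 the mapper may never be invoked,
        // because map() cannot change the number of elements.
        System.out.println("count = " + n + ", mapper calls = " + calls.get());
    }
}
```

If your numbering logic lived inside that map, it would silently never run, even though the pipeline "succeeds".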

In your example, IMO, you are more than safe to do what you do; but if you slightly change your code, it requires additional reasoning to prove its correctness. I would go with solution 2, just because it's a lot easier for me to understand and it does not have the potential problems listed above.

ANSWER

Case 2 - The API notes for the IntStream.rangeClosed method say it returns a sequential ordered IntStream from startInclusive (inclusive) to endInclusive (inclusive) by an incremental step of 1 - a kind of for loop. Each element of that stream already carries its own index, so even when the stream is processed in parallel, every serial number stays paired with the right fruit.

/**
 * Returns a sequential ordered {@code IntStream} from {@code startInclusive}
 * (inclusive) to {@code endInclusive} (inclusive) by an incremental step of 1.
 *
 * @param startInclusive the (inclusive) initial value
 * @param endInclusive the inclusive upper bound
 * @return a sequential {@code IntStream} for the range of {@code int}
 *         elements
 */
public static IntStream rangeClosed(int startInclusive, int endInclusive) {

Case 1 - The list is processed in parallel, so the order of side effects is not deterministic. Because the mapping operation is performed in parallel, the results for the same input can vary from run to run due to thread scheduling: there is no guarantee that different operations on the "same" element within the same stream pipeline are executed in the same thread, nor any guarantee of the order in which the mapper function is applied to particular elements. So which serial number ends up attached to which fruit can differ between runs.

Source Java Doc
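To make the nondeterminism in Case 1 concrete, here is a sketch (class name is illustrative) showing that solution #1 always produces each serial number exactly once, while the number-to-fruit pairing depends on thread scheduling:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class Pairing {
    public static void main(String[] args) {
        List<String> f = Arrays.asList("Banana", "Apple", "Grape", "Orange", "Kiwi");
        AtomicInteger number = new AtomicInteger(0);

        List<String> lines = f.parallelStream()
                .map(s -> number.incrementAndGet() + ". " + s)
                .collect(Collectors.toList());

        // Each of 1..5 appears exactly once and each fruit appears exactly once,
        // but which number lands on which fruit can change between runs.
        lines.forEach(System.out::println);
    }
}
```

That is exactly why the question's sample output is "valid" despite being shuffled: the counter guarantees uniqueness, not assignment.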