Use Spark coalesce without decreasing the parallelism of earlier operations


Let’s say you had a parallelism of 1000, but you only wanted to write 10 files at the end:

load().map(…).filter(…).coalesce(10).save()
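
For concreteness, a rough sketch of such a pipeline (the paths, column access and the Scala/Dataset API are just illustrative, not my actual job):

import org.apache.spark.sql.SparkSession

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()
    import spark.implicits._

    spark.read.parquet("/data/input")          // load(): hypothetical path, assume ~1000 partitions
      .map(row => row.getString(0).trim)       // the map(…) step, on the first column
      .filter(_.nonEmpty)                      // the filter(…) step
      .coalesce(10)                            // intended: only the write should use 10 partitions
      .write.parquet("/data/output")           // save(): hypothetical output path
  }
}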

However, Spark effectively pushes the coalesce down to as early a point as possible: since coalesce creates no shuffle/stage boundary, the whole chain runs in a single 10-task stage, so this executes as if it were:

load().coalesce(10).map(…).filter(…).save()
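
You can observe this collapse directly. Here is a small sketch (again with a made-up path and a made-up value column) that checks the partition counts and the physical plan:

import org.apache.spark.sql.SparkSession

object CoalescePushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-demo").getOrCreate()

    // Hypothetical input path; assume it loads with roughly 1000 partitions.
    val upstream = spark.read.parquet("/data/input")
    println(upstream.rdd.getNumPartitions)     // e.g. 1000

    // coalesce is a narrow transformation: no shuffle boundary is inserted,
    // so the filter ends up in the same 10-task stage as the write.
    val narrowed = upstream.filter("value IS NOT NULL").coalesce(10)
    println(narrowed.rdd.getNumPartitions)     // 10

    // The physical plan shows a Coalesce node with no Exchange above it,
    // i.e. everything runs in a single stage.
    narrowed.explain()
  }
}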

I know I can use an action (between the transformations and the coalesce) to work around this behavior, but I don't want to perform an extra action.
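
For reference, the workaround I'm referring to looks roughly like this (made-up paths and column again): persist the transformed data, trigger an action so the transformations run at full parallelism, and only then coalesce the cached result for the write:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ActionWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("action-workaround").getOrCreate()

    // Persist the transformed data so the action below materialises it
    // at the full upstream parallelism.
    val transformed = spark.read.parquet("/data/input")
      .filter("value IS NOT NULL")
      .persist(StorageLevel.MEMORY_AND_DISK)

    transformed.count()            // extra action: the filter runs with ~1000 tasks here

    transformed.coalesce(10)       // only the already-computed cached partitions are merged
      .write.parquet("/data/output")

    transformed.unpersist()
  }
}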

The question is: how can I write fewer files without reducing the parallelism of the earlier operations (except, of course, for the file-writing step itself)?

BTW, I know I can use repartition, but I'd prefer a shuffle-free approach like coalesce, since avoiding the shuffle is faster.
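
For comparison, the repartition version would look roughly like this (same made-up paths); it keeps the upstream parallelism because the shuffle creates a stage boundary, but pays the cost of shuffling the data to be written:

import org.apache.spark.sql.SparkSession

object RepartitionAlternative {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-alternative").getOrCreate()

    // repartition(10) inserts a shuffle (Exchange) boundary, so the filter
    // still runs with the original ~1000 tasks and only the post-shuffle
    // stage writes with 10 tasks, at the price of shuffling the output data.
    spark.read.parquet("/data/input")
      .filter("value IS NOT NULL")
      .repartition(10)
      .write.parquet("/data/output")
  }
}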
