Spark number of input partitions vs number of reading tasks


Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?

I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I ran two tests:

  • test 1 - I loaded this dataset using 2 executors, 2 cores each
  • test 2 - I loaded this dataset using 4 executors, 2 cores each

Results:

  • test 1 - Spark created 5 tasks to read the data (each task loaded ~18 MB)
  • test 2 - Spark created 7 tasks to read the data (each task loaded ~13 MB)

I don't see any pattern here. Spark somehow reduces the number of partitions, but by what rule? Could someone help?
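For reference, here is a minimal spark-shell sketch of how the partition count of the scan, and the reader settings that can influence it, can be inspected; the path and Parquet format are placeholders, not the actual dataset.

```scala
// spark-shell sketch -- `spark` is the SparkSession the shell provides.
// The path and format below are placeholders for the real 91 MB dataset.
val df = spark.read.parquet("/data/my_dataset")

// Number of partitions Spark created for the scan; this is also the
// number of tasks in the read stage.
println(s"Input partitions: ${df.rdd.getNumPartitions}")

// File-source reader settings that can influence how file splits are
// packed into partitions (shown with their usual defaults).
println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // 128 MB by default
println(spark.conf.get("spark.sql.files.openCostInBytes"))   // 4 MB by default
println(s"Default parallelism: ${spark.sparkContext.defaultParallelism}")
```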

1 Answer
Spark would need to create a total of 14 tasks to process a file with 14 partitions. Within each stage, each task is assigned to exactly one partition.
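To make that task-to-partition assignment visible, here is a small sketch, assuming the dataset has already been loaded into a DataFrame `df` (e.g. with `spark.read.parquet(...)`); it counts how many rows each partition, and therefore each task, handles:

```scala
// Each partition index corresponds to exactly one task in the stage.
// This counts how many rows each task (i.e. each partition) processes.
val perPartitionCounts = df.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()

perPartitionCounts.foreach { case (idx, n) =>
  println(s"partition $idx -> $n rows")
}
```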

Now, if you provide more resources, Spark will run more of those tasks in parallel, so you will see more tasks start at once when processing begins. As tasks finish, a new set of tasks is scheduled, depending on the resources you have provided. Overall, Spark will still launch 14 tasks to process the file.
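As a back-of-the-envelope illustration of those "waves" of tasks, using the resource settings from test 2 and assuming one task per core:

```scala
// Rough arithmetic for test 2: 4 executors x 2 cores, 14 partitions.
val numExecutors  = 4
val coresPerExec  = 2
val numPartitions = 14

val taskSlots = numExecutors * coresPerExec                       // 8 tasks can run at once
val waves     = math.ceil(numPartitions.toDouble / taskSlots).toInt

println(s"$taskSlots concurrent tasks -> $numPartitions tasks finish in $waves waves") // 2 waves
```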

Spark won't reduce the number of partitions of the file unless you repartition it or call coalesce.
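For completeness, a minimal sketch of the two operations mentioned above that do change the partition count (the target counts here are arbitrary):

```scala
// Assuming `df` is the DataFrame loaded earlier.
val fewer = df.coalesce(4)          // narrow dependency: merges partitions, no shuffle
println(fewer.rdd.getNumPartitions) // 4

val more = df.repartition(28)       // full shuffle: can increase or decrease partitions
println(more.rdd.getNumPartitions)  // 28
```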