Spark number of input partitions vs number of reading tasks


Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?

I have a dataset (91 MB) that is divided into 14 partitions (~6.5 MB each). I ran two tests:

  • test 1 - I loaded this dataset using 2 executors, 2 cores each
  • test 2 - I loaded this dataset using 4 executors, 2 cores each

Results:

  • test 1 - Spark created 5 tasks to read the data (each task loaded ~18 MB)
  • test 2 - Spark created 7 tasks to read the data (each task loaded ~13 MB)

I don't see any pattern here. Spark somehow reduces the number of partitions, but by what rule? Could someone help?
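For reference, here is a minimal spark-shell sketch of how the partition count of the scan, and the reader settings that can influence it, can be inspected; the path and Parquet format are placeholders, not the actual dataset.

```scala
// spark-shell sketch -- `spark` is the SparkSession the shell provides.
// The path and format below are placeholders for the real 91 MB dataset.
val df = spark.read.parquet("/data/my_dataset")

// Number of partitions Spark created for the scan; this is also the
// number of tasks in the read stage.
println(s"Input partitions: ${df.rdd.getNumPartitions}")

// File-source reader settings that can influence how file splits are
// packed into partitions (shown with their usual defaults).
println(spark.conf.get("spark.sql.files.maxPartitionBytes")) // 128 MB by default
println(spark.conf.get("spark.sql.files.openCostInBytes"))   // 4 MB by default
println(s"Default parallelism: ${spark.sparkContext.defaultParallelism}")
```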

1 Answer
Spark would need to create a total of 14 tasks to process a file with 14 partitions. Within each stage, each task is assigned to exactly one partition.
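To make that task-to-partition assignment visible, here is a small sketch, assuming the dataset has already been loaded into a DataFrame `df` (e.g. with `spark.read.parquet(...)`); it counts how many rows each partition, and therefore each task, handles:

```scala
// Each partition index corresponds to exactly one task in the stage.
// This counts how many rows each task (i.e. each partition) processes.
val perPartitionCounts = df.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()

perPartitionCounts.foreach { case (idx, n) =>
  println(s"partition $idx -> $n rows")
}
```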

Now, if you provide more resources, Spark will run more of those tasks in parallel, so you will see more tasks start at once when processing begins. As tasks finish, a new set of tasks is scheduled, depending on the resources you have provided. Overall, Spark will still launch 14 tasks to process the file.
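As a back-of-the-envelope illustration of those "waves" of tasks, using the resource settings from test 2 and assuming one task per core:

```scala
// Rough arithmetic for test 2: 4 executors x 2 cores, 14 partitions.
val numExecutors  = 4
val coresPerExec  = 2
val numPartitions = 14

val taskSlots = numExecutors * coresPerExec                       // 8 tasks can run at once
val waves     = math.ceil(numPartitions.toDouble / taskSlots).toInt

println(s"$taskSlots concurrent tasks -> $numPartitions tasks finish in $waves waves") // 2 waves
```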

Spark won't reduce the number of partitions of the file unless you repartition it or call coalesce.
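For completeness, a minimal sketch of the two operations mentioned above that do change the partition count (the target counts here are arbitrary):

```scala
// Assuming `df` is the DataFrame loaded earlier.
val fewer = df.coalesce(4)          // narrow dependency: merges partitions, no shuffle
println(fewer.rdd.getNumPartitions) // 4

val more = df.repartition(28)       // full shuffle: can increase or decrease partitions
println(more.rdd.getNumPartitions)  // 28
```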