How to decide the number of executors for 1 billion rows in Spark


We have a table with 1.355 billion rows and 20 columns.

We want to join this table with another table that has more or less the same number of rows.

How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?

How do we decide the number of executors and their resource allocation (cores, memory)?

How do we find out how much storage those 1.355 billion rows will take in memory?

1 Answer

Accepted answer:

Like @samkart says, you have to experiment to figure out the best parameters, since they depend on the size and nature of your data. The Spark tuning guide is a good reference.
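To put a number on how much space the data takes (your third question), you can ask Spark for table statistics. A minimal PySpark sketch, assuming the data is registered as a catalog table named big_table (a hypothetical name):

    # Collect table-level statistics (total size in bytes and row count).
    spark.sql("ANALYZE TABLE big_table COMPUTE STATISTICS")

    # The 'Statistics' row shows the size estimate the optimizer works with.
    spark.sql("DESCRIBE TABLE EXTENDED big_table") \
        .filter("col_name = 'Statistics'") \
        .show(truncate=False)

Keep in mind this is the serialized/on-disk estimate; the deserialized in-memory footprint after caching is usually a few times larger, which you can check on the Spark UI's Storage tab after caching a sample of the table.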

Here are some things that you may want to tweak:

  1. spark.executor.cores is 1 by default (on YARN and Kubernetes), but you should look to increase it to improve parallelism within each executor. A common rule of thumb is 5 cores per executor (see the first sketch after this list).
  2. spark.sql.files.maxPartitionBytes determines the maximum amount of data packed into a single partition when reading files, and hence the initial number of partitions. The default is 128 MB, matching the typical HDFS block size; you can lower it for more read partitions (second sketch below).
  3. spark.sql.shuffle.partitions is 200 by default; tweak it based on the shuffle data size and the total number of cores (last sketch below).
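For point 1, executor sizing has to be fixed when the application starts (the spark.executor.* settings cannot be changed with spark.conf.set on a running session). A hedged sketch of what that might look like when building the session, with placeholder numbers you would replace based on your cluster:

    from pyspark.sql import SparkSession

    # Placeholder values for illustration only; total parallelism is roughly
    # instances * cores, and memory per executor should leave headroom for
    # overhead on each node.
    spark = (
        SparkSession.builder
        .appName("billion-row-join")
        .config("spark.executor.cores", "5")          # rule of thumb from point 1
        .config("spark.executor.memory", "20g")       # assumed ~4 GB per core
        .config("spark.executor.memoryOverhead", "2g")
        .config("spark.executor.instances", "20")     # assumed cluster capacity
        .getOrCreate()
    )

If dynamic allocation is enabled on your cluster, spark.dynamicAllocation.maxExecutors takes the place of a fixed spark.executor.instances.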
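For point 2, lowering spark.sql.files.maxPartitionBytes gives you more (smaller) read partitions for splittable formats such as Parquet. A small sketch, with a hypothetical input path:

    # Halving the 128 MB default roughly doubles the number of scan partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB

    df = spark.read.parquet("/data/big_table")   # hypothetical path
    print(df.rdd.getNumPartitions())             # higher than with the 128 MB default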
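For point 3, a common back-of-the-envelope is to target roughly 100-200 MB of shuffle data per partition and divide your estimated data size by that. The row-width number below is purely an assumption for a 20-column table; use a measured size (e.g. from ANALYZE TABLE above) instead:

    # Rough sizing sketch; every constant here is an assumption.
    rows = 1_355_000_000
    est_bytes_per_row = 200                      # assumed average width of 20 columns
    target_partition_bytes = 200 * 1024 * 1024   # ~200 MB per shuffle partition

    shuffle_partitions = max(200, rows * est_bytes_per_row // target_partition_bytes)
    print(shuffle_partitions)                    # ~1292 with these assumptions

    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))

On Spark 3.x, enabling adaptive query execution (spark.sql.adaptive.enabled together with spark.sql.adaptive.coalescePartitions.enabled) lets Spark coalesce shuffle partitions at runtime, which makes this setting much less sensitive.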