How can I reduce the number of tasks when I run a Spark job?


Here are my Spark job's stages: [screenshot of the Spark UI stage list]

The job has 260,000 tasks because it reads more than 200,000 small HDFS files, each about 50 MB and stored in gzip format.
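
For reference, the read is roughly like this (a minimal sketch; the HDFS path and the use of the DataFrame API are my assumptions, the real job may differ):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-small-gzip-files")
  .getOrCreate()

// Hypothetical input path; the real job reads ~200,000 .gz files of ~50 MB each.
// Gzip files are not splittable, so Spark creates at least one input split
// (and therefore one task) per file, hence the ~260,000 tasks.
val df = spark.read.text("hdfs:///data/events/*.gz")
println(s"partitions: ${df.rdd.getNumPartitions}")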

I tried the following settings to reduce the number of tasks, but they had no effect:

...
--conf spark.sql.mergeSmallFileSize=10485760 \
--conf spark.hadoopRDD.targetBytesInPartition=134217728 \
--conf spark.hadoopRDD.targetBytesInPartitionInMerge=134217728 \
...

Is it because the files are gzip-compressed that they cannot be merged?
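
For what it's worth, gzip is non-splittable (each .gz file must be read whole by a single task), though as far as I understand that by itself should not prevent packing several whole files into one partition. A quick check that the codec is indeed non-splittable:

import org.apache.hadoop.io.compress.{GzipCodec, SplittableCompressionCodec}

// GzipCodec does not implement SplittableCompressionCodec, so Hadoop
// never splits a .gz file across input splits.
val splittable = classOf[SplittableCompressionCodec]
  .isAssignableFrom(classOf[GzipCodec])
println(s"gzip splittable: $splittable")  // prints: gzip splittable: false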

What can I do to reduce the number of tasks in this job?
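
One workaround I am considering is coalescing right after the read, so that several whole files land in one task (a sketch; the paths and the target count of 2000 are placeholders, not tuned values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("merge-small-gzip-files")
  .getOrCreate()

// coalesce() narrows the read stage without a full shuffle: each of the
// 2,000 output partitions reads many whole .gz files sequentially.
val merged = spark.read.text("hdfs:///data/events/*.gz").coalesce(2000)
merged.write.parquet("hdfs:///data/events_merged")

Would that be a reasonable approach, or is there a setting that makes the merge configs above work for gzip input?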
