Is writing to a database done by the driver or the executors in a Spark cluster?


I have a Spark cluster set up with 1 master node and 2 worker nodes. I am running a PySpark application in this Spark standalone cluster, where I have a job that writes the transformed data into a MySQL database.

So, my question is: is writing to the database done by the driver or the executors? I am asking because when writing to a text file, it seems to be done by the driver, since my output file gets created on the driver.

Updated

Adding below the code I used to write to a text file:

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    sc = SparkContext(master="spark://IP:PORT", appName="word_count_application")
    words = sc.textFile("book_2.txt")
    # Split lines into words, pair each word with 1, and sum the counts per word
    word_count = words.flatMap(lambda a: a.split(" ")).map(lambda a: (a, 1)).reduceByKey(lambda a, b: a + b)
    # Writes one part file per partition into the book2_output.txt directory
    word_count.saveAsTextFile("book2_output.txt")

2 Answers


If the writing is done using the Dataset/DataFrame API, like this:

df.write.csv("...")

then it's done by the executors. That's why in Spark we get multiple files in the output: each executor writes the partitions assigned to it.

The driver is used for scheduling work across the executors, not for doing the actual work (reading, transforming, and writing), which is done by the executors.
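The same rule applies to the MySQL case in your question: a DataFrame JDBC write is executed by the executors, with each executor opening its own connection to write the partitions it holds. A minimal sketch (the host, database, table, and credentials below are placeholders, and the MySQL connector jar must be available on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql_write_example").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "cluster")], ["id", "word"])

# The write runs on the executors; the driver only schedules the tasks.
(df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://HOST:3306/DB")        # placeholder host/database
    .option("dbtable", "word_counts")                  # placeholder table name
    .option("user", "USER")                            # placeholder credentials
    .option("password", "PASSWORD")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("append")
    .save())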


saveAsTextFile() is distributed as well: each executor writes its own files. Your driver will never write any files since, as @Abdennacer Lachiheb already mentioned, it is responsible for scheduling, the Spark UI, and more.

Your path refers to a local file system, so your files are not getting saved on your driver, but on the machine your driver runs on. The path could also point to an object store like S3 or a distributed file system like HDFS.
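To make that concrete, here is a sketch of the same save call against shared storage, so the output is reachable from any node rather than scattered across local disks (the namenode host and bucket name are assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="save_to_shared_storage")
# Stand-in for the word_count RDD from the question
word_count = sc.parallelize([("spark", 3), ("cluster", 2)])

# Each executor writes its partitions directly to HDFS
word_count.saveAsTextFile("hdfs://NAMENODE:8020/output/book2_output")

# The same call works against S3, given the s3a connector on the classpath:
# word_count.saveAsTextFile("s3a://BUCKET/book2_output")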