Naming the CSV file in write.df


I am writing a file in SparkR using write.df, but I am unable to specify the file name:

Code:

write.df(user_log0, path = "Output/output.csv",
         source = "com.databricks.spark.csv", 
         mode = "overwrite",
         header = "true")

Problem:

I expect a file called 'output.csv' inside the 'Output' folder, but instead I get a folder called 'output.csv' containing a file called 'part-00000-6859b39b-544b-4a72-807b-1b8b55ac3f09.csv'.

What am I doing wrong?

P.S: R 3.3.2, Spark 2.1.0 on OSX


Accepted answer:

Because of Spark's distributed nature, you can only specify the directory into which the files are saved; each executor writes its own file into that directory using Spark's internal naming convention.

If you see only a single file, your data has a single partition, so only one executor is writing. This is not typical Spark behavior; however, if it fits your use case, you can collect the result into a local R data frame and write the CSV from there.
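A minimal SparkR sketch of that workaround, assuming an active Spark session and that the data is small enough to fit in the driver's memory:

```r
# Collect the distributed SparkDataFrame into a local R data.frame
# (only safe when the data fits in the driver's memory)
local_df <- collect(user_log0)

# Write a single CSV with exactly the name you want
write.csv(local_df, file = "Output/output.csv", row.names = FALSE)
```

This sidesteps Spark's distributed writer entirely, so it only makes sense for small results.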

In the more general case, where the data is distributed across multiple executors, you cannot set a specific name for the output files.
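When single-partition output is acceptable, another common workaround is to let Spark write its directory as usual and then move the lone part file to the desired name from the shell. A sketch of that cleanup step (the directory layout below is simulated for illustration; the part-file hash differs on every run, hence the wildcard):

```shell
# Simulate the layout Spark produces: a directory named like the "file",
# containing a single part file (the hash shown here is illustrative)
mkdir -p Output/output.csv
printf 'user,count\nalice,3\n' > Output/output.csv/part-00000-example.csv

# Move the single part file up to the name we actually wanted
mv Output/output.csv/part-00000-*.csv Output/output.tmp
rm -r Output/output.csv
mv Output/output.tmp Output/output.csv

cat Output/output.csv
```

Note the wildcard only works safely when there is exactly one part file, i.e. one partition.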