How to write same number of files as Spark partitions


I'm trying to write from Spark into a single file on S3. Doing something like this

dataframe.repartition(1)
      .write
      .option("header", "true")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("maxRecordsPerFile", batchSize)
      .option("delimiter", delimiter)
      .option("quote", quote)
      .format(format)
      .mode(SaveMode.Append)
      .save(tempDir)

Since I'm forcing the partition count to 1 before writing (I also tried coalesce), I expected a single output file to be written. But that's not what happens: instead I get as many files as there were partitions before the write.


How can I make sure that there is a single output file on S3?

1 Answer

Answered by bachr:

It turns out the maxRecordsPerFile option was causing this: Spark splits each partition's output into multiple files once a file reaches that record limit, so even a single partition can produce several files. Removing the option, I got one final file.
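A minimal sketch of the corrected write, based on the snippet in the question (names like dataframe, tempDir, delimiter, quote, and format are the question's own variables; the point is simply that the maxRecordsPerFile option is gone):

```scala
// repartition(1) collapses the DataFrame to a single partition, and
// without maxRecordsPerFile Spark writes one part file per partition.
dataframe.repartition(1)
  .write
  .option("header", "true")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("delimiter", delimiter)
  .option("quote", quote)
  .format(format)
  .mode(SaveMode.Append)
  .save(tempDir)
```

Note that because SaveMode.Append is used, each run still adds a new part file to tempDir; "one file" here means one file per write, not one file total in the directory.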