How to write same number of files as Spark partitions


I'm trying to write from Spark into a single file on S3. Doing something like this

dataframe.repartition(1)
      .write
      .option("header", "true")
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("maxRecordsPerFile", batchSize)
      .option("delimiter", delimiter)
      .option("quote", quote)
      .format(format)
      .mode(SaveMode.Append)
      .save(tempDir)

Since I'm forcing the partition count to 1 before writing (I also tried coalesce), I expected a single output file to be written. But that's not what happens: instead I get as many files as there were partitions before the write.


How can I make sure that there is a single output file on S3?

1 Answer

Answered by bachr:

It turns out the maxRecordsPerFile option was causing this: Spark splits each partition's output into multiple files once a file reaches that record limit, so even a single partition can produce several files. Removing the option, I got one final file.
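A minimal sketch of the corrected write, based on the snippet in the question (names like dataframe, tempDir, delimiter, quote, and format are the question's own variables; the point is simply that the maxRecordsPerFile option is gone):

```scala
// repartition(1) collapses the DataFrame to a single partition, and
// without maxRecordsPerFile Spark writes one part file per partition.
dataframe.repartition(1)
  .write
  .option("header", "true")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("delimiter", delimiter)
  .option("quote", quote)
  .format(format)
  .mode(SaveMode.Append)
  .save(tempDir)
```

Note that because SaveMode.Append is used, each run still adds a new part file to tempDir; "one file" here means one file per write, not one file total in the directory.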