Why so many Parquet files created? Can we not limit Parquet output files?


Why are so many Parquet files created in Spark SQL? Can we not limit the number of Parquet output files?


In general, when you write to Parquet, Spark writes one file (or more, depending on various options) per partition. If you want to reduce the number of files, you can call coalesce on the DataFrame before writing, e.g.:

df.coalesce(20).write.parquet(filepath)

Of course, if you use options such as partitionBy, the number of files can increase dramatically.
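
For illustration, a minimal sketch of how partitionBy multiplies file counts, assuming a hypothetical "country" column with many distinct values:

// Hypothetical: with 20 DataFrame partitions and, say, 50 distinct values in
// "country", this can produce up to 20 * 50 files, one per (partition, value)
// combination that actually contains rows.
df.coalesce(20)
  .write
  .partitionBy("country")
  .parquet(filepath)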

Also note that coalescing to a very small number of partitions can make the job very slow, both because data has to be copied between partitions and because parallelism drops if the number gets small enough. You might also get OOM errors if the data in a single partition becomes too large (when you coalesce, the remaining partitions naturally get bigger).
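
If the write becomes slow or memory-bound after coalesce, one common alternative (not mentioned above) is repartition, which shuffles the data but keeps the preceding stages fully parallel and produces roughly even partitions:

// coalesce(20) merges existing partitions without a shuffle, which can reduce
// the parallelism of the stages feeding the write and leave partitions uneven.
df.coalesce(20).write.parquet(filepath)

// repartition(20) does a full shuffle: more expensive up front, but it keeps
// upstream parallelism and yields roughly equal-sized partitions.
df.repartition(20).write.parquet(filepath)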

A couple of things to note:

  • saveAsParquetFile has been deprecated since version 1.4.0. Use write.parquet(path) instead.
  • Depending on your use case, searching for a specific string in the Parquet files themselves might not be the most efficient approach; see the sketch after this list.
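
On the last point, a rough sketch of filtering with the DataFrame API instead of searching the files directly, so Spark can push the predicate down to the Parquet reader; spark is assumed to be an active SparkSession and "name" is a hypothetical column used only for illustration:

import org.apache.spark.sql.functions.col

// Read the Parquet output back and filter on a column; the predicate can be
// pushed down so row groups that cannot match are skipped.
val matches = spark.read.parquet(filepath)
  .filter(col("name") === "some string")
matches.show()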