Why are so many Parquet files created in Spark SQL? Can we limit the number of Parquet output files?
In general, writing to Parquet produces one file (or more, depending on various options) per partition of the DataFrame. If you want to reduce the number of files, you can call coalesce on the DataFrame before writing, e.g.:
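A minimal sketch of what that looks like (the DataFrame contents and output path here are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-example").getOrCreate()

# Illustrative DataFrame; in practice this would be your own data.
df = spark.range(1_000_000)

# coalesce(1) merges all partitions into a single partition, so the
# write produces a single Parquet file (plus the _SUCCESS marker).
df.coalesce(1).write.mode("overwrite").parquet("/tmp/output")
```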
A couple of things to note:

- If you use other options (e.g. partitionBy), the number of files can still increase dramatically (see the sketch after this list).
- Coalescing to a very small number of partitions can become very slow, both because data must be copied between partitions and because parallelism is reduced if the number is small enough. You may also get OOM errors if the data in a single partition is too large, since coalescing naturally makes each partition bigger.
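To illustrate the partitionBy point, here is a hedged sketch (the column name and path are made up for illustration): each distinct value of the partition column gets its own directory, and each directory can hold up to one file per in-memory partition, so file counts multiply.

```python
# Hypothetical data with a "country" column (values made up for illustration).
df = spark.createDataFrame(
    [("US", 1), ("US", 2), ("IN", 3)],
    ["country", "value"],
)

# Each distinct country value becomes a directory (country=US/, country=IN/, ...),
# and each directory can contain up to one Parquet file per in-memory partition,
# so total file count is roughly (partitions) x (distinct partition values).
df.write.partitionBy("country").mode("overwrite").parquet("/tmp/by_country")
```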