If I repartition by a column, does Spark understand that it is partitioned by that column when it is read back?


I have a requirement where I have a huge dataset of over 2 trillion records, which comes as the result of a join. After this join, I need to aggregate on a column (the 'id' column) and get a list of distinct names per id (collect_set('name')).

  1. Now, while saving the join result, if I repartition it on the 'id' field, will I get any benefit? i.e. joined_df.repartition('id').write.parquet(path)

  2. If I read the above repartitioned df back, will Spark understand that it is already partitioned on the id field, so that when I group by id, performance is greatly improved? (I've sketched both steps below.)
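
To make this concrete, here is a sketch of the two steps; joined_df, path, and spark stand for my join result, output location, and SparkSession:

from pyspark.sql import functions as F

# Step 1: repartition the join output on 'id' before writing.
joined_df.repartition('id').write.parquet(path)

# Step 2: read it back and aggregate. Ideally this would not need another
# shuffle if Spark knew the data was already partitioned by 'id'.
df = spark.read.parquet(path)
result = df.groupBy('id').agg(F.collect_set('name').alias('names'))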


BEST ANSWER

If the id column is unique, then repartitioning by it just adds huge overhead, since each partition would contain only one record; so I'm assuming that is not the case!

Calling repartition('id') will create partitions based on the id column, but Spark will not know that the data is already partitioned on it when the data is read back.
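
You can check this yourself: read the data back and inspect the plan for the aggregation, and you will still see an Exchange (shuffle) step, because the Parquet source carries no partitioning guarantee Spark can rely on. A quick check, using the path from the question:

from pyspark.sql import functions as F

df = spark.read.parquet(path)
# The physical plan still contains an 'Exchange hashpartitioning(id, ...)'
# node: Spark re-shuffles because the Parquet files carry no partitioning
# guarantee it can trust.
df.groupBy('id').agg(F.collect_set('name')).explain()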

If the data for each id can fit in one partition, I'd say you could try to:

  • Repartition by the column and a number of partitions of 1, to make sure each id ends up in one partition only.
  • Read the saved data back; since each partition (logically) contains the data of one id, you can avoid the extra group by and map the partitions directly.

Example:

# Write with all rows of each id confined to a single partition.
joined_df.repartition(1, 'id').write.parquet(path)
...
# Read back and process each partition directly, skipping the group by.
spark.read.parquet(path).rdd.mapPartitions(FUN).toDF(['id', ...])
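
FUN is left abstract above. One possible shape for it, assuming each row carries 'id' and 'name' fields and that the repartitioning has placed all rows of a given id in the same partition, would be:

def FUN(rows):
    # All rows for a given id live in this partition, so a local dict is
    # enough to collect the distinct names per id without any shuffle.
    names_by_id = {}
    for row in rows:
        names_by_id.setdefault(row['id'], set()).add(row['name'])
    for id_value, names in names_by_id.items():
        yield (id_value, list(names))

result = (spark.read.parquet(path)
          .rdd
          .mapPartitions(FUN)
          .toDF(['id', 'names']))

Note this keeps everything in a partition in memory, so it only works if each partition's data fits on one executor.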