Spark even data distribution


I am trying to solve a skewed-data problem in a DataFrame. I have introduced a new column based on a bin-packing algorithm that should distribute the data evenly among the bins (partitions, in my case). My target bin size is 500,000 rows, and I have assigned each row the bin number it should belong to. Bin numbers range from 1 to 282. Let's say the column name is key.

Ideally, when I repartition the DataFrame on the column key, it should distribute the data evenly across 282 partitions, each containing around 500,000 records.

+-----+------+
|key  |count |
+-----+------+
|1    |495941|
|2    |499607|
|3    |498896|
|4    |502845|
|5    |498213|
|6    |501325|
|7    |502355|
|8    |501816|
|9    |498829|
|10   |498272|
|11   |499802|
|12   |501580|
|13   |498779|
|14   |498654|
...
...
|282  |491258|
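The repartition step is roughly the sketch below (df and key are the names from above). As far as I understand, Spark routes each row by pmod(hash(key), numPartitions) rather than sending it to partition key directly; the small pmod helper is a pure-Python stand-in for that routing step:

```python
# PySpark sketch (assumed names: df is my DataFrame, "key" the bin column):
# from pyspark.sql import functions as F
# df = df.repartition(282, F.col("key"))
#
# As far as I understand, Spark routes each row to pmod(hash(key), 282)
# rather than to partition `key` directly. pmod keeps the result
# non-negative even for negative hashes (Python's % already does this
# for a positive modulus, but Java's % does not):
def pmod(a, n):
    return ((a % n) + n) % n

print(pmod(-7, 282))   # 275
print(pmod(569, 282))  # 5
```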

But some partitions still contain multiple keys. For example, keys 101 and 115 end up combined into one partition, which is unexpected behavior to me.

+----+------+
|key |count |
+----+------+
|101 |500014|
|115 |504995|
+----+------+
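If I understand correctly, this is hash partitioning at work: repartition(n, col) places a row by pmod(hash(key), n), and nothing prevents two distinct keys from hashing to the same bucket while other buckets stay empty. A pure-Python illustration of the pigeonhole effect (md5 here as a stand-in for Spark's Murmur3 hash, so the exact buckets differ, but the effect is the same):

```python
import hashlib

def bucket(key, n=282):
    # md5 as a stand-in for Spark's Murmur3 hash; illustrative only
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % n

# Hash 282 distinct keys into 282 buckets: some collide, so some
# buckets stay empty and fewer than 282 buckets are occupied.
occupied = {bucket(k) for k in range(1, 283)}
print(len(occupied))  # noticeably fewer than 282

# To see which partition each key actually landed in, a diagnostic
# like this should work in PySpark (hedged, not verified here):
# from pyspark.sql import functions as F
# (df.repartition(282, F.col("key"))
#    .withColumn("pid", F.spark_partition_id())
#    .groupBy("key", "pid").count().show(300))
```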

If I write a custom partitioner, I have to convert the DataFrame to an RDD and operate on a pair RDD keyed by the key column. But the key column has duplicates, and if I groupBy first, multiple records get combined, which breaks the logic for repartitioning the data.
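For reference, here is a hedged sketch of what the custom-partitioner route might look like (assumed names: df, key, spark). Note that a pair RDD does not require unique keys, so keyBy preserves every duplicate row and no groupBy should be needed:

```python
# Hedged sketch of the custom-partitioner route (assumed names:
# df is the DataFrame, "key" the 1..282 bin column, spark the session).
#
# keyBy preserves duplicate keys -- a pair RDD does not need them to
# be unique -- so each record is routed independently, no groupBy:
#
# pair_rdd = df.rdd.keyBy(lambda row: row["key"])
# routed   = pair_rdd.partitionBy(282, bin_to_partition)
# df2      = spark.createDataFrame(routed.values(), df.schema)

def bin_to_partition(key, num_partitions=282):
    """Identity-style partitioner: send bin k (1..282) to partition k - 1."""
    if not 1 <= key <= num_partitions:
        raise ValueError(f"bin {key} out of range 1..{num_partitions}")
    return key - 1

print(bin_to_partition(1), bin_to_partition(282))  # 0 281
```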

It would be great if someone could explain this strange behavior of repartition and help me fix it.
