Hi, I have a question about partitioning in Spark. In the Learning Spark book, the authors say that partitioning can be useful, for example during PageRank (page 66), and they write:
"since links is a static dataset, we partition it at the start with partitionBy(), so that it does not need to be shuffled across the network"
Now I'm focusing on this example, but my questions are general:
- Why doesn't a partitioned RDD need to be shuffled?
- partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
- Could someone give a concrete example of what happens on each node when partitionBy runs?
Thanks in advance
When the author does:
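The code in question is, roughly, the PageRank setup from the book (reconstructed from the surrounding discussion; the file path and variable names are illustrative):

```scala
// Reconstructed sketch of the Learning Spark PageRank setup;
// the path and variable names are illustrative, not quoted from the book.
val links = sc.objectFile[(String, Seq[String])]("links")
  .partitionBy(new HashPartitioner(100)) // hash each pageId into one of 100 partitions
  .persist()                             // keep the partitioned data around across iterations
```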
he's partitioning the data set into 100 partitions, where each key (pageId in the given example) is hashed to a given partition. This means that all records with the same key will be stored in a single partition. Then, when he does the join, all chunks of data with the same pageId should already be located on the same executor, avoiding the need for a shuffle between different nodes in the cluster.

Yes, partitionBy produces a ShuffledRDD[K, V, V]:
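The implementation in Spark's PairRDDFunctions looks roughly like this (paraphrased from the Spark source; details may vary by version):

```scala
// Paraphrased from org.apache.spark.rdd.PairRDDFunctions
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self // already partitioned this way: no work to do
  } else {
    new ShuffledRDD[K, V, V](self, partitioner) // one-time shuffle into the new layout
  }
}
```

Note that this is a one-time shuffle: once the data is laid out by the partitioner (and persisted), subsequent key-based operations such as join can reuse that layout instead of shuffling again.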
Basically, partitionBy does the following: it hashes the key modulo the number of partitions (100 in this case), and since the same key always produces the same hash code, it places all data for a given id (pageId in our case) in the same partition. When you later join, all the data is already available in that partition, avoiding the need for a shuffle.
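To make the "hash the key modulo the number of partitions" step concrete, here is a simplified sketch of what HashPartitioner effectively computes (Spark's actual implementation uses a non-negative modulo in the same way):

```scala
// Simplified sketch of HashPartitioner's partition assignment.
// The adjustment keeps negative hashCodes inside [0, numPartitions).
def getPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}
```

So with 100 partitions, every record whose key is, say, "page-42" is assigned to the same partition number on every node. The join side that is co-partitioned with links can then be matched up partition-by-partition, locally, with no cross-node data movement.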