I have two CSVs: `df_sales` and `df_products`. I want to use PySpark to:

1. Join `df_sales` and `df_products` on `product_id`:

   ```python
   df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, "inner")
   ```

2. Compute the sum of `df_sales.num_pieces_sold` per product:

   ```python
   from pyspark.sql import functions as F

   df_sales.groupby("product_id").agg(F.sum("num_pieces_sold"))
   ```
Both 1 and 2 require `df_sales` to be shuffled on `product_id`. How can I avoid shuffling `df_sales` twice?
One solution to do what you ask would be to use `repartition` to shuffle the dataframe once, and then `cache` to keep the repartitioned result in memory.

However, I am not sure this is a good idea. Depending on its size, caching the entire `df_sales` dataframe might take a lot of memory. Also, the `groupBy` only shuffles two columns of the dataframe, which could turn out to be rather inexpensive. I would start by making sure of that before trying to avoid a shuffle.

More generally, before trying to optimize anything, write it simply, run it, see what takes time, and focus on that.