I am performing an aggregated array collection using the following code in pyspark:
from pyspark.sql.functions import collect_list

df1 = df.groupBy('key').agg(collect_list('value'))
I know that functions like collect force data onto a single node. Is it possible to achieve the same result while still leveraging the power of distributed cloud computing?
There seems to be a bit of misunderstanding here. collect forces the data to be gathered on the driver and is not distributed, whereas collect_list and collect_set are distributed aggregations by default.