Distributing the collect_list function across worker nodes


I am collecting values into an array per key using the following PySpark code:

from pyspark.sql.functions import collect_list
df1 = df.groupBy('key').agg(collect_list('value'))

I know that functions like collect force the data onto a single node. Is it possible to achieve the same result while still leveraging the power of distributed cloud computing?

Best answer

There seems to be a bit of a misunderstanding here.

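collect is an action on a DataFrame: it brings every row back to the driver, so the result is no longer distributed,

whereas

collect_list and collect_set are aggregate functions and are distributed operations by default. With groupBy, the rows are shuffled by key and each key's list is built on the executors; the result is still a distributed DataFrame. Note that all values for a single key do end up in one row, so a very large group still has to fit in one executor's memory.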