PySpark: use a customized function to store each row in a self-defined object, for example a node object


Is there a way to use the map function to store each row of a PySpark DataFrame in a self-defined Python class object?

[image: PySpark DataFrame with columns id, features, and label]

For example, in the picture above I have a Spark DataFrame, and I want to store every row (id, features, label) in a node object with three attributes: node_id, node_features, and node_label. I am wondering whether this is feasible in PySpark. I have tried something like

for row in df.rdd.collect():
    do_something(row)

but this cannot handle big data and is extremely slow. I am wondering if there is a more efficient way to do this. Much thanks.
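Concretely, the node class and the collect-based loop I tried look roughly like this (do_something above stands in for building and using one node per row):

class Node:
    def __init__(self, node_id, node_features, node_label):
        self.node_id = node_id
        self.node_features = node_features
        self.node_label = node_label

# This pulls every row back to the driver, which is what makes it so slow:
nodes = []
for row in df.rdd.collect():
    nodes.append(Node(row["id"], row["features"], row["label"]))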


1 Answer


You can use the foreach method for this kind of operation; Spark will parallelize it across the executors.
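A minimal sketch, assuming the Node class defined in the question and some per-node action such as writing to an external store (the process_row function is illustrative):

def process_row(row):
    node = Node(row["id"], row["features"], row["label"])
    # Do whatever is needed with the node here, e.g. write it to a database.
    # Note that this runs on the executors, not on the driver, and foreach
    # returns nothing to the driver.
    ...

df.foreach(process_row)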

Refer to Pyspark applying foreach if you need more details.
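If you need to keep working with the node objects afterwards rather than just performing a side effect on each row, the map function mentioned in the question is the distributed way to build them; a rough sketch, again assuming the Node class from the question:

node_rdd = df.rdd.map(lambda row: Node(row["id"], row["features"], row["label"]))
# node_rdd stays distributed across the cluster; node_rdd.take(5) brings only
# a few Node objects back to the driver instead of collecting everything.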