Is there a way to use the map function to store each row of a PySpark DataFrame in a self-defined Python class object?
For example, in the picture above I have a Spark DataFrame, and I want to store every row (id, features, label) in a node object with three attributes: node_id, node_features, and node_label, something like the sketch below. I am wondering if this is feasible in PySpark.
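Roughly, the node class I have in mind would look like this (using the attribute names above):

```python
class Node:
    """Holds one DataFrame row as a plain Python object."""
    def __init__(self, node_id, node_features, node_label):
        self.node_id = node_id
        self.node_features = node_features
        self.node_label = node_label
```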
I have tried something like

```python
for row in df.rdd.collect():
    do_something(row)
```
but this cannot handle big data and is extremely slow, since collect() pulls the entire DataFrame onto the driver. I am wondering if there is a more efficient way to do this. Many thanks.
You can use the foreach method for your operation; the work will be parallelized across the Spark executors. Refer to Pyspark applying foreach if you need more details.
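Here is a minimal sketch of the idea, assuming the Node class and the id/features/label columns from the question; the toy DataFrame and the print call are placeholders for your own data and do_something logic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the DataFrame in the question.
df = spark.createDataFrame(
    [(1, [0.1, 0.2], 0), (2, [0.3, 0.4], 1)],
    ["id", "features", "label"],
)

class Node:
    def __init__(self, node_id, node_features, node_label):
        self.node_id = node_id
        self.node_features = node_features
        self.node_label = node_label

def process_row(row):
    # Runs on the executors, one call per row; nothing is
    # pulled back to the driver the way collect() does.
    node = Node(row.id, row.features, row.label)
    print(node.node_id, node.node_label)  # replace with your do_something(node)

df.rdd.foreach(process_row)

# If you need the objects back as a distributed collection rather than
# as a side effect, map does the same conversion lazily:
nodes = df.rdd.map(lambda row: Node(row.id, row.features, row.label))
```

Note that objects created inside foreach live only on the executors and are discarded afterwards, so use foreach for side effects and map when you want to keep the converted objects as an RDD.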