Is there a way to use the map function to store each row of a PySpark dataframe in a self-defined Python class object?
For example, in the picture above I have a Spark dataframe, and I want to store every row's id, features, and label in a node object (with three attributes: node_id, node_features, and node_label). I am wondering whether this is feasible in PySpark. I have tried something like
    for row in df.rdd.collect():
        do_something(row)
but this cannot handle big data and is extremely slow, since collect() pulls the entire dataframe back to the driver. I am wondering if there is a more efficient way to do this. Much thanks.
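Concretely, my current (slow) attempt looks like the sketch below, assuming the dataframe has columns id, features, and label; the Node class and the sample data are just stand-ins for my real setup:

    from pyspark.sql import SparkSession

    class Node:
        def __init__(self, node_id, node_features, node_label):
            self.node_id = node_id
            self.node_features = node_features
            self.node_label = node_label

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, [0.1, 0.2], 0.0), (2, [0.3, 0.4], 1.0)],
        ["id", "features", "label"],
    )

    # collect() pulls every row to the driver first, which is the bottleneck
    nodes = [Node(row["id"], row["features"], row["label"])
             for row in df.rdd.collect()]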
                        
You can use the foreach method for your operation. The operation will be parallelized by Spark. Refer to Pyspark applying foreach if you need more details.
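As a rough sketch (reusing the Node class and dataframe from the question, with do_something standing in for whatever per-row work you need):

    # foreach runs do_something on the executors in parallel. It returns
    # nothing to the driver, so it suits side effects such as writing
    # each node to an external store.
    def do_something(row):
        node = Node(row["id"], row["features"], row["label"])
        # ... write/send node somewhere ...

    df.foreach(do_something)

    # If you instead need the Node objects back as a distributed
    # collection, map keeps them in an RDD without collecting to the
    # driver.
    nodes_rdd = df.rdd.map(
        lambda row: Node(row["id"], row["features"], row["label"]))

Note that foreach is for side effects only; use map when you want the constructed objects available for further distributed processing.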