I have a DataFrame with a number of rows. I can loop through it with this code:
for row in df.rdd.collect():
But this won't run in parallel, right? So what I want is to map each row, pass it to a UDF, and return a new DataFrame (read from a DB) according to the value in that row.
I tried df.rdd.map(lambda row: read_from_mongo(row, spark)).toDF()
But I got this error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How do I loop over a DataFrame in parallel and hold on to the DataFrame returned for each row?
Every Spark RDD or DataFrame you create is associated with the SparkContext of the application, and the SparkContext can only be referenced in driver code. Your UDF, which returns a DataFrame, tries to reference the SparkContext from the workers rather than from the driver, which is why you get the PicklingError.

So, do you really need to create a separate DataFrame for each row? If:

- you wish to later union the resulting DataFrames into one, and
- the first DataFrame is small enough,

then you can simply collect the first DataFrame's contents on the driver and use them as a filter to return the matching rows from MongoDB in a single read. For parallelism, you then rely on the connector you are using to connect to MongoDB.