pyspark: map each row in a dataframe and apply a UDF which returns a dataframe


I have a DataFrame with a number of rows. I can loop through this DataFrame using this code:

for row in df.rdd.collect():

But this won't run in parallel, right? So what I want is to map each row, pass it to a UDF, and have it return a new DataFrame (read from a DB) according to the value in the row.

I tried df.rdd.map(lambda row:read_from_mongo(row,spark)).toDF()
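For illustration, read_from_mongo could look something like this (a hypothetical sketch, not the exact function; the essential point is that it builds a DataFrame with the SparkSession inside the mapped function):

def read_from_mongo(row, spark):
    # Hypothetical: read a MongoDB collection and filter it by a value from the row.
    # Because this runs inside rdd.map, the SparkSession/SparkContext is referenced
    # on the workers, which is what triggers the SPARK-5063 error below.
    return (spark.read.format("mongo")
            .option("uri", "mongodb://host:27017/mydb.mycollection")  # made-up URI
            .load()
            .filter("id = '{}'".format(row["id"])))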

But I got this error:

_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

How do I process a DataFrame in parallel and keep the DataFrame returned for each row?


There is 1 answer below.


Every Spark RDD or DataFrame you create is associated with the SparkContext of the application, and the SparkContext can only be referenced in driver code. Your UDF, which returns a DataFrame, tries to reference the SparkContext from the workers, not from the driver. So, why do you need to create a separate DataFrame for each row? If

- you wish to later union the resulting DataFrames into one, and
- the first DataFrame is small enough,

then you can simply collect the first DataFrame's content and use it as the filter to return the matching rows from MongoDB. For parallelism here, you need to rely on the connector you are using to connect to MongoDB.
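A minimal sketch of that approach, assuming the MongoDB Spark connector is available (the format name is "mongo" for connector 3.x, "mongodb" for 10.x) and using made-up database, collection, and key-column names; spark and df are the session and DataFrame from the question:

from pyspark.sql.functions import col

# Collect the keys from the (small) driver-side DataFrame.
keys = [row["id"] for row in df.select("id").collect()]

# Read the collection once; the connector takes care of partitioning the read in parallel.
mongo_df = (spark.read.format("mongo")
            .option("uri", "mongodb://host:27017/mydb.mycollection")  # hypothetical URI
            .load())

# Keep only the documents whose key matches one of the collected values -
# one combined DataFrame instead of one DataFrame per input row.
result = mongo_df.filter(col("id").isin(keys))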