I am reading High Performance Spark and the author makes the following claim:
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration, as shown in Example 3-58.
Example 3-58 is labeled "Round trip through RDD to cut query plan" and is reproduced below:
val rdd = df.rdd
rdd.cache()
sqlCtx.createDataFrame(rdd, df.schema)
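(The book's example uses the older sqlCtx; for context, here is a minimal sketch of the same round trip against the SparkSession API. The helper name cutPlan and the variable spark are my own, not from the book.)

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch (not from the book): round-trip a DataFrame through its RDD so the
// rebuilt DataFrame starts from a fresh logical plan while keeping the same rows.
def cutPlan(spark: SparkSession, df: DataFrame): DataFrame = {
  val rdd = df.rdd                        // RDD[Row]; the Catalyst plan is left behind
  rdd.cache()                             // avoid recomputing the old plan when the new DF is evaluated
  spark.createDataFrame(rdd, df.schema)   // same schema and data, empty lineage
}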
Does anyone know the underlying reason that makes this workaround necessary?
For reference, a bug report has been filed for this issue: https://issues.apache.org/jira/browse/SPARK-13346
The issue has been closed without a fix, and the maintainers do not seem to believe it needs to be addressed.
From my understanding, the lineage keeps growing in iterative algorithms, e.g.:
step 1: read DF1, DF2
step 2: update DF1 based on DF2 value
step 3: read DF3
step 4: update DF1 based on DF3 value
...
In this scenario DF1's lineage keeps growing, and unless it is truncated (e.g. via DF1.rdd and back), the ever-larger query plan will crash the driver after 20 or so iterations, roughly like the sketch below.
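A minimal sketch of that pattern, assuming a SparkSession is in scope as spark (as in spark-shell); the column name, transformation, and iteration counts are made up for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Made-up iterative job: each withColumn adds a node to DF1's logical plan,
// so every few iterations we round-trip through the RDD to reset the plan.
var df1: DataFrame = spark.range(1000).toDF("value")
for (i <- 1 to 50) {
  df1 = df1.withColumn("value", col("value") + 1)   // plan keeps growing
  if (i % 10 == 0) {
    val rdd = df1.rdd                                // drop the accumulated plan
    rdd.cache()
    df1 = spark.createDataFrame(rdd, df1.schema)     // fresh plan, same data
  }
}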