For "iterative algorithms," what is the advantage of converting to an RDD then back to a Dataframe


I am reading High Performance Spark and the author makes the following claim:

While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration, as shown in Example 3-58.

Example 3-58 is labeled "Round trip through RDD to cut query plan" and is reproduced below:

// Example 3-58: round trip through an RDD to cut the query plan.
val rdd = df.rdd
rdd.cache()
sqlCtx.createDataFrame(rdd, df.schema)
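
One way to see the effect is to compare the query plans before and after the round trip. This is a minimal sketch, assuming a SparkSession named spark and a DataFrame df that has already been built up over several iterations:

val roundTripped = spark.createDataFrame(df.rdd, df.schema)

// The original plan grows with each iteration; the rebuilt DataFrame's plan
// should show only a single scan over the existing RDD.
df.explain(true)
roundTripped.explain(true)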

Does anyone know the underlying reason that makes this workaround necessary?

For reference, a bug report has been filed for this issue and is available at the following link: https://issues.apache.org/jira/browse/SPARK-13346

There does not appear to be a fix, but the maintainers have closed the issue and do not seem to believe they need to address it.

1 Answer
From my understanding, the lineage keeps growing in iterative algorithms, e.g.:

step 1: read DF1, DF2

step 2: update DF1 based on DF2's values

step 3: read DF3

step 4: update DF1 based on DF3's values

...and so on.

In this scenario DF1's lineage keeps growing, and unless it is truncated by round-tripping through DF1.rdd, it will crash the driver after 20 or so iterations.
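
Below is a minimal, self-contained sketch of what that round trip looks like inside a loop. The local SparkSession, the single "value" column, the increment transformation, and the iteration count are all placeholders for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

object IterativePlanTruncation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-truncation-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy starting DataFrame with one numeric column.
    var df: DataFrame = spark.range(0, 1000).toDF("value")

    for (_ <- 1 to 20) {
      // Each iteration would otherwise add another layer to the logical plan.
      df = df.withColumn("value", $"value" + 1)

      // Round trip through an RDD so the next iteration starts from a fresh,
      // short query plan instead of an ever-growing one.
      val rdd = df.rdd
      rdd.cache()
      df = spark.createDataFrame(rdd, df.schema)
    }

    df.show(5)
    spark.stop()
  }
}

The rebuilt DataFrame wraps the RDD as a fresh leaf node in the logical plan, so Catalyst only analyzes the short new plan on each iteration rather than the accumulated history of every previous step; the RDD still carries that history as lineage, but it is never re-analyzed by the optimizer.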