I am reading High Performance Spark and the author makes the following claim:
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration, as shown in Example 3-58.
Example 3-58 is labeled "Round trip through RDD to cut query plan" and is reproduced below:
val rdd = df.rdd
rdd.cache()
sqlCtx.createDataFrame(rdd, df.schema)
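(The book's example uses the older sqlCtx; for context, here is a minimal sketch of the same round trip against the SparkSession API. The helper name cutPlan and the variable spark are my own, not from the book.)

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch (not from the book): round-trip a DataFrame through its RDD so the
// rebuilt DataFrame starts from a fresh logical plan while keeping the same rows.
def cutPlan(spark: SparkSession, df: DataFrame): DataFrame = {
  val rdd = df.rdd                        // RDD[Row]; the Catalyst plan is left behind
  rdd.cache()                             // avoid recomputing the old plan when the new DF is evaluated
  spark.createDataFrame(rdd, df.schema)   // same schema and data, empty lineage
}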
Does anyone know the underlying reason that makes this workaround necessary?
For reference, a bug report has been filed for this issue: https://issues.apache.org/jira/browse/SPARK-13346
The issue has been closed without a fix, and the maintainers do not seem to believe it needs to be addressed.
From my understanding, the lineage keeps growing in iterative algorithms, e.g.:
step 1: read DF1, DF2
step 2: update DF1 based on DF2 value
step 3: read DF3
step 4: update DF1 based on DF3 value
...
In this scenario DF1's lineage keeps growing, and unless it is truncated (e.g. via DF1.rdd and back), the ever-larger query plan will crash the driver after 20 or so iterations, roughly like the sketch below.
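A minimal sketch of that pattern, assuming a SparkSession is in scope as spark (as in spark-shell); the column name, transformation, and iteration counts are made up for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Made-up iterative job: each withColumn adds a node to DF1's logical plan,
// so every few iterations we round-trip through the RDD to reset the plan.
var df1: DataFrame = spark.range(1000).toDF("value")
for (i <- 1 to 50) {
  df1 = df1.withColumn("value", col("value") + 1)   // plan keeps growing
  if (i % 10 == 0) {
    val rdd = df1.rdd                                // drop the accumulated plan
    rdd.cache()
    df1 = spark.createDataFrame(rdd, df1.schema)     // fresh plan, same data
  }
}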