Pyspark offers a great possibility to parallelize cross-validation of models via https://github.com/databricks/spark-sklearn
as simple substitution of sklearn's GridSearchCV
with
from spark_sklearn import GridSearchCV
How can I achieve similar functionality for Spark's Scala CrossValidator
i.e. to parallelize each fold?
Since spark 2.3 :
You can do that using the
setParallelism(n)
method with theCrossValidator
or on creation. i.e :or
Before spark 2.3 :
You can't do that in Spark Scala. You can't parallelize Cross Validation in Scala Spark.
If you have read the documentation of
spark-sklearn
well, GridSearchCV is parallelized but the model training isn't. Thus this is kind of useless on scale. Furthermore, you can parallelize Cross Validation for the Spark Scala API due to the famousSPARK-5063
:Excerpt from the README.md :