Pyspark offers a great possibility to parallelize cross-validation of models via https://github.com/databricks/spark-sklearn
as simple substitution of sklearn's GridSearchCV with
from spark_sklearn import GridSearchCV
How can I achieve similar functionality for Spark's Scala CrossValidator i.e. to parallelize each fold?
Since spark 2.3 :
You can do that using the
setParallelism(n)method with theCrossValidatoror on creation. i.e :or
Before spark 2.3 :
You can't do that in Spark Scala. You can't parallelize Cross Validation in Scala Spark.
If you have read the documentation of
spark-sklearnwell, GridSearchCV is parallelized but the model training isn't. Thus this is kind of useless on scale. Furthermore, you can parallelize Cross Validation for the Spark Scala API due to the famousSPARK-5063:Excerpt from the README.md :