I am new to PySpark. I have a dataset that contains categorical features, and I want to use PySpark's regression models to predict continuous values. I am stuck on the pre-processing that the data needs before it can be used with MLlib models.

Best answer

Yes, it is necessary. You have to not only convert the categorical values to numeric indices but also encode those indices so that linear models can use them. Both steps are implemented in pyspark.ml (not pyspark.mllib):

  • pyspark.ml.feature.StringIndexer - indexing.
  • pyspark.ml.feature.OneHotEncoder - encoding.