How does VectorSlicer work in Spark 2.0?


The official Spark documentation says:

VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.

  • Does this select the important features from the set of features?

  • If that is the case, how is it done without any mention of a dependent variable?

I am trying to perform data clustering and I need the important features which will contribute to the clusters better. Can I use VectorSlicer for this?


Best answer

Does this select the important features from the set of features?

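It doesn't. It literally slices the vector, keeping only the indices you specify. It makes no judgment about which features matter, which is why no dependent variable is involved.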

I need the important features which will contribute to the clusters better.

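  • If you have categorical data, consider using ChiSqSelector. Note that it ranks features with a chi-squared test against a label column, so it needs labeled data.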

  • Otherwise you can use dimensionality reduction like PCA. It won't be the same as feature selection but should provide similar benefits (keep only the most important signals, discard the rest).