I'm testing the iris dataset (which can be loaded with the function load_iris() from sklearn.datasets) with scikit-learn's normalize and VarianceThreshold. It seems that if I use MinMaxScaler and then run VarianceThreshold, there are no features left.
Before scaling:
Column: sepal length (cm) Mean: 5.843333333333334 var = 0.6811222222222223 var/mean: 0.11656398554858338
Column: sepal width (cm) Mean: 3.0573333333333337 var = 0.1887128888888889 var/mean: 0.06172466928332606
Column: petal length (cm) Mean: 3.7580000000000005 var = 3.0955026666666665 var/mean: 0.8237101295015078
Column: petal width (cm) Mean: 1.1993333333333336 var = 0.5771328888888888 var/mean: 0.48121141374837856
After scaling (MinMaxScaler):
Column: sepal length (cm) Mean: 0.42870370370370364 var = 0.052555727023319614 var/mean: 0.12259219262459005
Column: sepal width (cm) Mean: 0.44055555555555553 var = 0.03276265432098764 var/mean: 0.07436668067815606
Column: petal length (cm) Mean: 0.46745762711864397 var = 0.08892567269941587 var/mean: 0.19023258481745967
Column: petal width (cm) Mean: 0.4580555555555556 var = 0.10019668209876545 var/mean: 0.2187435145879658
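For reference, numbers like these can be reproduced with something along the following lines; loading the data as a DataFrame via load_iris(as_frame=True) and using numpy's default population variance (ddof=0) are assumptions that happen to match the figures above.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris(as_frame=True).data           # 150 x 4 DataFrame, all columns in cm
X_scaled = MinMaxScaler().fit_transform(X)  # every column rescaled to [0, 1]

for label, data in [("Before scaling:", X.to_numpy()), ("After scaling:", X_scaled)]:
    print(label)
    for name, col in zip(X.columns, data.T):
        mean, var = col.mean(), np.var(col)  # np.var defaults to ddof=0
        print(f"Column: {name} Mean: {mean} var = {var} var/mean: {var / mean}")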
I'm using VarianceThreshold as:
from sklearn.feature_selection import VarianceThreshold
# .8 * (1 - .8) = 0.16, so any feature with variance <= 0.16 is dropped
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
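Applied to the scaled data, every column variance is at most about 0.10, i.e. below 0.16, so no feature passes the threshold. A sketch continuing from the snippets above; as far as I know, recent scikit-learn versions signal the empty selection with a ValueError rather than returning an array with zero columns:

try:
    print(sel.fit_transform(X_scaled).shape)
except ValueError as err:
    # no scaled column has variance above 0.16, so nothing is kept
    print(err)

# on the unscaled data all four variances exceed 0.16, so every column survives
print(sel.fit_transform(X).shape)  # (150, 4)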
Should we scale the data (for example, through MinMaxScaler) if we want to remove features with low variance?
Basically, a low-variance feature is a feature that carries little information: if a feature's variance is close to zero, the feature is close to taking a constant value. However, each feature can represent a different quantity, so each has a different variance.
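As a toy sketch (made-up numbers, not the iris data): a constant column is dropped by VarianceThreshold even with the default threshold of 0, while a varying column is kept.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_toy = np.array([[0.0, 1.3],
                  [0.0, 2.7],
                  [0.0, 0.4],
                  [0.0, 5.1]])
# the first column is constant (variance 0) and is removed; the second is kept
print(VarianceThreshold(threshold=0.0).fit_transform(X_toy))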
For example, consider as covariates age, which could range from 0 to 100, and number_of_childs, which could range from 0 to 5. Because these two variables take values in very different ranges, they have very different variances. By scaling the features one puts them in the same units, so their information can be compared on the same scale. Notice that for the iris data set all features are already measured on the same scale (centimeters), so their variances are directly comparable.
In this case, a good first step would be to center the data. By doing this one can remove noise from it.
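A minimal sketch of centering with scikit-learn (one option is StandardScaler with with_std=False, which only subtracts each column's mean; note that centering by itself leaves the variances unchanged):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = load_iris(as_frame=True).data
# with_std=False only subtracts each column's mean (no division by the std)
X_centered = StandardScaler(with_std=False).fit_transform(X_iris)
print(X_centered.mean(axis=0))  # ~0 for every column
print(X_centered.var(axis=0))   # variances are unchanged by centering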