Normalizing data before removing low-variance features causes errors


I'm testing the iris dataset (which can be loaded with load_iris() from sklearn.datasets) with scikit-learn's MinMaxScaler and VarianceThreshold.

It seems that if I apply MinMaxScaler and then run VarianceThreshold, there are no features left.

Before scaling:

Column:  sepal length (cm)  Mean:  5.843333333333334  var =  0.6811222222222223  var/mean:  0.11656398554858338
Column:  sepal width (cm)  Mean:  3.0573333333333337  var =  0.1887128888888889  var/mean:  0.06172466928332606
Column:  petal length (cm)  Mean:  3.7580000000000005  var =  3.0955026666666665  var/mean:  0.8237101295015078
Column:  petal width (cm)  Mean:  1.1993333333333336  var =  0.5771328888888888  var/mean:  0.48121141374837856

After scaling (MinMaxScaler)

Column:  sepal length (cm)  Mean:  0.42870370370370364  var =  0.052555727023319614  var/mean:  0.12259219262459005
Column:  sepal width (cm)  Mean:  0.44055555555555553  var =  0.03276265432098764  var/mean:  0.07436668067815606
Column:  petal length (cm)  Mean:  0.46745762711864397  var =  0.08892567269941587  var/mean:  0.19023258481745967
Column:  petal width (cm)  Mean:  0.4580555555555556  var =  0.10019668209876545  var/mean:  0.2187435145879658

I'm using VarianceThreshold as:

    from sklearn.feature_selection import VarianceThreshold
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
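
For completeness, here is a minimal sketch of the full pipeline I am describing (it reproduces the variances above; the last call fails because no scaled feature has variance above 0.16):

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import VarianceThreshold

    data = load_iris()
    X = data.data
    X_scaled = MinMaxScaler().fit_transform(X)

    # per-column variance before and after min-max scaling
    for name, before, after in zip(data.feature_names, X.var(axis=0), X_scaled.var(axis=0)):
        print(name, "var before:", before, "var after:", after)

    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X_scaled)   # no feature passes the 0.16 threshold -> raises an error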

Should we scale the data (for example, through MinMaxScaler) if we want to remove features with low variance?


There are 3 answers below.


Basically, a low-variance feature is one that carries little information: if a feature's variance is close to zero, the feature is close to being constant. However, each feature can represent a different quantity, so each one has a different variance.

For example, consider as covariates age, which could range from 0 to 100, and number_of_childs, which could range from 0 to 5. Because these two variables take values on very different scales, they have very different variances. By scaling the features one puts them in the same units, so their information can be compared on the same scale (see the sketch below).
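
A quick toy illustration of that point (the numbers below are invented just for this example):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical data: similar "shape", very different units
age = np.array([10., 25., 40., 60., 80.])           # years, roughly 0-100
number_of_childs = np.array([0., 1., 2., 3., 4.])   # counts, roughly 0-5

print(age.var(), number_of_childs.var())   # wildly different raw variances

X = np.column_stack([age, number_of_childs])
print(MinMaxScaler().fit_transform(X).var(axis=0))   # now directly comparable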

Notice that for the iris dataset all features are already measured on the same scale (centimeters), that is,

from sklearn.datasets import load_iris

data = load_iris()
print(data.feature_names) 
>>> ['sepal length (cm)',
     'sepal width (cm)',
     'petal length (cm)',
     'petal width (cm)']

In this case, a good first step would be to center the data, that is, subtract each column's mean so that every feature varies around zero.

import pandas as pd 

# put the iris features in a DataFrame and subtract each column's mean
X = pd.DataFrame(data['data'], columns=data.feature_names)
X = X - X.mean()
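
If you prefer to stay inside scikit-learn, the same centering can be done with StandardScaler (a sketch; with_std=False makes it subtract the column means only):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()

# equivalent to X - X.mean(): subtract each column's mean, leave the variance as is
X_centered = StandardScaler(with_std=False).fit_transform(data.data)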

MinMaxScaler uses the following formula:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
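
With the default feature_range=(0, 1) the second line leaves X_std unchanged, so the formula is easy to sanity-check (a small sketch on the iris data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# True: same result as applying the scaler
print(np.allclose(X_std, MinMaxScaler().fit_transform(X)))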

If you check the docs of VarianceThreshold and the definition of variance it relies on: the variance of a set of n equally likely values can be expressed equivalently, without referring to the mean directly, in terms of the squared deviations of all pairs of points from each other:

Var(X) = 1/(2n²) · Σᵢ Σⱼ (xᵢ − xⱼ)² = 1/n² · Σ_{i<j} (xᵢ − xⱼ)²

So let's look at a small example with two columns and three rows:

a  b
1  0
0  1
0  2

Without scaling we have the following (population) variances:

a: (1/(2·3²)) · ((1−0)² + (1−0)² + (0−1)² + (0−0)² + (0−1)² + (0−0)²) = 4/18 = 2/9 ≈ 0.22
b: 12/18 = 2/3 ≈ 0.67

After MinMaxScaler we would have:

a  b
1  0
0  0.5
0  1

and so the variances become:

a: 2/9 ≈ 0.22 (unchanged, since a was already in [0, 1])
b: 3/18 = 1/6 ≈ 0.17

So with a threshold of, say, 0.3, column b would be kept on the raw data but thrown out after normalization.
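
These numbers are easy to verify (a small sketch; as far as I know VarianceThreshold computes the same population variance as numpy's default np.var):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., 0.],
              [0., 1.],
              [0., 2.]])

print(X.var(axis=0))                                # [0.222..., 0.666...]
print(MinMaxScaler().fit_transform(X).var(axis=0))  # [0.222..., 0.166...]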

So yes, when you normalize your data with MinMaxScaler before VarianceThreshold you will usually throw out more columns, because squeezing every feature into the range [0, 1] shrinks the variance of any feature whose original spread was larger than that range.


Scaling the data will generally not help you find redundant features.

Usually, VarianceThreshold is used to remove features whose variance is zero, that is, constants that provide no information whatsoever. The line VarianceThreshold(threshold=(.8 * (1 - .8))) in your code throws away all features with variance below 0.16, and in your case every feature is below that (after MinMaxScaler the highest variance is petal width, at about 0.10), so you throw away everything. I believe you meant to keep the features that contribute more than 80% of the variance, but that is not what this code does. If you applied that line before MinMaxScaler, all your features would pass.
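
To make the ordering concrete (a sketch; the shapes are what matter here):

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

X = load_iris().data
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))   # keep variance > 0.16

print(sel.fit_transform(X).shape)   # (150, 4): before scaling every feature passes
# after MinMaxScaler every variance is below 0.16, so the same call would fail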

In order to remove features with low variance, you first need to define a reasonable threshold for each specific feature; in the general case you cannot set one hard-coded threshold on the variance, because for some features it would be too high and for others too low. PCA, for example, is often used as a feature-selection procedure: you perform PCA and keep only the first K eigenvectors, where K is chosen so that the "energy" of the corresponding eigenvalues is, say, 95% (or even 80%) of the total. So when you have a dataset with 50-100 features you can often reduce the number of features tenfold without losing much information.
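
As a sketch of that PCA recipe (scikit-learn accepts a fraction for n_components and keeps enough components to explain that share of the variance):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# keep as many components as needed to explain 95% of the total variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)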

When you apply StandardScaler, all your features will be centered and normalized, so their mean will be zero and their variance 1 (except for constants, of course). MinMaxScaler by default brings your features into the range [0, 1]. The question is not which scaler to use, but why you want to use a scaler at all. In general you do not want to throw away features unless you have to.
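
A quick sketch of the difference between the two scalers on the iris data:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = load_iris().data

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6), X_std.var(axis=0))   # means ~0, variances 1

X_mm = MinMaxScaler().fit_transform(X)
print(X_mm.min(axis=0), X_mm.max(axis=0))               # every column in [0, 1]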

The assumption that the information is held in the variance is not true for most real datasets, and a feature with low variance often does not correspond to a low-information feature. Since your final goal is not to reduce the number of features but to build a better classifier, you should not optimize too hard on the intermediate goals.