Is it stupid to do L2 normalization with sklearn's Normalizer for a correlation analysis on this type of dataset?

My Data:

I have a dataset of molecules, their activities, and physicochemical descriptors. Each row corresponds to a molecule and each column to a descriptor. There are 80 molecules with 10 descriptors. Descriptor values can be positive or negative, and their scales vary strongly. There are also some outliers.

Preprocessing step:

I normalized this dataset with sklearn's Normalizer, using the defaults:

import pandas as pd
from sklearn.preprocessing import Normalizer
norm = Normalizer()  # defaults: norm="l2"
descriptors_norm = pd.DataFrame(norm.fit_transform(new_df), columns=new_df.columns)
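
For reference, here is a minimal sketch of what the default Normalizer does. The toy array is made up for illustration and is not the dataset from the question. Normalizer rescales each row (here, each molecule) to unit L2 norm; it does not rescale the columns (descriptors).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up toy data: 3 "molecules" (rows) x 2 "descriptors" (columns).
X = np.array([[3.0, 4.0],
              [30.0, 40.0],
              [1.0, 0.0]])

X_l2 = Normalizer().fit_transform(X)  # default norm="l2", applied per row

row_norms = np.linalg.norm(X_l2, axis=1)  # every row now has length 1
col_norms = np.linalg.norm(X_l2, axis=0)  # columns are not normalized

print(X_l2)
print("row norms:", row_norms)
print("column norms:", col_norms)
```

Note that the first two rows differ by a factor of 10 before the transform and become identical afterwards, so each value in the output depends on all other descriptor values of the same molecule.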

Analysis:

I calculated a correlation matrix and saw some interesting relationships between descriptors and activities.

import seaborn as sns
corr_matrix = descriptors_norm.corr()  # Pearson correlation by default
ax = sns.heatmap(corr_matrix, cmap="Spectral")

The problem:

If I plot the original, unnormalized values against each other, I can't see any correlation in some cases. The correlations only appear for the normalized values.
I found that the most dominant correlations were still there (e.g. between molecular weight and the number of valence electrons), but new, interesting ones between activities and physicochemical descriptors popped up.

Is the use of this Normalizer valid for my case, i.e. did it help to find complex relationships, or did I just destroy my data?

What I tried:

I tried the correlation analysis with different preprocessing methods (MinMaxScaler and StandardScaler). With these, the new correlations did not appear (only weak correlation coefficients around 0.3, versus around 0.6 with the Normalizer). I would have expected these methods to show the correlations too if they were really in my data.
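
The comparison above can be sketched on synthetic data. This is a sketch under stated assumptions, not a reproduction of the real dataset: the 80 x 10 frame `raw`, the column names `d0`..`d9`, and the scales are all invented. StandardScaler and MinMaxScaler rescale each column by a positive factor plus a shift, which leaves the Pearson coefficient unchanged, so they should give the same matrix as the raw data. Normalizer works per row and can change it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 80 rows, 10 independent columns on very different scales.
scales = 10.0 ** np.arange(-2, 8)  # hypothetical scales, 1e-2 .. 1e7
raw = pd.DataFrame(rng.normal(size=(80, 10)) * scales,
                   columns=[f"d{i}" for i in range(10)])

def corr_after(transformer):
    """Pearson correlation matrix of `raw` after applying a transformer."""
    scaled = pd.DataFrame(transformer.fit_transform(raw), columns=raw.columns)
    return scaled.corr()

corr_raw = raw.corr()
corr_std = corr_after(StandardScaler())
corr_minmax = corr_after(MinMaxScaler())
corr_l2 = corr_after(Normalizer())

# Column-wise scalers leave Pearson correlations as they were.
same_std = np.allclose(corr_raw.to_numpy(), corr_std.to_numpy())
same_minmax = np.allclose(corr_raw.to_numpy(), corr_minmax.to_numpy())
# Row-wise normalization does not.
same_l2 = np.allclose(corr_raw.to_numpy(), corr_l2.to_numpy())

print("StandardScaler matches raw:", same_std)
print("MinMaxScaler matches raw:", same_minmax)
print("Normalizer matches raw:", same_l2)
```

Because the columns in this sketch are generated independently, any coefficient that changes under Normalizer here comes from the row-wise rescaling and not from a relationship in the data.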

I also tried Normalizer(norm="l1"), and it found these correlations too, just like the default Normalizer.

I also worked with the unnormalized data and used different correlation coefficients for the correlation matrix (Spearman, Kendall), and did not find these correlations. If the relationships were nonlinear, they should have shown up, right?
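
As a point of reference for the rank-based coefficients, here is a small sketch on made-up data (the frame `toy` and its column names are invented). Spearman's coefficient is computed on ranks, so it picks up monotone relationships, including nonlinear ones, but not a relationship that is nonlinear and non-monotone, such as a U shape.

```python
import numpy as np
import pandas as pd

x = np.arange(-100, 101) / 100.0  # symmetric grid on [-1, 1]
toy = pd.DataFrame({
    "x": x,
    "monotone": np.exp(3 * x),  # nonlinear but strictly increasing
    "u_shaped": x ** 2,         # nonlinear and not monotone
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

rho_monotone = spearman.loc["x", "monotone"]
rho_u_shaped = spearman.loc["x", "u_shaped"]

print("Pearson  x vs monotone:", pearson.loc["x", "monotone"])
print("Spearman x vs monotone:", rho_monotone)
print("Spearman x vs u_shaped:", rho_u_shaped)
```

So a rank-based coefficient near zero on the raw data rules out a strong monotone relationship, but on its own it does not rule out every nonlinear one.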
