I have a dataset that I have cleaned up, and before I run any machine learning models I am looking at the correlations.
I read about Pearson's r correlation:
- |0.5| to |1.00| = Strong
- |0.3| to |0.49| = Intermediate
- |0.0| to |0.29| = Weak
I did not understand a couple of things:
Independent column and independent column
- If I have a Strong correlation, is it a good thing or a bad thing?
- Does a strong correlation (not a perfect 1.0) mean that the two columns are basically the same?
- If the correlation is good/bad, should I drop one of the two columns?
Independent column and dependent column
- If I have a Strong correlation, is it a good thing or a bad thing?
- If the correlation is good/bad, should I drop the independent column?
If two columns (features) have a very high correlation, you can indeed drop one of them and you will get almost the same or better results.
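As a rough illustration of dropping one column from a highly correlated pair, here is a minimal sketch using NumPy. The helper name `drop_correlated`, the 0.9 threshold, and the synthetic data are all assumptions made for the example, not something from your dataset; pick a threshold that suits your own data.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return the indices of the columns to keep.

    A column is dropped when its absolute Pearson correlation with an
    already kept column exceeds `threshold` (hypothetical default).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # columns are features
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic data: column 1 is column 0 plus a little noise, column 2 is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, a + 0.05 * rng.normal(size=500), b])

kept = drop_correlated(X, threshold=0.9)
X_reduced = X[:, kept]
print(kept, X_reduced.shape)
```

This greedy version always keeps the first column of a correlated pair; which one you actually keep is a judgment call (for example, the one that is easier to interpret or has fewer missing values).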
Another way of dealing with correlations in your data that doesn't require as much manual inspection is "whitening", for example with PCA or ZCA. This way you can also deal with features that are less than 100% correlated.
Whitening gets rid of the correlation between features, and with PCA you can also reduce the dimensionality, so a less powerful learning algorithm may be enough to get the same or better results.
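To make the whitening idea concrete, here is a minimal PCA whitening sketch in plain NumPy. The function name `pca_whiten`, its `n_components` and `eps` parameters, and the synthetic data are assumptions for illustration only; in practice you would fit the transform on your training data and reuse it on the test data.

```python
import numpy as np

def pca_whiten(X, n_components=None, eps=1e-8):
    """Project X onto its principal axes and rescale each to unit variance."""
    Xc = X - X.mean(axis=0)                 # center every feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:            # optional dimensionality reduction
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    # eps guards against division by a near-zero eigenvalue
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

# Synthetic data: columns 0 and 1 are strongly (but not perfectly) correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
X = np.column_stack([a,
                     0.8 * a + 0.6 * rng.normal(size=1000),
                     rng.normal(size=1000)])

Z = pca_whiten(X)                     # decorrelated, unit variance
cov_Z = np.cov(Z, rowvar=False)       # approximately the identity matrix
Z2 = pca_whiten(X, n_components=2)    # keep only the two strongest components
print(np.round(cov_Z, 3))
```

ZCA whitening differs only in rotating the result back into the original feature space (multiplying by the transpose of the full eigenvector matrix), which keeps the whitened features closer to the original ones but does not reduce the dimensionality.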