I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:
correlation matrix cat vs. num
This seems nonsensical, since this would practically show the cardinality of the the categorical variable instead of the correlation to the numerical variable. The question is: how to deal with the issue in order to get a meaningful correlation.
The Python code below shows how I implemented the correlation ratio:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']
def corr_ratio(cats, nums):
avgtotal = nums.mean()
elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
cu = cats.unique()
for i in range(cu.size):
cn = cu[i]
filt = cats == cn
elements_count[i] = filt.sum()
elements_avg[i] = nums[filt].mean(axis=0)
numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2)) # total variance
return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)
rows = []
for cat in cat_cols:
col = []
for num in num_cols:
col.append(round(corr_ratio(train[cat], train[num]), 2))
rows.append(col)
df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()
If I am not mistaken, there is another method called Theil’s U. How about trying this out and see if the same problem will occur?
You can use this:
num_cols:
your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols:
your_df.select_dtypes(include=['object']).columns.to_list()