Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

Question

Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

718 Views Asked by Taq Seorangpun At 31 July 2025 at 00:32

I am new to data science and trying to get a grip on exploratory data analysis. My goal is to get a correlation matrix between all the variables. For numerical variables I use Pearson's R, for categorical variables I use the corrected Cramer's V. The issue now is to get a meaningful correlation between categorical and numerical variables. For that I use the correlation ratio, as outlined here. The issue with that is that categorical variables with high cardinality show a high correlation no matter what:

correlation matrix cat vs. num

This seems nonsensical, since this would practically show the cardinality of the the categorical variable instead of the correlation to the numerical variable. The question is: how to deal with the issue in order to get a meaningful correlation.

The Python code below shows how I implemented the correlation ratio:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.DataFrame({
    'id': [0,1,2,3,4,5,6,7,8,9,10,11], 'num3': [6,3,3,9,6,9,9,3,6,3,6,9],
    'cat2': [0,1,0,1,0,1,0,1,0,1,0,1], 'cat3': [0,1,2,0,1,2,0,1,2,0,1,2],
    'cat6': [0,4,8,2,6,10,0,4,8,2,6,10], 'cat12': [0,7,2,9,4,11,6,1,8,3,10,5],
})
cat_cols, num_cols = ['cat2','cat3','cat6','cat12'], ['id','num3']

def corr_ratio(cats, nums):
    avgtotal = nums.mean()
    elements_avg, elements_count = np.zeros(len(cats.index)), np.zeros(len(cats.index))
    cu = cats.unique()
    for i in range(cu.size):
        cn = cu[i]
        filt = cats == cn
        elements_count[i] = filt.sum()
        elements_avg[i] = nums[filt].mean(axis=0)
    numerator = np.sum(np.multiply(elements_count, np.power(np.subtract(elements_avg, avgtotal), 2)))
    denominator = np.sum(np.power(np.subtract(nums, avgtotal), 2))  # total variance
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

rows = []
for cat in cat_cols:
    col = []
    for num in num_cols:
        col.append(round(corr_ratio(train[cat], train[num]), 2))
    rows.append(col)

df = pd.DataFrame(np.array(rows), columns=num_cols, index=cat_cols)
sns.heatmap(df)
plt.tight_layout()
plt.show()

Original Q&A

There are 2 best solutions below

**Ming Jun Lim** · Answer 1

If I am not mistaken, there is another method called Theil’s U. How about trying this out and see if the same problem will occur?

You can use this:
num_cols: your_df.select_dtypes(include=['number']).columns.to_list()
cat_target_cols: your_df.select_dtypes(include=['object']).columns.to_list()

corr_df = pd.DataFrame(associations(dataset=your_df, numerical_columns=num_cols, nom_nom_assoc='theil', figsize=(20, 20), nominal_columns=cat_target_cols).get('corr'))

**Pierrick Leroy** · Answer 2

It could be because I think you are visualising something more related to chi-2 in your seaborn plot. Cramer's V is a number derived from chi-2 but not equivalent. So it means you could have a high value for a specific cell but a more relevant value for Cramer's V. I'm not even sure it makes sense to compare raw modalities values because they could be on a totally different order of magnitude.

Chi 2 formula Cramer's V formula

Numerical vs. categorical vars: Why 100% correlation for categorical variable with high cardinality?

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in STATISTICS

Related Questions in DATA-SCIENCE

Related Questions in CORRELATION

Related Questions in EXPLORATORY-DATA-ANALYSIS

Trending Questions

Popular # Hahtags

Popular Questions