I have a dataset with 2112 features and 2337 entries, and I want to test the correlation between each feature and the dependent variable. All of the features and the outcome variable are numeric, and the features have been standard-scaled. I am trying to use a Bonferroni correction to count how many features are significantly correlated with the outcome variable.
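For reference, here is a minimal stand-in setup so the snippets below are self-contained; the data is randomly generated and purely hypothetical, a placeholder for the real feature matrix and outcome:

```python
import numpy as np

# Hypothetical placeholder data with the stated shape; substitute the
# real standard-scaled feature matrix x and numeric outcome y.
rng = np.random.default_rng(0)
x = rng.standard_normal((2337, 2112))  # standard-scaled features
y = rng.standard_normal(2337)          # numeric outcome
```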
The first way I am trying is:
```python
import numpy as np
from scipy.stats import pearsonr
from pingouin import power_corr

# Bonferroni-corrected alpha for 2112 tests
alpha = 0.05 / 2112

# smallest correlation detectable with 80% power at this alpha
# (n = 2337 samples)
power_r = power_corr(n=2337, power=0.80, alpha=alpha, alternative="greater")
# power_r = 0.2

r = []
count = 0
indices = []
p_vals = []

# x is the matrix containing the features
# y is the outcome variable
for i in range(x.shape[1]):
    coef, sign = pearsonr(x[:, i], y)
    p_vals.append(sign)  # keep every p-value for the multipletests comparison below
    # keep the feature only if its p-value beats the corrected alpha
    # and its coefficient clears the computed power_r
    if sign < alpha and abs(coef) >= power_r:
        count += 1
        indices.append(i)
        r.append(coef)
```
`power_r` gives the minimum correlation coefficient a feature's correlation must reach in order to even count as significant at this alpha and power.
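As a sanity check (a sketch, using the same pingouin API as above), plugging the threshold back into `power_corr` and solving for power should recover roughly 0.80:

```python
from pingouin import power_corr

# With r, n, and alpha fixed, power_corr solves for the remaining
# quantity, the achieved power, which should come out at ~0.80 here.
achieved_power = power_corr(r=power_r, n=2337, alpha=alpha, alternative="greater")
print(achieved_power)
```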
If I print `count`, I get 1361, i.e. 1361 features are significantly correlated with the outcome variable.
However, if I take the `p_vals` generated by `pearsonr` and feed them to `multipletests`:

```python
from statsmodels.stats.multitest import multipletests

rejected, p_adjusted, _, alpha_corrected = multipletests(
    p_vals, alpha=0.05, method='bonferroni', is_sorted=False, returnsorted=False
)
np.sum(rejected)  # 1501 features are significant
```
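As far as I understand, `multipletests` with `method='bonferroni'` rejects exactly when the raw p-value falls below 0.05 / m, i.e. the same corrected alpha as in the loop above, just without the extra `abs(coef) >= power_r` filter. A quick check (a sketch; assumes `p_vals` holds one p-value per feature):

```python
import numpy as np

m = len(p_vals)  # 2112 tests
# Bonferroni rejection on raw p-values: p < 0.05 / m
print(np.sum(np.array(p_vals) < 0.05 / m))  # should match np.sum(rejected), i.e. 1501
```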
So which is the better way of testing for significant correlations here? Or am I comparing apples to oranges?