Exactly the same accuracy values in RFECV

I'm trying to fit a logistic regression with RFECV. That's my code:

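import random

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver = "lbfgs",
                             max_iter = 1000)
random.seed(4711)
rfecv = RFECV(estimator = log_reg,
              scoring = "accuracy",
              cv = 10)

Model = rfecv.fit(X_train, y_train)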

I don't think there is anything wrong with my data or my code, but the accuracy is exactly the same for almost every number of features:

Model.grid_scores_
array([0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76200776,
       0.76200776, 0.76200776, 0.76200776, 0.76200776, 0.76556425,
       0.80968999, 0.80962074])

How can this happen? My data set is fairly large (more than 20,000 observations). I cannot imagine that exactly the same cases are classified correctly in every fold of the cross-validation. But if so, how could this happen? One variable can explain as much as 19 can, but not as much as 20 could? Then why not just take the first and the 20th? I'm really confused.

There is 1 best solution below

I believe all your accuracies are the same because LogisticRegression uses L2 regularization by default. That is, penalty='l2' unless you pass it something else.

This means that even when Model is using all 22 features, the underlying estimator log_reg is penalizing the beta coefficients with L2 regularization. So if you prune the least important features, it won't affect the accuracy much, because the underlying logit model with 22 features has already shrunk the coefficients of the least important features toward zero (L2 shrinks them but does not set them exactly to zero).
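To make the shrinkage argument concrete, here is a minimal sketch on synthetic data. The data set from make_classification, the helper coef_norm, and the particular C values are assumptions for illustration only, not taken from the question. In scikit-learn, C is the inverse of the regularization strength, so a small C means a strong L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical): 22 features, only 3 informative.
X, y = make_classification(n_samples=2000, n_features=22, n_informative=3,
                           n_redundant=0, random_state=0)


def coef_norm(C):
    """Fit an L2-penalized logistic regression and return the L2 norm of its coefficients."""
    model = LogisticRegression(solver="lbfgs", max_iter=1000, C=C)
    model.fit(X, y)
    return float(np.linalg.norm(model.coef_))


norm_strong = coef_norm(0.001)   # strong penalty
norm_default = coef_norm(1.0)    # scikit-learn default
norm_weak = coef_norm(1000.0)    # penalty almost switched off

print(norm_strong, norm_default, norm_weak)
```

If the coefficient norm grows as C grows, the penalty is what holds the coefficients down. Whether that explains the identical accuracies in your data is something only a run on X_train and y_train can show.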

I suggest you try:

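import random

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Model with no penalty
# (scikit-learn >= 1.2 spells this penalty=None; older versions use penalty='none')
log_reg = LogisticRegression(solver='lbfgs',
                             max_iter=1000,
                             penalty=None)

# Set seed (kept from the question; with an integer cv the folds are not
# shuffled and lbfgs is deterministic, so the result does not depend on it)
random.seed(4711)

# Initialize same search as before
rfecv = RFECV(estimator=log_reg,
              scoring='accuracy',
              cv=10)

# Fit search
rfecv.fit(X_train, y_train)

# Tell us how it went: mean CV accuracy for 1, 2, ..., n features
# (grid_scores_ was removed in scikit-learn 1.2; older versions use rfecv.grid_scores_)
rfecv.cv_results_['mean_test_score']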