Obtain variable selection order from glmnet

1k Views Asked by At

I've been using the glmnet R package to build a LASSO regression model for one target variable Y (numeric) and 762 covariates. I use the glmnet() function and then coef(fit, s = 0.056360) to get the coefficient values for that specific value of lambda.

What I now need is the variable selection order, i.e. which of the selected covariates is selected first (enters the model first), second, third and so on.

When using plot(fit, label = TRUE) I can theoretically see the order via the plotted paths, however, there are too many covariates for the labels to be legible.

You can see from the image that the first covariate is 267 (green path), then comes 12, but the rest is illegible.

covariate paths

3

There are 3 best solutions below

0
On

Your coefficients are stored under:

head(fit$beta[,1:6])
6 x 6 sparse Matrix of class "dgCMatrix"
     s0          s1         s2         s3        s4         s5
cyl   . -0.01192151 -0.1447790 -0.2658654 -0.376195 -0.4770942
disp  .  .           .          .          .         .        
hp    .  .           .          .          .         .        
drat  .  .           .          .          .         .        
wt    . -0.45776130 -0.7006176 -0.9218541 -1.123436 -1.3065806
qsec  .  .           .          .          .         .      

You can get a similar plot like this:

plot(fit$beta, col = rep(1:nrow(fit$beta),ncol(fit$beta)),pch=20,cex=0.3)

enter image description here

So let's say I use the lambda value s31 :

lambda_values = data.frame(name = colnames(fit$beta),fit$lambda)

subset(lambda_values,name=="s31")
   name fit.lambda
32  s31  0.2877579

We pull out the matrix up to that lambda value:

mat = fit$beta[,1:which(colnames(fit$beta)=="s31")]

And write a function to return the index of the first non-zero coefficient, or return the last if all are zeros:

first_nonzero = function(x){
                if(any(x!=0)){
                     min(which(x!=0))
                }else{
                     length(x)
                }
                }

Apply this to every row and we get the index where they first enter:

apply(mat,1,first_nonzero)

 cyl disp   hp drat   wt qsec   vs   am gear carb 
   2   32   10   26    2   30   30   23   32   24 
                 
0
On

You can find the fitted model for each lambda along the path in fit$beta. One way to get what you want is to loop through that matrix and check at which step each variable enters the model. You can then use that information to order the list of variables. Here is a quick-and-dirty way to do this:

# Convert the beta table to a matrix object:
betas <- as.matrix(fit$beta)

# Get a list of the variable names:
varlist <- row.names(betas)

# Loop over all variables and check when they enter the model:
which_step <- rep(NA, length(varlist))
for(i in 1:length(varlist))
{
      # The variable enters the model at the first step where it's non-zero:
      which_step[i] <- which.max(betas[i,] != 0)
}

# Order the list of variables after when they enter the model:
varlist[order(which_step)]
0
On

I wrote a function glmnetPath to do this. It initially appeared in my answer to a more recent Stack Overflow Q & A (sorry, I discovered this Q & A later), but was then shipped in an R package on GitHub.

## you may need to first install package "remotes" from CRAN
remotes::install_github("ZheyuanLi/solzy")

## Zheyuan Li's R functions on Stack Overflow
library(solzy)

The function works for models fitted by glmnet and cv.glmnet.

It is computationally efficient. It is fully vectorized without using loops. It also does not convert sparse coefficient matrix to a dense matrix for processing.

It provides a very informative summary of a coefficient path. Here is a reproducible example for demonstration.

library(glmnet)
library(solzy)
set.seed(42)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- rnorm(100)
fit <- glmnet(x, y)
ans <- glmnetPath(fit)

## Try this if you want. It also works!!
## cvfit <- cv.glmnet(x, y)
## glmnetPath(cvfit)

The resulting ans is a list of 3 data frames: enter, leave and ignore.

Interpretation of enter

As lambda decreases, variables (see i for numeric ID and var for variable names) enter the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].

ans$enter
#    i  j ord var     lambda
#1   2  2   1  V2 0.16479873
#2   4  2   1  V4 0.16479873
#3  13  4   2 V13 0.13681880
#4   9  8   3  V9 0.09430389
#5  11  9   4 V11 0.08592619
#6  15  9   4 V15 0.08592619
#7   7 11   5  V7 0.07133744
#8  18 11   5 V18 0.07133744
#9  20 12   6 V20 0.06500001
#10  5 13   7  V5 0.05922559
#11 16 13   7 V16 0.05922559
#12 17 14   8 V17 0.05396415
#13  6 16   9  V6 0.04480199
#14  3 17  10  V3 0.04082190
#15 10 18  11 V10 0.03719540
#16 19 19  12 V19 0.03389106
#17  1 22  13  V1 0.02563735
#18 12 26  14 V12 0.01767083
#19 14 26  14 V14 0.01767083

Interpretation of leave

As lambda increases, variables (see i for numeric ID and var for variable names) leave the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].

ans$leave
#    i  j ord var     lambda
#1  14 25   1 V14 0.01939371
#2  12 25   1 V12 0.01939371
#3   1 21   2  V1 0.02813695
#4  19 18   3 V19 0.03719540
#5  10 17   4 V10 0.04082190
#6   3 16   5  V3 0.04480199
#7   6 15   6  V6 0.04917013
#8  17 13   7 V17 0.05922559
#9  16 12   8 V16 0.06500001
#10  5 12   8  V5 0.06500001
#11 20 11   9 V20 0.07133744
#12 18 10  10 V18 0.07829275
#13  7 10  10  V7 0.07829275
#14 15  8  11 V15 0.09430389
#15 11  8  11 V11 0.09430389
#16  9  7  12  V9 0.10349840
#17 13  3  13 V13 0.15015846
#18  4  1  14  V4 0.18086640
#19  2  1  14  V2 0.18086640

Interpretation of ignored

If non-empty, it lists variables that never enter the model. That is, they are effectively ignored. Yes, this can happen!

ans$ignored
#  i var
#1 8  V8

Notes:

  1. fit$lambda is decreasing, so j is in ascending order in enter but in descending order in leave.

  2. Several variables can enter or leave the model at the same time! The ord column is very informative on this. For example, variable 2 and 4 enter the model simultaneously.

Finally, let me attach the path plots produced by glmnet:

plot(fit, label = TRUE)

path-plot1

plot(fit, xvar = "lambda", label = TRUE)

path-plot2