How to load data only once for multiple glm calls with varying formulas?


I have a dataset with 1 column for dependent variable and 9 for independent variables. I have to fit logit models in R taking all combinations of the independent variables.

I have created formulae for the same to be used in "glm" function. However, every time I call "glm" function, it loads the data (which is same every time as only the formula changes in each iteration).

Is there a way to avoid this so as to speed up my computation? Can I use a vector of formulae in "glm" function and load data only once?

Code:

tempCoeffV <- lapply(formuleVector, function(s) {
  coef(glm(s, data = myData, family = binomial, y = FALSE, model = FALSE))
})


formuleVector is a vector of strings like: 
myData[,1]~myData[,2]+myData[,3]+myData[,5]
myData[,1]~myData[,2]+myData[,6] 

myData is a data.frame.

In each lapply iteration, myData remains the same. It is a data.frame with around 100,000 records. formuleVector is a vector with 511 different formulas. Is there a way to speed up this computation?

Best answer

Great, you don't have factors; otherwise I would have to call model.matrix and then work with its "assign" attribute, rather than simply using data.matrix.
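For comparison, here is a minimal sketch of what the factor case might look like. The data frame `toy` and its columns are hypothetical and not part of the question; the point is only to show how the "assign" attribute of a model matrix maps dummy columns back to their terms.

```r
## Hypothetical toy data: `g` is a factor with three levels, `x1` is numeric
set.seed(1)
toy <- data.frame(y  = rbinom(20, 1, 0.5),
                  x1 = rnorm(20),
                  g  = factor(rep(c("a", "b", "c"), length.out = 20)))

## model.matrix expands the factor into dummy columns (gb, gc)
mm <- model.matrix(y ~ x1 + g, data = toy)
colnames(mm)

## "assign" maps each column to its term: 0 = intercept, 1 = x1, 2 = g
asgn <- attr(mm, "assign")

## keep the intercept plus every column that belongs to term 2 (the factor g)
Xg <- mm[, asgn %in% c(0, 2), drop = FALSE]
```

With factors, selecting a variable means selecting all of its dummy columns at once, which is why plain column indexing is only safe in the all-numeric case.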

## Assuming `myData[, 1]` is your response

## complete model matrix and model response
X <- data.matrix(myData); y <- X[, 1]; X[, 1] <- 1
## name the intercept column so coefficients are labelled correctly
colnames(X)[1] <- "(Intercept)"

## covariate names and response name
vars <- names(myData)

This is how you get your 511 candidates, right?

choose(9, 1:9)
# [1]   9  36  84 126 126  84  36   9   1
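As a quick arithmetic check, the counts above should add up to the 511 formulas mentioned in the question. The sketch below only uses base R:

```r
## total number of non-empty subsets of 9 covariates
sum(choose(9, 1:9))
# [1] 511

## equivalently: each covariate is either in or out, minus the empty model
2^9 - 1
# [1] 511
```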

Now, instead of the number of combinations, we need the combination indices, which are easy to get from combn. The rest of the story is to write a nested loop over all combinations. glm.fit is used because you only care about the coefficients.

  1. the model matrix has already been set up; we only select its columns dynamically;
  2. a nested loop is not a problem here; glm.fit is far more costly than the overhead of the for loops. For readability, don't recode the loops with lapply, for example.
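The column selection in point 1 can be tried on a toy matrix first. This is only an illustration: `Xtoy` and the names x1 to x3 are hypothetical stand-ins for the real model matrix.

```r
## Toy stand-in for the model matrix: intercept column plus 3 covariates
Xtoy <- cbind(1, matrix(1:12, nrow = 4))
colnames(Xtoy) <- c("(Intercept)", "x1", "x2", "x3")

## all 2-variable combinations of 3 covariates; each column is one combination
I <- combn(3, 2) + 1   # +1 skips the intercept column

## columns used by the second combination
ind <- I[, 2]
colnames(Xtoy[, c(1, ind)])
# [1] "(Intercept)" "x1"          "x3"
```

No data is copied or re-parsed per model beyond this column subset, which is where the saving over repeated glm calls with a formula comes from.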

lst <- vector("list", 9)  ## a list to store all results
for (k in 1:9) {
  ## combn index; each column is a combination
  ## plus 1 as an offset as there is an intercept in `X`
  I <- combn(9, k) + 1
  ## now loop through all combinations, calling `glm.fit`
  n <- choose(9, k)
  lstk <- vector("list", n)
  for (j in seq_len(n)) {
    ## current index
    ind <- I[, j]
    ## get regression coefficients
    b <- glm.fit(X[, c(1, ind)], y, family = binomial())$coefficients
    ## attach model formula as an attribute
    attr(b, "formula") <- reformulate(vars[ind], vars[1])
    ## store
    lstk[[j]] <- b
  }
  lst[[k]] <- lstk
}

In the end, lst is a nested list: lst[[k]] holds the coefficient vectors of all k-variable models. Use str(lst) to inspect it.
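If a flat list with one element per model is more convenient, the nested result can be flattened and labelled by formula. The sketch below is self-contained and runs the same approach on simulated data with only 3 covariates; `dat`, `flat` and `ref` are hypothetical names, and the comparison with glm is just a sanity check, not part of the original answer.

```r
## Simulated stand-in data: binary response `y` and 3 numeric covariates
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
dat <- data.frame(y = rbinom(n, 1, plogis(0.5 * x1 - 0.3 * x2)), x1, x2, x3)

## same setup as above, with p covariates instead of 9
p <- ncol(dat) - 1
X <- data.matrix(dat); y <- X[, 1]; X[, 1] <- 1
colnames(X)[1] <- "(Intercept)"
vars <- names(dat)

lst <- vector("list", p)
for (k in seq_len(p)) {
  I <- combn(p, k) + 1
  lstk <- vector("list", ncol(I))
  for (j in seq_len(ncol(I))) {
    ind <- I[, j]
    b <- glm.fit(X[, c(1, ind)], y, family = binomial())$coefficients
    attr(b, "formula") <- reformulate(vars[ind], vars[1])
    lstk[[j]] <- b
  }
  lst[[k]] <- lstk
}

## flatten one level: one element per model, 2^p - 1 in total
flat <- unlist(lst, recursive = FALSE)
length(flat)

## label each coefficient vector by its formula
names(flat) <- vapply(flat, function(b) deparse(attr(b, "formula")), character(1))

## sanity check: glm() on the same model should give the same coefficients
ref <- coef(glm(y ~ x1 + x3, data = dat, family = binomial))
all.equal(unname(ref), as.vector(flat[["y ~ x1 + x3"]]))
```

Looking up a model by its formula string, as in the last line, avoids having to remember its position in the nested structure.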