Using glm in R for linear regression on a large dataframe - issues with column subsetting


I am trying to use glm in R on a data frame containing ~1000 columns. I want to pick one specific independent variable and run the model in a loop, once for each of the ~1000 columns representing the dependent variables.

As a test, the glm call works perfectly fine when I specify single columns with df$col1 for both the dependent and the independent variable.

However, I can't seem to subset a range of columns correctly (see below), and I keep getting this error no matter how I format the data frame:

'data' must be a data.frame, environment, or list

What I tried:

# df is my data frame
cols <- df[, 20:1112]

for (i in cols) {
    glm <- glm(df$col1 ~ ., data=df, family=gaussian)
}

Best answer (Ben Bolker)

It would be more idiomatic to loop over the column names and build each formula with reformulate():

predvars <- names(df)[20:1112]
glm_list <- list()  ## presumably you want to save the results??
for (pv in predvars) {
    glm_list[[pv]] <- glm(reformulate(pv, response = "col1"), 
       data=df, family=gaussian)
}
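
For anyone who wants to try the loop without the original data, here is a minimal sketch on simulated data. The data frame `toy`, its column names `x1` to `x5`, and the sizes are made up for illustration; only the loop itself mirrors the code above.

```r
# Simulated stand-in for the real data: one response plus five predictors.
# (toy, x1..x5 and the sample size are assumptions, not the asker's data.)
set.seed(1)
toy <- data.frame(col1 = rnorm(50))
for (j in 1:5) toy[[paste0("x", j)]] <- rnorm(50)

predvars <- names(toy)[2:6]   # plays the role of names(df)[20:1112]
glm_list <- list()
for (pv in predvars) {
    # reformulate("x1", response = "col1") builds the formula col1 ~ x1
    glm_list[[pv]] <- glm(reformulate(pv, response = "col1"),
                          data = toy, family = gaussian)
}

# Pull the slope of each single-predictor model into one named vector
slopes <- sapply(predvars, function(pv) coef(glm_list[[pv]])[[pv]])
```

Because the formula is built from column names and `data` is always the full data frame, each model sees a proper data frame rather than a bare column.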

In fact, if you really just want a Gaussian GLM (which, with the default identity link, is ordinary least squares), then it will be slightly faster to use

lm(reformulate(pv, response = "col1"), data = df)

in the loop instead.
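
As a quick sanity check that the two calls fit the same model, the following sketch compares their coefficients on simulated data (the data frame `toy` and the column `x1` are made up for illustration):

```r
# Simulated data; toy and x1 are assumptions for illustration only.
set.seed(2)
toy <- data.frame(col1 = rnorm(40), x1 = rnorm(40))

f <- reformulate("x1", response = "col1")   # the formula col1 ~ x1

fit_glm <- glm(f, data = toy, family = gaussian)
fit_lm  <- lm(f, data = toy)

# Compare the estimated intercept and slope up to numerical tolerance
same_coefs <- all.equal(coef(fit_glm), coef(fit_lm))
```

The coefficients should agree up to numerical tolerance; the speed difference comes from lm() skipping the iterative fitting machinery that glm() uses.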

If you want to get fancy:

formlist <- lapply(predvars, reformulate, response = "col1")
lm_list <- lapply(formlist, lm, data = df)
names(lm_list) <- predvars
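
Once the models are in a named list, the per-predictor results can be collected into one table. This is a hedged sketch on simulated data: `toy`, the columns `x1` to `x3`, and the `results` layout are assumptions for illustration, not part of the original question.

```r
# Simulated data; toy and x1..x3 are assumptions for illustration only.
set.seed(3)
toy <- data.frame(col1 = rnorm(60), x1 = rnorm(60),
                  x2 = rnorm(60), x3 = rnorm(60))

predvars <- names(toy)[-1]
formlist <- lapply(predvars, reformulate, response = "col1")
lm_list <- lapply(formlist, lm, data = toy)
names(lm_list) <- predvars

# For each model, take the predictor's row from the coefficient table
results <- do.call(rbind, lapply(predvars, function(pv) {
    ct <- coef(summary(lm_list[[pv]]))
    data.frame(predictor = pv,
               estimate  = ct[pv, "Estimate"],
               p_value   = ct[pv, "Pr(>|t|)"])
}))
```

With ~1000 predictors the p-values in such a table would normally need a multiple-testing adjustment, for example with p.adjust().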