Fit model with multiple responses using each predictor columns individually store results in dataframe

805 Views Asked by At

Reference is made towards Fit model using each predictor columns indiviually store results in dataframe, where a dataframe consists one column of a response variable and several columns of predictor variables. The author wished to fit models for the response variable using each of the predictor variables separately, finally creating a dataframe that contains the coefficients of the model. There's an answer https://stackoverflow.com/a/43959567/14435732 down the original question which interests me (copied below).

require(tibble)
require(dplyr)
require(tidyr)
require(purrr)
require(broom)

df <- iris
response_var <- "Sepal.Length"

vars <- tibble(response=response_var,
               predictor=setdiff(names(df), response_var))

compose_formula <- function(x, y)
  as.formula(paste0("~lm(", y, "~", x, ", data=.)"))

models <- tibble(data=list(df)) %>%
           crossing(vars) %>%
           mutate(fmla = map2(predictor, response, compose_formula),
                  model = map2(data, fmla, ~at_depth(.x, 0, .y)))

models %>% unnest(map(model, tidy))

My question is slightly different: now I have a dataframe with several columns of response variables (say Sepal.Length and Sepal.Width) and several columns of predictor variables (say Petal.Length, Petal.Width and Species) (see my first question from Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column). I got a very helpful answer but it would be perfect if I could have kept the names for response and predictor in the formula of the model object.

P.S: I have tried modifying the codes from https://stackoverflow.com/a/43959567/14435732 but encountered several issues:

  1. When I tried tibble() for vars (now that my response_var has several columns), this happens: 'Error: Tibble columns must have compatible sizes.'
  2. at_depth() is defunct.

Is there a way to get a desired output like below? (copied from https://stackoverflow.com/a/43959567/14435732)

# A tibble: 9 x 7
      response    predictor              term   estimate  std.error statistic
         <chr>        <chr>             <chr>      <dbl>      <dbl>     <dbl>
1 Sepal.Length  Sepal.Width       (Intercept)  6.5262226 0.47889634 13.627631
2 Sepal.Length  Sepal.Width       Sepal.Width -0.2233611 0.15508093 -1.440287
3 Sepal.Length Petal.Length       (Intercept)  4.3066034 0.07838896 54.938900
4 Sepal.Length Petal.Length      Petal.Length  0.4089223 0.01889134 21.646019
5 Sepal.Length  Petal.Width       (Intercept)  4.7776294 0.07293476 65.505517
6 Sepal.Length  Petal.Width       Petal.Width  0.8885803 0.05137355 17.296454
7 Sepal.Length      Species       (Intercept)  5.0060000 0.07280222 68.761639
8 Sepal.Length      Species Speciesversicolor  0.9300000 0.10295789  9.032819
9 Sepal.Length      Species  Speciesvirginica  1.5820000 0.10295789 15.365506
# ... with 1 more variables: p.value <dbl>
1

There are 1 best solutions below

3
On

You can use a multi response linear model, where each response is regressed against each predictor separately, so for example:

lm(cbind(Sepal.Length,Sepal.Width) ~ Species,data=iris)

Call:
lm(formula = cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)

Coefficients:
                   Sepal.Length  Sepal.Width
(Intercept)         5.006         3.428     
Speciesversicolor   0.930        -0.658     
Speciesvirginica    1.582        -0.454 

You get two sets of predictors and tidy does a good job on this:

tidy(lm(cbind(Sepal.Length,Sepal.Width) ~ Species,data=iris))
# A tibble: 6 x 6
  response     term              estimate std.error statistic   p.value
  <chr>        <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 Sepal.Length (Intercept)          5.01     0.0728     68.8  1.13e-113
2 Sepal.Length Speciesversicolor    0.93     0.103       9.03 8.77e- 16
3 Sepal.Length Speciesvirginica     1.58     0.103      15.4  2.21e- 32
4 Sepal.Width  (Intercept)          3.43     0.0480     71.4  5.71e-116
5 Sepal.Width  Speciesversicolor   -0.658    0.0679     -9.69 1.83e- 17
6 Sepal.Width  Speciesvirginica    -0.454    0.0679     -6.68 4.54e- 10  

Now we just need to repeat the above, changing the formula on the RHS , so we can do this by using reformulate, for example:

reformulate(termlabels="Species",response="cbind(Sepal.Length,Sepal.Width)")

cbind(Sepal.Length, Sepal.Width) ~ Species

We write a function to do this, and adding an additional column for the predictor:

library(purrr)
library(dplyr)

tidyMLM = function(iv,dat){

    f = reformulate(termlabels=iv,
    response="cbind(Sepal.Length,Sepal.Width)")
    
    tidy(lm(f,data=dat)) %>% mutate(predictor=iv,.after=response)
}

And use the amazing purrr (gone are the days of lapply..):

predictors = setdiff(colnames(iris),c("Sepal.Length","Sepal.Width"))

map_dfr(predictors,tidyMLM,dat=iris)
# A tibble: 14 x 7
   response    predictor    term          estimate std.error statistic   p.value
   <chr>       <chr>        <chr>            <dbl>     <dbl>     <dbl>     <dbl>
 1 Sepal.Leng… Petal.Length (Intercept)      4.31     0.0784     54.9  2.43e-100
 2 Sepal.Leng… Petal.Length Petal.Length     0.409    0.0189     21.6  1.04e- 47
 3 Sepal.Width Petal.Length (Intercept)      3.45     0.0761     45.4  9.02e- 89
 4 Sepal.Width Petal.Length Petal.Length    -0.106    0.0183     -5.77 4.51e-  8
 5 Sepal.Leng… Petal.Width  (Intercept)      4.78     0.0729     65.5  3.34e-111
 6 Sepal.Leng… Petal.Width  Petal.Width      0.889    0.0514     17.3  2.33e- 37
 7 Sepal.Width Petal.Width  (Intercept)      3.31     0.0621     53.3  1.84e- 98
 8 Sepal.Width Petal.Width  Petal.Width     -0.209    0.0437     -4.79 4.07e-  6
 9 Sepal.Leng… Species      (Intercept)      5.01     0.0728     68.8  1.13e-113
10 Sepal.Leng… Species      Speciesversi…    0.93     0.103       9.03 8.77e- 16
11 Sepal.Leng… Species      Speciesvirgi…    1.58     0.103      15.4  2.21e- 32
12 Sepal.Width Species      (Intercept)      3.43     0.0480     71.4  5.71e-116
13 Sepal.Width Species      Speciesversi…   -0.658    0.0679     -9.69 1.83e- 17
14 Sepal.Width Species      Speciesvirgi…   -0.454    0.0679     -6.68 4.54e- 10

If you have say 80 responses, 120 predictors:

df = data.frame(matrix(rnorm(100*200),ncol=200))
colnames(df) = c(paste0("r",1:80),paste0("p",1:120))
responses = as.matrix(df[,grep("r",colnames(df))])
predictors = grep("p",colnames(df),value=TRUE)

tidyMLM = function(iv,dat){
    
        f = reformulate(termlabels=iv,
        response="responses")
        
        tidy(lm(f,data=dat)) %>% mutate(predictor=iv,.after=response)
    }

map_dfr(predictors,tidyMLM,dat=df)

# A tibble: 19,200 x 7
   response predictor term        estimate std.error statistic p.value
   <chr>    <chr>     <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 r1       p1        (Intercept)  -0.0959    0.0882    -1.09   0.280 
 2 r1       p1        p1            0.0217    0.0879     0.247  0.806 
 3 r2       p1        (Intercept)  -0.0174    0.101     -0.172  0.864 
 4 r2       p1        p1           -0.186     0.101     -1.84   0.0685
 5 r3       p1        (Intercept)   0.0214    0.0947     0.226  0.822 
 6 r3       p1        p1            0.0651    0.0944     0.690  0.492 
 7 r4       p1        (Intercept)   0.0708    0.103      0.686  0.495 
 8 r4       p1        p1           -0.142     0.103     -1.38   0.169 
 9 r5       p1        (Intercept)   0.134     0.101      1.33   0.186 
10 r5       p1        p1            0.0248    0.100      0.247  0.805 
# … with 19,190 more rows

I hope this makes sense now.