extract coefficients from more than one linear regression model with multiple terms in r

61 Views Asked by At

I am trying to generate a data frame of intercepts and coefficients for multiple different linear regression models generated from subsets of a data set. I cannot share my data unfortunately but I can explain it using the mtcars set. I am creating a regression model predicting mpg from cyl, hp, and wt for each value of carb. After digging around for a while, I found many examples that do what I want but only for a model with 1 predictor (like mpg~wt). It all falls apart when I add the other terms. This is what I have based my work so far on: https://community.rstudio.com/t/extract-slopes-by-group-broom-dplyr/2751/8 Efficient way to extract coefficients from many linear regression lines

This is what I have tried

library(tidyverse);library(broom)
df <- mtcars
tryme <- df %>%
  split(.$carb)%>%
  map(~lm(mpg~cyl+hp+wt,data=.x)) %>%
  map_df(tidy)

with this result

    term            estimate    std.error   statistic   p.value
1   (Intercept) 46.034633   7.68430306  5.99073626  9.31E-03
2   cyl         2.650503624 4.14371413  0.63964442  5.68E-01
3   hp          -0.230007961    0.1573354   -1.46189576 2.40E-01
4   wt          -5.231999683    9.23136027  -0.56676368 6.11E-01
5   (Intercept) 39.84451509 2.95537984  13.48202845 1.03E-05
6   cyl         -0.846094229    0.93995084  -0.90014732 4.03E-01
7   hp          -0.007452737    0.03998485  -0.18638901 8.58E-01
8   wt          -4.133340298    1.42757472  -2.89535829 2.75E-02
9   (Intercept) 17.50267062 22.13706712 0.79064993  5.74E-01
10  cyl         NA          NA          NA          NA
11  hp          NA          NA          NA          NA
12  wt          -0.3115727  5.73067255  -0.05436931 9.65E-01
13  (Intercept) 45.33390978 12.93999647 3.50339429  1.28E-02
14  cyl         -4.195214198    3.492613    -1.20116778 2.75E-01
15  hp          0.029361878 0.04927895  0.59583008  5.73E-01
16  wt          -1.239041102    1.03937377  -1.19210349 2.78E-01
17  (Intercept) 19.7            NaN         NaN         NaN
18  cyl         NA          NA          NA          NA
19  hp          NA          NA          NA          NA
20  wt          NA          NA          NA          NA
21  (Intercept) 15          NaN         NaN         NaN
22  cyl         NA          NA          NA          NA
23  hp          NA          NA          NA          NA
24  wt          NA          NA          NA          NA

what I want is a table that looks like this:

carb    intercept   cyl         hp          wt
1   46.034633   2.650503624 -0.230007961    -5.231999683
2   39.84451509 -0.846094229    -0.007452737    -4.133340298
3   17.50267062 NA          NA          -0.3115727
4   45.33390978 -4.195214198    0.029361878 -1.239041102
6   19.7            NA          NA          NA
8   15          NA          NA          NA

I don't know how to bring the value of the grouping variable in. If I can get that added to what I already have I know how to transpose the data into the form I need.

I am updating a bit, as an answer here gave me what I wanted, but is causing an error that I can't seem to fix. the solution uses lmList from the nlme library. The code I am using is this

fit <- lmList(SentLength~Unified_UPPER+Unified_LOWER+GRADE | SentType, data=df, na.omit)
and this is the data I am running it on:
SentType    SentLength  Unified_UPPER   Unified_LOWER   GRADE
Jail        0.06578947  0.06666667      0.06666667      3
Other       0           6               0               4
Other       0           6               6               1
Probation   6           0               0               1
Other       0           9               0               6
Jail        1.41447368  6               0               3
Probation   6           0               0               1
Probation   6           0               0               1
Probation   12          0               0               2
Jail        6           16              6               6
Prison      36          27              21              5
Probation   24          9               0               6
Jail        0.23026316  0.5             0               1
Jail        0.09868421  0.06666667      0.06666667      1
Probation   60          27              21              6
Probation   6           1               0               1
Probation   24          0               0               3
Prison      36          54              36              8
Probation   24          9               0               6
Probation   6           0               0               1
Probation   0           0.06666667      0.06666667      1
Probation   24          0               0               3
Jail        0.13157895  1               0.06666667      1
Jail        0.09868421  1               0.1             1
Prison      42          60              48              8
Probation   6           0               0               2
Other       0           0               0               1
Prison      6           6               6               1
Prison      6           6               6               1
Other       0           16              9               7

I get this error: Error in na.fail.default(data) : missing values in object despite no missing values in the data. Can anyone help with this?

3

There are 3 best solutions below

5
Roland On BEST ANSWER

It always amazes me that people insist on using tidyverse mumbo-jumbo for this when it is so incredibly easy with package nlme, which is a "recommended package", meaning it should be part of your R installation and you don't even need to install it.

library(nlme)
fit <- lmList(mpg ~ cyl + hp + wt | carb, data = mtcars)
coef(fit)
#  (Intercept)        cyl           hp         wt
#1    46.03463  2.6505036 -0.230007961 -5.2319997
#2    39.84452 -0.8460942 -0.007452737 -4.1333403
#3    17.50267         NA           NA -0.3115727
#4    45.33391 -4.1952142  0.029361878 -1.2390411
#6    19.70000         NA           NA         NA
#8    15.00000         NA           NA         NA

Note that the row names are the carb values.

There is also a summary method producing convenient output. However, read help("summary.lmList") regarding the pool parameter if you use it.

0
Onyambu On

Use summarise with unnest_wider:

df %>%
   summarise(a = list(coef(lm(mpg~cyl + hp + wt, cur_data()))), .by = carb)%>%
   unnest_wider(a, names_sep = "_")

# A tibble: 6 × 5
   carb `a_(Intercept)`  a_cyl     a_hp   a_wt
  <dbl>           <dbl>  <dbl>    <dbl>  <dbl>
1     4            45.3 -4.20   0.0294  -1.24 
2     1            46.0  2.65  -0.230   -5.23 
3     2            39.8 -0.846 -0.00745 -4.13 
4     3            17.5 NA     NA       -0.312
5     6            19.7 NA     NA       NA    
6     8            15   NA     NA       NA    

EFFICIENT WAY IN BASE R

Seems you are only interested in the coefficients. In base R you could do:

mtcars$carb <- factor(mtcars$carb)
a <- coef(lm(mpg~ carb/(cyl+hp+wt) + 0, mtcars))
a
       carb1        carb2        carb3        carb4        carb6        carb8    carb1:cyl 
46.034632999 39.844515085 17.502670623 45.333909777 19.700000000 15.000000000  2.650503624 
   carb2:cyl    carb3:cyl    carb4:cyl    carb6:cyl    carb8:cyl     carb1:hp     carb2:hp 
-0.846094229           NA -4.195214198           NA           NA -0.230007961 -0.007452737 
    carb3:hp     carb4:hp     carb6:hp     carb8:hp     carb1:wt     carb2:wt     carb3:wt 
          NA  0.029361878           NA           NA -5.231999683 -4.133340298 -0.311572700 
    carb4:wt     carb6:wt     carb8:wt 
-1.239041102           NA           NA 

The coefficients above matches the coefficients you have. You can easily transform to matrix:

matrix(a, nlevels(mtcars$carb))
         [,1]       [,2]         [,3]       [,4]
[1,] 46.03463  2.6505036 -0.230007961 -5.2319997
[2,] 39.84452 -0.8460942 -0.007452737 -4.1333403
[3,] 17.50267         NA           NA -0.3115727
[4,] 45.33391 -4.1952142  0.029361878 -1.2390411
[5,] 19.70000         NA           NA         NA
[6,] 15.00000         NA           NA         NA

Which looks like yours but without names. In the naming part, if you know regular expressions, it is easy as you can directly get the names from the vector a

dm <- list(unique(sub(":.*", "", names(a))),
           replace(unique(gsub(".*:|.*\\d+", "", names(a))), 1, 'intercepy'))
matrix(a,  nlevels(mtcars$carb), dimnames = dm)
      intercepy        cyl           hp         wt
carb1  46.03463  2.6505036 -0.230007961 -5.2319997
carb2  39.84452 -0.8460942 -0.007452737 -4.1333403
carb3  17.50267         NA           NA -0.3115727
carb4  45.33391 -4.1952142  0.029361878 -1.2390411
carb6  19.70000         NA           NA         NA
carb8  15.00000         NA           NA         NA

Suppose you do not know regex, then use split:

b <- mtcars$carb
d <- split(a, rep(levels(b), nlevels(b), length=length(a)))
array2DF(structure(d, dim = nlevels(b)))

 Var1    carb1  carb1:cyl     carb1:hp   carb1:wt
1    1 46.03463  2.6505036 -0.230007961 -5.2319997
2    2 39.84452 -0.8460942 -0.007452737 -4.1333403
3    3 17.50267         NA           NA -0.3115727
4    4 45.33391 -4.1952142  0.029361878 -1.2390411
5    6 19.70000         NA           NA         NA
6    8 15.00000         NA           NA         NA
0
I_O On

One approach (taking advantage of nesting data and the fact that one can stuff complex structures like models, plots etc. into a dataframe's cell by enclosing them into a list):

library(dplyr)
library(tidyr)

mtcars |>
  nest_by(carb) |>
  transmute(coeffs = lm(mpg ~ cyl + hp + wt, data) |>
              coef() |>
              list() ## !
            ) |>
  unnest_wider(coeffs)
## + # A tibble: 6 x 5
##    carb `(Intercept)`    cyl       hp     wt
##   <dbl>         <dbl>  <dbl>    <dbl>  <dbl>
## 1     1          46.0  2.65  -0.230   -5.23 
## 2     2          39.8 -0.846 -0.00745 -4.13 
## 3     3          17.5 NA     NA       -0.312
## 4     4          45.3 -4.20   0.0294  -1.24 
## 5     6          19.7 NA     NA       NA    
## 6     8          15   NA     NA       NA