Avoid writing a large number of column names in a model formula with bs() terms


I want to apply the bs() function to the numerical variables in my dataset when fitting a logistic regression model.

library(splines)

df <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                 f = c("m", "f"), output = c(0, 1))

model <- glm(output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, df = 2) +
               factor(f),
             data = df,
             family = "binomial")

In my actual dataset, I need to apply bs() to way more columns than this example. Is there a way I can do this without writing all the terms?

Best answer, by Zheyuan Li

We can use some string manipulation with sprintf, together with reformulate:

predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, 
#    df = 2) + factor(f)
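
As a rough sketch of how such a formula can then be passed to glm(), the example below fits the model on simulated data. The data frame dat, the seed, the sample size and the data-generating model are all made up for illustration and are not from the question. It also uses df = 3 rather than df = 2, because with the default degree = 3 bs() raises a smaller df to 3 and issues a warning.

```r
library(splines)

# Simulated data (hypothetical; only the column names follow the question)
set.seed(1)
n <- 200
dat <- data.frame(a = runif(n), b = runif(n), d = runif(n), e = runif(n),
                  f = sample(c("m", "f"), n, replace = TRUE))
dat$output <- rbinom(n, 1, plogis(dat$a - dat$b))

# Build the formula from strings, as in the answer
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 3)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")

# The formula object is used exactly like a hand-written one
model <- glm(form, data = dat, family = "binomial")
```

Each bs() term contributes three coefficients here, so the fit has one intercept, twelve spline coefficients and one coefficient for factor(f).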

If you want a different df and degree for each spline, that is also straightforward (note that df cannot be smaller than degree).

predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) + 
#    bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)

Prof. Ben Bolker: I was going to suggest something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").

Yes, this is safer. It also gives an automatic way to include all numerical variables other than "output" as predictors, if that is what the OP wants.
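
A minimal sketch of that automatic selection, combined with the formula construction from the answer, might look like the following. The data frame dat mirrors the question's toy data. Wrapping every non-numeric column in factor() is an extra assumption made here for illustration, and df = 3 is an arbitrary choice.

```r
# Toy data mirroring the question's columns
dat <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                  f = c("m", "f"), output = c(0, 1))

response <- "output"

# Numeric columns other than the response become spline predictors
is.num <- sapply(dat, is.numeric)
predictors <- setdiff(names(dat)[is.num], response)

# Assumption: every non-numeric column enters the model as a factor
other.terms <- sprintf("factor(%s)", names(dat)[!is.num])

bspl.terms <- sprintf("bs(%s, df = 3)", predictors)
form <- reformulate(c(bspl.terms, other.terms), response = response)
```

Building the formula does not evaluate bs(), so library(splines) is only needed once form is passed to a fitting function such as glm().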