I am writing an R package, where the main function takes a model, which may only have a single factor covariate (offsets are allowed). To make sure the user complies with this rule I need to check this.
As an example, let's look at the following four models:
set.seed(123)
n <- 10
## data
data <- data.frame(y = rnorm(n),
trt = rep(c(0, 1), each = n/2),
x = 1:n)
datan <- data
datan$trt <- as.factor(datan$trt)
## models
mod1 <- lm(y ~ factor(trt), data = data)
mod2 <- lm(y ~ offset(x) + as.factor(trt), data = data)
mod3 <- lm(y ~ trt, data = datan)
mod4 <- glm(y ~ trt + offset(x), data = data)
mod5 <- lm(y ~ x + as.factor(trt), data = data)
Models 1, 2 and 3 are ok, models 4 and 5 are not ok (model 4 has a non-factor variable trt
, model 5 has a second covariate x
).
How do I check this using R? Optimally I'd get a TRUE
for a model that's ok, and a FALSE
for a problematic model.
This should work not only with lm()
and glm()
, but also with survreg()
and coxph()
(from package survival). Something that might be useful is looking at the formula eval(getCall(mod1)$formula)
and the data (data
/ datan
).
As indicated in the previous reply by @LAP you can use the
terms()
from these models. However, I would recommend to look at theattr(..., "factors")
andattr(..., "dataClasses")
instead of going to the$model
which requires that the entiremodel.frame()
is stored in the model. This may or may not be the case. Specifically, when re-fitting multiple models you might want to be able to not store the model frame each time.So one idea would be to proceed in the following steps:
attr(..., "factors")
has not exactly one column, the you can returnFALSE
.attr(..., "dataClasses")
if it is"factor"
/"ordered"
and then returnTRUE
, otherwiseFALSE
.R code:
This appears to work well for single-part
formula
-based objects.Dummy data with numeric/factor/ordered
trt
:Various formula specifications:
Expected results for
d1
,d2
, andd3
, respectively:Checks for
lm
without storing the model frame:Checks for
survreg
(Gaussian) andcoxph
. (The latter throws a lot of warnings about non-convergence which is not surprising given the dummy data structure. The checks still work as intended.)Note: If you have multi-part
Formula
-based objects this function might fail and the underlying tests would need to be adapted. Examples for the latter could include count regression models (zeroinfl
,hurdle
), multinomial logit (mlogit
), instrumental variables (ivreg
), heteroscedastic models (vglm
,betareg
,crch
) etc. These might have formulas likey ~ trt | 1
ory ~ trt | trt
ory ~ trt | x
which may or may not still be feasible in your framework.