I am trying to make a model comparison (say, for hypothesis testing) of two GAMs (mgcv
package),
where both models include random effects smooth term (s(bs="re")
), and the
second model additionally include a factor variable. So:
gm0 <- gam(y ~ s(subject, bs = "re"))
gm1 <- gam(y ~ fac + s(subject, bs = "re"))
For reference, I will use a corresponding pair of LMMs:
lmm0 <- lmer(y ~ (1 | subject))
lmm1 <- lmer(y ~ fac + (1 | subject))
As expected, the fixed and random effects parameters are very close between
GAM and the LMM counterparts. The addition of fac
improves the LMM fit,
as it should be, but seemingly not for GAMs. Unlike in LMM, there is barely any
difference in the reported likelihoods (and so AIC) between the two GAM models.
Therefore, my question is: how to compare the fit of such models?
Here's an example:
library(lme4)
library(mgcv)
# Data with one explanatory variable (2-level factor), and 'm' subjects:
n <- 100 # no. of observations
m <- 20 # no. of subjects
set.seed(666)
fac <- gl(2, n / 2, labels = LETTERS[1:2])
subject <- gl(m, n / m, labels = letters[1:m])
beta <- c(-1, 1)
y <- rnorm(n, beta[fac], .25) + rnorm(m, sd = 1.5)[subject]
##
# LMM for comparison:
lmm0 <- lmer(y ~ (1 | subject), REML = FALSE)
lmm1 <- lmer(y ~ fac + (1 | subject), REML = FALSE)
# The two LMM differ by one *df* and the likelihood of
# the second model is higher:
anova(lmm0, lmm1)
## Models:
## lmm0: y ~ (1 | subject)
## lmm1: y ~ fac + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmm0 3 126.49 134.30 -60.243 120.49
## lmm1 4 123.86 134.28 -57.928 115.86 4.6311 1 0.0314 *
# Now equivalent models with GAM and RE-smooth term:
gm0 <- gam(y ~ s(subject, bs = "re"), method = "ML")
gm1 <- update(gm0, . ~ . + fac)
# Fixed effect coefficients, their SE, as well as RE variance components are
# almost identical between GAM and LMM:
# FX:
cbind(coef(gm1)[1:2], fixef(lmm1))
# SE:
cbind(sqrt(diag(vcov(gm1)[1:2, 1:2])),
sqrt(diag(vcov(lmm1))))
# RE variance components:
VarCorr(lmm1)
gam.vcomp(gm1)
# In GAM, unlike in LMM, there is virtually no difference in likelihood of the
# two models in a pair.
AIC(lmm0, lmm1, gm0, gm1)
## df AIC
## lmm0 3.0 126.5
## lmm1 4.0 122.9
## gm0 20.9 30.7 <--
## gm1 20.9 30.7 <--
# Curiously, this happens when 'fac' spans 'subject's, i.e. no subject takes
# two different levels of 'fac'. Reference dfs in 'gm0' (the null model) is 19,
# i.e. number of subjects - 1. But in 'gm1', where 'fac' is included, ref. df.
# is 18.
table(fac, subject)
## subject
## fac a b c d e f g h i j k l m n o p q r s t
## A 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0
## B 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5
summary(gm0)
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(subject) 18.92 19 248.6 <2e-16 ***
summary(gm1)
## edf Ref.df F p-value
## s(subject) 17.91 18 208 <2e-16 ***
# However, when 'fac' intersects 'subject's, i.e. when single subject can encompass
# different levels of 'fac', ref. df is again m - 1 = 19.
gm1b <- gam(y ~ sample(fac) + s(subject, bs = "re"))
summary(gm1b)
## edf Ref.df F p-value
## s(subject) 18.92 19 246.3 <2e-16 ***