Predictions computed by ggpredict() and ggemmeans() for a mixed-effect model differ: why?

870 Views Asked by At

I used functions ggpredict() and ggemmeans() from package ggeffects 1.3.0 to calculate mean estimates and confidence intervals (hereafter: CI) for a mixed-effect model. These functions rely on predict() and on emmeans() and make their outputs ggplot-friendly. The values predicted/estimated by the two functions differ both in their mean values and in their CI. Why?

The following reproducible example is based on dataset RIKZ (Janssen e Mulder 2005; Zuur et al. 2007), which looks at how species richness (number of species) varies with the height of sampling stations compared to mean tidal level (NAP, in meters) and the level of exposure (factor with three levels: low, medium and high):

rm(list=ls())
if (!require(pacman)) install.packages('pacman'); library(pacman)
p_load(emmeans)
p_load(ggplot2)
p_load(ggpubr)
p_load(ggeffects)
p_load(lme4, lmerTest, glmmTMB)
p_load(RCurl)
# get data:
RIKZ <- read.csv(text = RCurl::getURL(
"https://raw.githubusercontent.com/marcoplebani85/datasets/master/RIKZ.csv"))
str(RIKZ)
# "Exposure" is a factor:
RIKZ$Exposure <- as.factor(RIKZ$Exposure)

Here I fit a generalised mixed-effect model with Poisson-distributed residuals to the data using glmmTMB():

mem1 <- glmmTMB(Richness ~ NAP+Exposure + (1 | Beach),
                family="poisson",
                data = RIKZ, REML=T)

Model predictions and CI according to ggeffects::ggpredict(), without accounting for the uncertainty of random effects (see this page on why accounting for it or not):

richness.predicted <- ggpredict(mem1, 
terms=c("NAP", "Exposure"), type="fixed")

Predictions and CI for the same model according to ggeffects::ggemmeans(), without accounting for the uncertainty of random effects:

richness.emmeans <- ggemmeans(mem1, 
terms=c("NAP", "Exposure"), type="fixed")

Graphs representing the two sets of model predictions and CI together with the observed data:

p1 <- plot(richness.predicted, add.data=T) + 
    labs(title="Predictions by ggpredict()") + ylim(0,45) +
    theme(text = element_text(size = 15))
p2 <- plot(richness.emmeans, add.data=T) + 
    labs(title="Predictions by ggemmeans()") + ylim(0,45) +
    theme(text = element_text(size = 15))
ggarrange(p1, p2, ncol=2, labels=c("(a)", "(b)"), 
    common.legend=T, legend="bottom")

enter image description here

Why do the two sets of mean estimates and confidence intervals differ?

UPDATE:

This page explains the differences between the two functions:

[...] effects returned by ggpredict() [are] conditional effects (i.e. these are conditioned on certain (reference) levels of factors), while ggemmeans() returns marginal means, since the effects are “marginalized” (or “averaged”) over the levels of factors. However, these differences only apply to non-focal terms, i.e. remaining variables that are not specified in the terms argument.

It continues:

When all categorical predictors are specified in terms and further (non-focal) terms are only numeric, results are also identical (as both ggpredict() and ggemmeans() use the mean value by default to hold non-focal numeric variables constant).

I suspect the reason for the differences in the predictions obtained by the two functions for mixed-effect models lies in how they handle random effects, but I don't know how.


REFERENCES:

  1. Janssen, G. and Mulder, S. (2005), Zonation of macrofauna across sandy beaches and surf zones along the Dutch coast. Oceanologia 47(2): 265-282.
  2. Zuur, A.F., Ieno, E.N., and Smith, G.M. (2007), Analysing ecological data. Springer Science & Business Media
  3. https://strengejacke.github.io/ggeffects/articles/introduction_randomeffects.html
  4. https://strengejacke.github.io/ggeffects/articles/technical_differencepredictemmeans.html
2

There are 2 best solutions below

1
On

This really a comment, not a full answer, but perhaps it could point into the right direction to understand this subtle difference between ggpredict and ggemmeans which is actually a difference between predict.glmmTMB and emmeans.

glmmTMB uses two estimation methods: ML (maximum likelihood) and REML (restricted maximum likelihood). One chooses one or the other method by setting the argument REML appropriately.

If the estimation method is ML, then ggpredict and ggemeans give the same result. The difference occurs when the estimation method is REML and the model has random effects.

fit_glmmTMB <- glmmTMB(
  Richness ~ NAP + Exposure + (1 | Beach),
  family = poisson,
  data = RIKZ,
  REML = FALSE # Use ML to fit the model.
)

ggpredict(
  fit_glmmTMB, terms = fixed_terms, type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = medium
#> 
#>   NAP | Predicted |         95% CI
#> ----------------------------------
#> -1.40 |     13.97 | [10.64, 18.35]
#> -0.80 |     10.30 | [ 8.22, 12.91]
#> -0.20 |      7.60 | [ 6.19,  9.32]
#>  0.40 |      5.60 | [ 4.51,  6.95]
#>  1.00 |      4.13 | [ 3.20,  5.34]
#>  2.20 |      2.25 | [ 1.53,  3.29]
#>
#> [The rest of the output is cut.]
ggemmeans(
  fit_glmmTMB, terms = fixed_terms, type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = medium
#> 
#>   NAP | Predicted |         95% CI
#> ----------------------------------
#> -1.40 |     13.97 | [10.64, 18.35]
#> -0.80 |     10.30 | [ 8.22, 12.91]
#> -0.20 |      7.60 | [ 6.19,  9.32]
#>  0.40 |      5.60 | [ 4.51,  6.95]
#>  1.00 |      4.13 | [ 3.20,  5.34]
#>  2.20 |      2.25 | [ 1.53,  3.29]
#> 
#> [The rest of the output is cut.]

For comparison I also fitted the Poisson GLMM with lme4::glmer. This function doesn't have a method option and uses ML. Again, no difference.

fit_glmer <- glmer(
  Richness ~ NAP + Exposure + (1 | Beach),
  family = poisson,
  data = RIKZ
)

ggpredict(
  fit_glmer, terms = fixed_terms, type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = medium
#> 
#>   NAP | Predicted |         95% CI
#> ----------------------------------
#> -1.40 |     13.97 | [10.64, 18.34]
#> -0.80 |     10.30 | [ 8.22, 12.91]
#> -0.20 |      7.60 | [ 6.19,  9.32]
#>  0.40 |      5.60 | [ 4.51,  6.95]
#>  1.00 |      4.13 | [ 3.20,  5.34]
#>  2.20 |      2.25 | [ 1.53,  3.29]
#> 
#> [The rest of the output is cut.]
ggemmeans(
  fit_glmer, terms = fixed_terms, type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = medium
#> 
#>   NAP | Predicted |         95% CI
#> ----------------------------------
#> -1.40 |     13.97 | [10.64, 18.34]
#> -0.80 |     10.30 | [ 8.22, 12.91]
#> -0.20 |      7.60 | [ 6.19,  9.32]
#>  0.40 |      5.60 | [ 4.51,  6.95]
#>  1.00 |      4.13 | [ 3.20,  5.34]
#>  2.20 |      2.25 | [ 1.53,  3.29]
#>
#> [The rest of the output is cut.]
2
On

For type = "fixed", predict(..., re.form = NA) is used. While for type = "random", predict(..., re.form = NULL) is used. To get same results, try:

ggpredict(mem1,
  terms = c("NAP", "Exposure"), type = "random", interval = "confidence"
)

See different outputs here:

rm(list = ls())
if (!require(pacman)) install.packages("pacman")
#> Loading required package: pacman
library(pacman)
p_load(emmeans)
p_load(ggplot2)
p_load(ggpubr)
p_load(ggeffects)
p_load(lme4, lmerTest, glmmTMB)
p_load(RCurl)
# get data:
RIKZ <- read.csv(text = RCurl::getURL(
  "https://raw.githubusercontent.com/marcoplebani85/datasets/master/RIKZ.csv"
))
# "Exposure" is a factor:
RIKZ$Exposure <- as.factor(RIKZ$Exposure)

mem1 <- glmmTMB(Richness ~ NAP + Exposure + (1 | Beach),
  family = "poisson",
  data = RIKZ, REML = T
)

d <- new_data(mem1, c("NAP [-1, 0, 1]", "Exposure"))
predict(mem1, newdata = d, type = "response", re.form = NULL)
#> [1]  5.442088  3.293444  1.993127 20.725116 12.542431  7.590432 11.230864
#> [8]  6.796697  4.113227
predict(mem1, newdata = d, type = "response", re.form = NA)
#> [1]  5.526114  3.306028  1.977849 20.919609 12.515270  7.487329 11.598996
#> [8]  6.939163  4.151392
emmeans::emmeans(mem1, c("NAP", "Exposure"), at = list(NAP = c(-1, 0, 1)), type = "response")
#>  NAP Exposure  rate    SE  df asymp.LCL asymp.UCL
#>   -1 high      5.44 0.984 Inf      3.82      7.76
#>    0 high      3.29 0.543 Inf      2.38      4.55
#>    1 high      1.99 0.359 Inf      1.40      2.84
#>   -1 low      20.73 5.126 Inf     12.76     33.65
#>    0 low      12.54 3.015 Inf      7.83     20.09
#>    1 low       7.59 1.936 Inf      4.60     12.51
#>   -1 medium   11.23 1.695 Inf      8.35     15.10
#>    0 medium    6.80 0.923 Inf      5.21      8.87
#>    1 medium    4.11 0.647 Inf      3.02      5.60
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the log scale


# fixed effects with confidence intervals, conditioned on random effects
# i.e. using `re.form = NULL`
ggpredict(mem1,
  terms = c("NAP [-1, 0, 1]", "Exposure"), type = "random", interval = "confidence"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = high
#> 
#> NAP | Predicted |       95% CI
#> ------------------------------
#>  -1 |      5.44 | [3.82, 7.76]
#>   0 |      3.29 | [2.38, 4.55]
#>   1 |      1.99 | [1.40, 2.84]
#> 
#> # Exposure = low
#> 
#> NAP | Predicted |         95% CI
#> --------------------------------
#>  -1 |     20.73 | [12.76, 33.65]
#>   0 |     12.54 | [ 7.83, 20.09]
#>   1 |      7.59 | [ 4.60, 12.51]
#> 
#> # Exposure = medium
#> 
#> NAP | Predicted |        95% CI
#> -------------------------------
#>  -1 |     11.23 | [8.35, 15.10]
#>   0 |      6.80 | [5.21,  8.87]
#>   1 |      4.11 | [3.02,  5.60]
#> 
#> Adjusted for:
#> * Beach = NA (population-level)

ggemmeans(mem1,
  terms = c("NAP [-1, 0, 1]", "Exposure"), type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = high
#> 
#> NAP | Predicted |       95% CI
#> ------------------------------
#>  -1 |      5.44 | [3.82, 7.76]
#>   0 |      3.29 | [2.38, 4.55]
#>   1 |      1.99 | [1.40, 2.84]
#> 
#> # Exposure = low
#> 
#> NAP | Predicted |         95% CI
#> --------------------------------
#>  -1 |     20.73 | [12.76, 33.65]
#>   0 |     12.54 | [ 7.83, 20.09]
#>   1 |      7.59 | [ 4.60, 12.51]
#> 
#> # Exposure = medium
#> 
#> NAP | Predicted |        95% CI
#> -------------------------------
#>  -1 |     11.23 | [8.35, 15.10]
#>   0 |      6.80 | [5.21,  8.87]
#>   1 |      4.11 | [3.02,  5.60]

# fixed effects with prediction intervals, conditioned on random effects
# i.e. reform = NULL
ggpredict(mem1,
  terms = c("NAP [-1, 0, 1]", "Exposure"), type = "random"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = high
#> 
#> NAP | Predicted |        95% CI
#> -------------------------------
#>  -1 |      5.44 | [0.74, 39.88]
#>   0 |      3.29 | [0.45, 24.01]
#>   1 |      1.99 | [0.27, 14.60]
#> 
#> # Exposure = low
#> 
#> NAP | Predicted |         95% CI
#> --------------------------------
#>  -1 |     20.73 | [2.75, 156.08]
#>   0 |     12.54 | [1.67,  94.15]
#>   1 |      7.59 | [1.00,  57.38]
#> 
#> # Exposure = medium
#> 
#> NAP | Predicted |        95% CI
#> -------------------------------
#>  -1 |     11.23 | [1.55, 81.52]
#>   0 |      6.80 | [0.94, 49.13]
#>   1 |      4.11 | [0.57, 29.91]
#> 
#> Adjusted for:
#> * Beach = NA (population-level)
#> 
#> Intervals are prediction intervals. Use `interval = "confidence"` to
#>   return regular confidence intervals.

# fixed effects with confidence intervals, using `re.form = NA`
ggpredict(mem1,
  terms = c("NAP [-1, 0, 1]", "Exposure"), type = "fixed"
)
#> # Predicted counts of Richness
#> 
#> # Exposure = high
#> 
#> NAP | Predicted |       95% CI
#> ------------------------------
#>  -1 |      5.53 | [4.15, 7.36]
#>   0 |      3.31 | [2.57, 4.26]
#>   1 |      1.98 | [1.48, 2.64]
#> 
#> # Exposure = low
#> 
#> NAP | Predicted |         95% CI
#> --------------------------------
#>  -1 |     20.92 | [15.91, 27.51]
#>   0 |     12.52 | [ 9.72, 16.11]
#>   1 |      7.49 | [ 5.55, 10.09]
#> 
#> # Exposure = medium
#> 
#> NAP | Predicted |        95% CI
#> -------------------------------
#>  -1 |     11.60 | [9.47, 14.21]
#>   0 |      6.94 | [5.85,  8.23]
#>   1 |      4.15 | [3.29,  5.23]
#> 
#> Adjusted for:
#> * Beach = NA (population-level)

Created on 2023-09-23 with reprex v2.0.2