mgcv predict with bivariate te smooth : average over one covariate

50 Views Asked by At

Say I'm using mgcv in R to model a time trend with seasonality that changes over time, say te(year, month) ... Is it possible or even logical to generate a smooth estimate or fitted values for year averaged over the month term, ie the pure time trend. I realise I can do this with separate smooths for each term but just not sure about the bivariate case. Dummy example code below but I appreciate this is not a great example.

set.seed(123)

# Specify the number of observations
n <- 100

# Generate dummy data
data <- gamSim(n = n, dist = "normal", scale = 2)

# Add year and month as covariates
data$year <- rep(1:(n/12), each = ceiling(n/12))[1:n]
data$month <- rep(1:12, length.out = n)

# Create a bivariate tensor product smooth for the seasonal component
# using te() in the gam() function, and include time_index
model <- gam(y ~ te(year, month), data = data.frame(data))
1

There are 1 best solutions below

1
On

I would think that you'd get something reasonable if you generated prediction data for each year and a reasonable number of points over the interval 0-12 (for the month marginal smooth), and then averaged the predicted values (or the values of the smooth) for each year. The more values you use in year the smoother your resulting average trend will be over years and the larger then number of values over the months, the more accurate the average trend will be.

library("gratia")
library("dplyr")
library("ggplot2")

ds <- data_slice(model,
  year = evenly(year, n = 250),
  month = evenly(month, n = 250))

fv <- fitted_values(model, data = ds)
avg <- fv |>
  group_by(year) |>
  summarise(avg_trend = mean(.fitted))

avg |>
  ggplot(aes(x = year, y = avg_trend)) +
  geom_line() +
  geom_line(data = fv,
    mapping = aes(y = .fitted,
      colour = month,
      group = month),
    alpha = 0.2) +
  scale_colour_viridis_c(option = "plasma")

which produces

enter image description here

A more direct way to get something like this would be to decompose the tensor product in the main effects of the marginal smooths and their pure interaction:

knots <- list(month = c(0.5, 12.5))
model2 <- gam(y ~ s(year, bs = "cr", k = 8) +
    s(month, bs = "cc", k = 12) +
    ti(year, month, bs = c("cr", "cc"), k = c(8, 12)),
  method = "REML",
  knots = knots,
  data = data.frame(data))

That s(year) term looks like this:

enter image description here

Let's compare:

ds2 <- model2 |>
  data_slice(year = evenly(year, n = 250))
fv2 <- model2 |>
  fitted_values(data = ds2, exclude = smooths(model2)[2:3])

fv2 |>
  select(year, .fitted) |>
  left_join(avg) |>
  tidyr::pivot_longer(cols = -year) |>
  mutate(name = forcats::fct_recode(name,
    "s(year)" = ".fitted", "te(year,month)" = "avg_trend")) |>
  ggplot(aes(x = year, y = value, colour = name)) +
  geom_line() +
  scale_colour_manual(values = c("s(year)" = "#28E2E5",
      "te(year,month)" = "#DF536B"),
    breaks = c("s(year)", "te(year,month)")) +
  labs(colour = "Model") +
  theme(legend.position = "top")

producing

enter image description here

I think the version from the te() form shows more variation because that is an average of the smooth curves over the full fitted surface, whereas the one from the decomposed model is like an average smooth of year through the data, where we've explicitly removed (or not considered) the monthly variation because that is bound up in its own main effect smooth.

For more on what the s(year) term means in the decomposed form, see my answer to a related question on CrossValidated, particularly the answer to Q2.

Which of the types works best for you will depend on exactly what you are trying to show. I think I would go with the decomposed form myself.