I am getting a strange output using tab_model() after fitting a beta regression model that does not seem to make sense. The intercept and the coefficients dont seem to correspond to the plot. For example the CI for soil_moisture seems way to high. Also, shouldnt there be two intercepts since I have to regression lines?
What am I missing?
This is my data:
df <- structure(list(weed_coverage = c(0.002, 0.01, 0.015, 0.01, 0.001,
0.03, 0.006, 0.012, 0.03, 0.01, 0.002, 0.05, 0.004, 0.02, 0.02,
0.006, 0.03, 0.01, 0.015, 0.01), soil_moisture = c(0.1125, 0.1343,
0.1662, 0.3402, 0.2195, 0.1923, 0.2277, 0.2577, 0.148, 0.2715,
0.104, 0.1495, 0.2788, 0.3477, 0.1835, 0.3175, 0.134, 0.3488,
0.3477, 0.1097), distance = structure(c(2L, 2L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("2",
"5"), class = "factor"), wc_pr = c(`1` = 0.0144421159096612,
`2` = 0.0148173851570077, `3` = 0.0146600960637327, `4` = 0.0188698067335207,
`5` = 0.0132256975788894, `6` = 0.0139395025623511, `7` = 0.0130176686719618,
`8` = 0.0171297102952414, `9` = 0.0150581171360966, `10` = 0.0119600057973879,
`11` = 0.0142983464947494, `12` = 0.0150847074475541, `13` = 0.0117921770637613,
`14` = 0.019036340203918, `15` = 0.0141784723499839, `16` = 0.0109405819057696,
`17` = 0.0148121562892363, `18` = 0.0190608859962305, `19` = 0.0103185336737258,
`20` = 0.0163480105406738)), class = "data.frame", row.names = c(NA,
-20L))
This is my code:
library(sjPlot)
library(betareg)
library(ggplot2)
betareg (weed_coverage ~ soil_moisture * distance, data = df) -> model_b # fit beta regression model
tab_model(model_b)
df %>% mutate(wc_pr= predict(model_b , type = "response")) -> df # create column with prediction values for weed_coverage
ggplot(df, aes(x = soil_moisture, y = weed_coverage, color = distance)) + # Plot the model
geom_point(size = 2, shape = 21) +
geom_line(aes(y = wc_pr, color = distance), data = df)+
theme_bw()
The output of tab_model(model_b):
This is the plot I get:
Okay, so there are a number of things happening here that I'm going to go through. The first is the easiest, which is to clarify that this model only has one intercept, which is the estimated value of
weed_coverage
when your independent variables are equal to zero. For this reason, these values are not always directly interpretable.The two regression lines on your plot are the predicted slope of
soil_moisture
at the two different values of distance. This is because there is an interaction, so the slope of the line forsoil_moisture
is dependent on the level ofdistance
.Now, the CIs are being correctly reported by
tab_model()
, but the default behaviour of most functions in the sjPlot package is to transform values on the log scale using theexp()
function, to get something closer to the probability values. If you want the original model values in the log space where the model was fit, you have to specifytransform=NULL
like this:You will now notice that the model is reporting the same numbers as if you call the following:
The CIs are very large, but that is because all of the values for
weed_coverage
are very low. In the log-odds space, small percentage or proportion changes become very large when they are close to 0 or 1. In this example model, the model is very uncertain, which is why the back-transformed CIs on the probability scale are so large. It is also worth noting that these estimates are for the change inweed_coverage
for a change in 1 ofsoil_moisture
. If you scaled this variable, the model would not change but the numbers would be easier to interpret in the context of your data, as instead of estimating the uncertainty in the regression line far beyond the confines of your data, everything will be reporting for a change of 1 standard deviation ofsoil_moisture
:Please let me know if there is anything here that you do not understand.
EDIT: To clear up the question you had in the comment about the intercept, I plotted the two regression lines from the regression with the scaled independent variable. The intercept in this model corresponds to the value of
weed_coverage
when thescaled_soil_moisture
equals zero and the distance is equal to 2 (because distance is a two-level factor, the default behaviour is usually to create a dummy variable with the first level being represented by the intercept). As you can see, on the red line when the x-axis is equal to 0, the y-axis is approximately 0.01 (it has been rounded down). This is in contrast the the non-scaled model, where the intercept is not pictured on the graph because there are no values of zero in the data. If you continued the red line past the data to where x=0, you would see that it crosses the y-axis around 0.02.This page explains this is a fairly intuitive way.