In this question / answer from 5 years ago about logLik.lm() and glm(), it was pointed out that code comments in the R stats module suggest that lm() and glm() both internally calculate some kind of scale or dispersion parameter, presumably one describing the estimated dispersion of the observation values being predicted by the regression. This naturally begets another question: if it's truly a real parameter being estimated by the fit algorithm somewhere (or even if it's just some kind of implicit / effective parameter), how do I access this parameter from the resulting fit object?
I've produced an MWE (plus supporting setup / plot code) below:
Part 1 constructs some simulated input data, which we'll fit to a straight line (implying two fit parameters are expected). Given that the question is about a hidden, internally modeled dispersion parameter, I wanted to make sure the fit algorithm is forced to do something interesting, so 10% of the points have been deliberately modeled as outliers. If you understand what's shown in the plot below, you can probably skip reading this portion of the code.
Part 2 is the main body of the MWE, illustrating the point my question is asking about: it runs glm() on the input data and examines some of the results, demonstrating that logLik() claims three parameter estimates, in apparent disagreement with glm(), which seems to give two.
Part 3 just produces a small supplementary figure based on the input data and results. It's only included for completeness and reproducibility; you can probably skip reading it too.
library(tidyverse)
# ---- Part 1: Simulate sample data, with 10% outliers ----
# Set x values to stepwise 1D grid
n <- 100
xlo <- 0
xhi <- 10
step <- (xhi - xlo) / n
x <- seq(xlo+step/2, xhi-step/2, step)
# Set slope and intercept of a simple line
intercept <- 2
slope <- 0.2
# Initially set all y observations to have standard deviation 0.2, but
# then override 10% of them at random indices so that they become
# outliers with standard deviation 5
frac_outlier <- 0.1
sd <- rep(0.2, n)
is_outlier <- rep(FALSE, n)
set.seed(100)
# Generates 10 non-repeating random integer indices between 1 and n
ranidx <- order(runif(n))[1:round(frac_outlier*n)]
sd[ranidx] <- 5
is_outlier[ranidx] <- TRUE
set.seed(200)
# Generate simulated y data, with 90% of points having sd = 0.2 and 10% of
# points having sd = 5 (i.e., they are outliers)
y <- intercept + slope * x + rnorm(n, sd=sd)
# Package simulated data into a tibble
tbl <- tibble(x, y, is_outlier)
# ---- Part 2: Perform regression and demonstrate unexpected results ----
# Estimate fit parameters using glm
fitobj <- glm(formula = y ~ x, data = tbl)
# Verify that the example fit produces two fit coefficients
message("\nEstimated fit parameters are:")
print(fitobj$coefficients)
# Verify that the log-likelihood function says there are effectively three
# fit parameters (i.e., the 3rd estimated parameter presumably being the
# effective "scale" or "dispersion" parameter of the y observation error
# distribution, as described in another stackoverflow question)
ll <- logLik(fitobj)
message(paste0("\nlogLik number of estimated fit parameters: ", attr(ll, "df")))
# Verify that the Iteratively Reweighted Least Squares algorithm promised
# by the glm documentation does not actually seem to perform any re-weighting
message(paste0("\nNumber of weights equal to 1: ", sum(fitobj$weights == 1)))
message(paste0("Number of weights not equal to 1: ", sum(fitobj$weights != 1)))
# Print fit object attributes. Crux of this post: where in this list would
# we find the estimated scale parameter (i.e., parameter #3) which
# characterizes the glm algorithm's internal estimate of the y observation
# error?
message("\nFit object attributes:")
print(attributes(fitobj))
# ---- Part 3 (supplementary): Plot results to use as extra visual aid ----
plt <- ggplot(tbl, aes(x=x, y=y)) +
geom_point(aes(color=is_outlier)) +
scale_color_manual(values=c("black", "red")) +
geom_smooth(method = glm, formula = y ~ x)
print(plt)
Here's the console output produced by Part 2 of the code snippet above:
Estimated fit parameters are:
(Intercept) x
2.1012286 0.1869693
logLik number of estimated fit parameters: 3
Number of weights equal to 1: 100
Number of weights not equal to 1: 0
Fit object attributes:
$names
[1] "coefficients" "residuals" "fitted.values" "effects"
[5] "R" "rank" "qr" "family"
[9] "linear.predictors" "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights" "df.residual"
[17] "df.null" "y" "converged" "boundary"
[21] "model" "call" "formula" "terms"
[25] "data" "offset" "control" "method"
[29] "contrasts" "xlevels"
$class
[1] "glm" "lm"
and in case it's helpful, here's a little visualization of the simulated data and fit results:

[figure: scatter plot of the simulated data (outliers in red) with the fitted line and confidence band produced by Part 3]
My primary question: among the many attributes() and names() comprising the fit object, how do I access the dispersion parameter which is being counted as a third parameter by the logLik() function?
As an additional side issue, the official R documentation for glm states that "The default method "glm.fit" uses iteratively reweighted least squares (IWLS)." The Wikipedia entry for IWLS describes it as a robust regression technique designed to minimize the influence of outliers by recalculating the posterior weights associated with the observations. However, as you can see in the console output, all posterior weights are being set equal to 1; i.e., the IWLS functionality inside glm() does not seem to be getting triggered, in spite of the simulated input data containing obvious outliers.
Extra bonus question: what's going on here, and why doesn't glm() seem to behave as the documentation would suggest?
Edit: my purpose in asking this question was to seek further clarification on how the R stats module deals with counting the number of parameters in a model, and in particular to explore how it approaches estimating the dispersion parameter of a fitted model.
I ended up having an interesting follow-up exchange in the comment section with the contributor who answered my question. At his suggestion, I'm copying that brief thread up here, as I agree the additional information he linked significantly adds to my understanding of the broader context in which the stats package developers have approached this issue.
Q: Your expression sqrt(deviance(obj)/(nobs(obj)-length(coef(obj)))) is not an independently adjustable parameter; i.e., not a "knob" you can turn to further increase argmax(LL). I assert: 1.) attr(logLik(fitobj), "df") is mixing concepts when it says there are 3 parameters, as 2 are independently adjustable while the 3rd (the dispersion) is not; 2.) if a model selection package uses the "df" attribute to calculate AIC/BIC, as in the original question linked from 5 years ago, then it's wrong, because AIC/BIC should count only independently adjustable parameters. Would you agree or disagree?
A: I disagree. https://stackoverflow.com/a/67134012/3083138 (for further discussion you should probably ask a question on CrossValidated; I couldn't find one there already).
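For concreteness (a small check of my own, using fitobj from the MWE above): the "df" attribute is exactly what AIC() consumes, so the dispersion really is counted as a third parameter in the information criteria.

ll <- logLik(fitobj)
# AIC = -2*logLik + 2*df, where df = 3 here because the dispersion is counted
manual_aic <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
all.equal(manual_aic, AIC(fitobj))  # TRUE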
sigma(fitobj) is the generic way to retrieve a dispersion parameter from a fitted model in R (it works for lm and glm objects, as well as others like lme4::lmer() fits). Here "dispersion parameter" is defined as the residual standard deviation, so sigma(fitobj)^2 matches the dispersion value printed in the summary (and also matches @Kieran's answer, which computes the mean residual sum of squares directly). stats:::sigma.default() uses (approximately) sqrt(deviance(fitobj)/(nobs(fitobj)-length(coef(fitobj)))). For linear models, the deviance is the sum of squared residuals. (Dispersion parameters can also be estimated using the sum of squared Pearson residuals divided by (n-p), and in various other ways; all of these estimates converge asymptotically to the same answer but differ in their finite-sample properties.)
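For instance, with the fitobj from the MWE above, the following equivalent computations should all return the same value:

# All of these agree for the gaussian fit above:
sigma(fitobj)^2                                    # generic dispersion accessor, squared
deviance(fitobj) / df.residual(fitobj)             # what sigma.default() computes (squared)
sum(residuals(fitobj)^2) / (nobs(fitobj) - length(coef(fitobj)))  # mean residual SS
summary(fitobj)$dispersion                         # the value printed by summary()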
The last clause of that Wikipedia sentence ("as a way of mitigating the influence of outliers ...") applies only to the part about robust regression; it's not what IWLS does when it's applied in the GLM context. In GLMs, IWLS reweights the influence of observations based on their expected variance under an assumed model of the variance-mean relationship (for example, when fitting a Poisson model we would assume the variance was equal to the predicted mean value); outliers aren't adaptively handled in this case. The "weights" that are computed at the final step are available via weights(fitobj, "working") (as opposed to prior weights, which are related e.g. to the number of trials for a binomial variable). In this case, however, they are all exactly 1 because you specified (by default) family = "gaussian", in which case all observations are assumed to have the same variance independent of their means, so all weights are the same.
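To see non-trivial working weights, you need a family whose variance depends on the mean. A minimal sketch with made-up data (my illustration, not part of the MWE):

# Hypothetical Poisson example: the variance is assumed equal to the mean,
# so the working weights track the fitted means rather than being constant
set.seed(1)
pois_tbl <- tibble(x = 1:50, y = rpois(50, lambda = exp(0.05 * (1:50))))
pois_fit <- glm(y ~ x, family = poisson, data = pois_tbl)
range(weights(pois_fit, "working"))  # spans a range; no longer all equal to 1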
If you want weights that adapt to outliers, you need a robust fitter; the sketch below shows that, as you expected, such a fit gives a few points very low weight (because you set them up as outliers) and many points relatively high weights.
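A minimal sketch, assuming MASS is installed (rlm() is one standard robust-regression fitter; the original tbl from the MWE is reused):

library(MASS)
# Robust regression via M-estimation; its IWLS loop adaptively downweights outliers
rfit <- rlm(y ~ x, data = tbl)
# rfit$w holds the final robustness weights: near 0 for the outliers,
# close to 1 for the well-behaved points
plot(rfit$w, xlab = "observation index", ylab = "robustness weight")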