glm model dataset summarisation


First post, so go easy.

In the insurance world of GLMing, the classic approach is to model claims frequency and average severity. With that in mind, I built a couple of models to experiment for myself and now have a question.

Could somebody please explain how GLM handles varying levels of summarisation of a dataset, particularly with regard to error estimates?

Consider the example below. The data exhibits strong severity trends for both variables:

- A has more expensive claims than B
- Ford > Kia > Vaux > Jag

I fitted a model to an unsummarised and a summarised version of the dataset, and accordingly the GLM fitted the same parameters in both cases.

However, the GLM indicates a well-fitted model on the unsummarised data, but when I summarise and model the weighted mean, i.e. average severity, the model fits poorly. Maybe this is as you would expect; after all, the unsummarised data has more points to model with. Also, it appears the weights only indicate RELATIVE strength, so here specifying them is pointless, since they are all the same.

But more fundamentally, can I not model average severity with a GLM? I know the result of fitting a GLM to an unsummarised dataset will be an average severity, but I was hoping to fit a model to already-summarised data. It appears that modelling on aggregated datasets will not give a true indication of the model fit.

Apologies if this is a stupid question; I'm not a statistician, so I don't fully understand the Hessian matrix.

Please see code below:

library(plyr)   # provides ddply() for the aggregation below

dataset <- data.frame(
  Person = rep(c("A", "B"), each=200), 
  Car = rep(c("Ford", "Kia", "Vaux", "Jag"), 2, each=50),
  Amount = c(rgamma(50, 200), rgamma(50, 180), rgamma(50, 160), rgamma(50, 140), 
         rgamma(50, 100), rgamma(50, 80), rgamma(50, 60), rgamma(50, 40))
)

Agg1 <- ddply(dataset, .(Person, Car), summarise, mean=mean(Amount), length=length(Amount))

m1 <- glm(Amount ~ Person + Car, data = dataset, family = Gamma(link="log")) 
m2 <- glm(mean ~ Person + Car, data = Agg1, family = Gamma(link="log"), weights=length) 

summary(m1)
summary(m2)

Thanks,

Nick

1 Answer

Bottom line is that both models are identical - the reason the aggregated model "fits poorly" is entirely the reduction in degrees of freedom caused by aggregation.
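
You can see the loss of information directly from the residual degrees of freedom of the two fits:

df.residual(m1)   # 400 observations minus 5 coefficients = 395
df.residual(m2)   # 8 aggregated rows minus 5 coefficients = 3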

Before getting into why the models are identical, I should point out that this does not necessarily mean that either model is a good fit. You should run diagnostics on both, especially using:

par(mfrow=c(2,2))
plot(m1)

When you do this, you'll see that the residuals are normally distributed (which is essential), but that they follow a pattern (-, +, -), which is disturbing. I would want to understand that before declaring this a good model. [Admittedly, this is made-up data, but the principles apply nevertheless.]
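
For a more explicit look at the same thing, you could pull the deviance residuals out yourself, for example:

res1 <- residuals(m1, type = "deviance")   # deviance residuals of the unsummarised fit
qqnorm(res1); qqline(res1)                 # check normality directly
plot(fitted(m1), res1)                     # look for the (-, +, -) pattern against the fitted values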

Comparing the aggregated model to the base model, look at the values of the coefficients:

coef.m1 <- summary(m1)$coefficients
coef.m2 <- summary(m2)$coefficients
cbind(coef.m1[,1],coef.m2[,1])
#                   [,1]       [,2]
# (Intercept)  5.4096980  5.4096976
# PersonB     -0.9249371 -0.9249366
# CarJag      -0.6144606 -0.6144602
# CarKia      -0.1786556 -0.1786555
# CarVaux     -0.3597925 -0.3597923

The reason you think the aggregated model is "worse" is the p-values, but these depend on t = coef / se. The ratio of the standard errors in m2 vs. m1 is the same for all coefficients:

coef.m2[,2]/coef.m1[,2]
# (Intercept)     PersonB      CarJag      CarKia     CarVaux 
#    7.836171    7.836171    7.836171    7.836171    7.836171 
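
Since the coefficient estimates themselves are essentially identical, the t-values differ by this same factor, which is what pushes the p-values of m2 towards non-significance. A quick check along the same lines (using the coef.m1 and coef.m2 matrices above, whose third column holds the t values):

coef.m1[,3] / coef.m2[,3]
# each ratio should be ~7.84, the same factor as the standard errors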

Since

se ≈ sd / √df

the ratio of the standard errors for the two models should be approximately

se_m2 / se_m1 ≈ √( (n_m1 - 1) / (n_m2 - 1) )

sqrt((nrow(dataset)-1)/(nrow(Agg1)-1))
# [1] 7.549834

Frankly I'm puzzled why the ratio is not exactly equal to 7.55.
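
One place to look (just a guess at where the remaining gap lives): the standard errors scale with the square root of the estimated dispersion, and the two fits estimate that dispersion from very different residual degrees of freedom. If the design information in the two fits really is the same, the se ratio should reduce to:

sqrt(summary(m2)$dispersion / summary(m1)$dispersion)
# if this reproduces the 7.836 above, the gap from 7.55 is down to how the dispersion is estimated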

Put another way, glm(...) has no way of knowing that you aggregated your data. It thinks you are trying to fit a model with 4 parameters and an intercept to 8 data points.