Truncated Poisson vs Hurdle model: why different values?

230 Views Asked by At

I want to do a regression with count data model, where doctor visits is the dependent variable. I did a two-part model with first a probit model for no doctor visit at all or one or more and then a Poisson model for at least one doctor visit. After that, I did a hurdle model as a robustness check, because as far as I know, I should get very similar values for both approaches. I do get nearly the same values for the probit part. I get, however, very different values for the Poisson part. Does anyone have any idea why? Here are the commands I used:

probit_doc <- glm(docbin ~ phi + gender +  age + health + educ + smoke + logthinc + 
                    wave + AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + 
                    SWI + BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + 
                    CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM, 
                  data=allwaves, family=binomial(link="probit"))
  
poisson_doc <- glm(I(doc > 0) ~ phi + gender + age + health + educ + smoke + 
                     logthinc + wave + AUS + GER + SWE + NED + ESP + ITA + FRA +
                     DEN + GRE + SWI + BEL + ISR + CZE + POL + LUX + HUN + POR +
                     SVN + EST + CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
                   data=allwaves, family="poisson")
  
hd_doc <- hurdle(doc ~ phi + gender + age + educ + smoke + logthinc + wave + 
                   AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI +
                   BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + CRO +
                   LIT + BUL + CYP + FIN + LVA + MAL + ROM | phi + gender +  
                   age + health + educ + smoke + logthinc + wave + AUS + GER + 
                   SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI + BEL + ISR + 
                   CZE + POL + LUX + HUN + POR + SVN + EST + CRO + LIT + BUL + 
                   CYP + FIN + LVA + MAL + ROM,
                 dist="poisson", data=allwaves, zero.dist="binomial", link="probit")
1

There are 1 best solutions below

0
On

In a hurdle model we usually calculate a zero part using a binomial model for the "hurdle", and a count part for what happens after the "hurdle" was passed. Accordingly, in the zero part we binarize the outcome on if it is unequal to zero, as you did correctly.

In the count part however, only the observations that have passed the hurdle are taken into account. So rather than manipulating the outcome variable as we did in the zero model, we want to subset the data to observations that have values unequal to zero.

Consider this example from the pscl package which you are obviously using.

library(pscl)
hurdle(art ~ ., dist="poisson", zero.dist="binomial", link="probit", data=bioChemists)$coe
# $count
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.67113931 -0.22858266  0.09648499 -0.14218756 -0.01272637  0.01874548 
# 
# $zero
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.15420081 -0.14616546  0.19834717 -0.17380191  0.01864404  0.04433777 

We may replicate the zero part by binarizing the outcome variable I(art > 0).

## zero model
glm(I(art > 0) ~ ., family=binomial(link='probit'), bioChemists)$coe
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.15420081 -0.14616546  0.19834717 -0.17380191  0.01864405  0.04433781

For the count part, if we binarize the outcome variable I(art > 0) we get wrong results.

## count model misspecified
glm(I(art > 0) ~ ., family=poisson(), bioChemists)$coe  
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
# -0.52364397 -0.07020092  0.08944930 -0.08814266  0.01987790  0.01258345

Instead, we want to subset the data on observations where the outcome is non-zero, and the values already better resemble those of the hurdle.

## count model correctly specified
glm(art ~ ., family=poisson(), subset(bioChemists, art > 0))$coe
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.82722135 -0.16220935  0.06824070 -0.09902318 -0.01311912  0.01504273 

However, as you actually stated in your title, hurdle is calculating a truncated poisson model, which is also known as positive poisson distribution whereas you used a conventional poisson in the glm.

The VGAM package provides a pospoisson family function which gives exactly what we want.

VGAM::vglm(art ~ ., family=VGAM::pospoisson(), subset(bioChemists, art > 0)) |> 
  coefficients()
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.67113934 -0.22858262  0.09648498 -0.14218724 -0.01272657  0.01874550 

The visual comparison of the two distributions shows that positive poisson is indeed better suited to fit the count part of a hurdle model.

enter image description here