I want to do a regression with count data model, where doctor visits is the dependent variable. I did a two-part model with first a probit model for no doctor visit at all or one or more and then a Poisson model for at least one doctor visit. After that, I did a hurdle model as a robustness check, because as far as I know, I should get very similar values for both approaches. I do get nearly the same values for the probit part. I get, however, very different values for the Poisson part. Does anyone have any idea why? Here are the commands I used:
probit_doc <- glm(docbin ~ phi + gender + age + health + educ + smoke + logthinc +
wave + AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE +
SWI + BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST +
CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
data=allwaves, family=binomial(link="probit"))
poisson_doc <- glm(I(doc > 0) ~ phi + gender + age + health + educ + smoke +
logthinc + wave + AUS + GER + SWE + NED + ESP + ITA + FRA +
DEN + GRE + SWI + BEL + ISR + CZE + POL + LUX + HUN + POR +
SVN + EST + CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
data=allwaves, family="poisson")
hd_doc <- hurdle(doc ~ phi + gender + age + educ + smoke + logthinc + wave +
AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI +
BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + CRO +
LIT + BUL + CYP + FIN + LVA + MAL + ROM | phi + gender +
age + health + educ + smoke + logthinc + wave + AUS + GER +
SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI + BEL + ISR +
CZE + POL + LUX + HUN + POR + SVN + EST + CRO + LIT + BUL +
CYP + FIN + LVA + MAL + ROM,
dist="poisson", data=allwaves, zero.dist="binomial", link="probit")
In a hurdle model we usually calculate a zero part using a
binomial
model for the "hurdle", and a count part for what happens after the "hurdle" was passed. Accordingly, in the zero part we binarize the outcome on if it is unequal to zero, as you did correctly.In the count part however, only the observations that have passed the hurdle are taken into account. So rather than manipulating the outcome variable as we did in the zero model, we want to
subset
the data to observations that have values unequal to zero.Consider this example from the
pscl
package which you are obviously using.We may replicate the zero part by binarizing the outcome variable
I(art > 0)
.For the count part, if we binarize the outcome variable
I(art > 0)
we get wrong results.Instead, we want to
subset
the data on observations where the outcome is non-zero, and the values already better resemble those of thehurdle
.However, as you actually stated in your title,
hurdle
is calculating a truncated poisson model, which is also known as positive poisson distribution whereas you used a conventional poisson in theglm
.The
VGAM
package provides apospoisson
family function which gives exactly what we want.The visual comparison of the two distributions shows that positive poisson is indeed better suited to fit the count part of a hurdle model.