How does fixest handle negative values of the demeaned dependent variable in poisson estimations?


I need to perform GLM (Poisson) estimations with fixed-effects (say, unit FE only) and several regressors (RHS variables). I have an unbalanced panel dataset in which most (~90%) of the observations have missing values (NA) for some, but not all, regressors.

fixest::feglm() can handle this and returns a fitted model. However, to do so, it (and fixest::demean as well) removes observations with at least one missing regressor before constructing the fixed-effect means.

In my case, I am afraid this means discarding a significant share of the available information. I would therefore like to demean my variables by hand, so that each fixed-effect dimension's mean is computed on as much information as possible, and then run feglm on the demeaned data. However, demeaning produces negative values of the dependent variable, which is not compatible with Poisson. If I run feglm with the "poisson" family on my manually demeaned data, I (understandably) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned by fixest::demean.

Question:

How does feglm handle negative values of the demeaned dependent variable? Is there a way (some data transformation, for instance) to reproduce the result of fepois with fixed-effects in the formula by running fepois on demeaned data with a formula that contains no fixed-effects?

To use the example from the fixest::demean documentation (with two-way fixed-effects):

library(fixest)
data(trade)

base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)

# We center the two variables ln_dist and ln_euros
#  on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean

and I would like to reproduce

est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)

with

est = fepois(ln_euros_dm ~ ln_dist_dm, base)
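
Indeed, a quick check confirms that the demeaned outcome is centered around zero and therefore contains negative values, which is exactly what fepois rejects:

# ln_euros_dm is ln_euros net of its Origin/Destination means, so it is
# centered around zero and (unless constant within cells) takes negative values
any(base$ln_euros_dm < 0)  # expected to be TRUE

# consequently, fepois(ln_euros_dm ~ ln_dist_dm, base) stops with the
# "Negative values of the dependent variable..." error quoted above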

1 Answer


I think there are two main problems.

Modelling strategy

In general, it is important to be able to formally write down the model being estimated. In this case it would not be possible to describe, with a single equation, a model in which the fixed-effects are estimated on all the data while the other coefficients are estimated only on the non-missing observations. And if the model cannot be written down clearly, then... maybe it is not a good model.

On the other hand, if your model is well defined, then removing observations at random should not change the expected value of the coefficient estimates, only their variance. So again, if your model is well specified, you should not worry too much.

By suggesting that observations with missing values are relevant for estimating the fixed-effects coefficients (or, stated differently, that they should be used to demean some variables), you are implying that these observations are not missing at random. And now you should worry.

Merely using these observations to demean the variables would not remove the bias in the estimated coefficients caused by selection into non-missingness. That is a deeper problem which cannot be fixed by technical tricks, only by a thorough understanding of the data.

GLM

There is a misunderstanding about GLM. GLM is a very clever trick to estimate maximum-likelihood models with OLS, via iteratively reweighted least squares (there is a nice description here). It was developed at a time when general-purpose optimization was computationally expensive, and it made it possible to rely instead on well-developed, fast OLS techniques to perform equivalent estimations.
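
To make the iteration concrete, here is a minimal sketch (not taken from fixest, with illustrative names) of IRLS for a Poisson model with a log link, written purely as repeated weighted OLS steps:

# Minimal IRLS sketch for a Poisson model with a log link:
# each iteration is nothing more than a weighted OLS on a "working" outcome.
# y: outcome vector, X: design matrix (including an intercept column).
irls_poisson <- function(y, X, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)             # current linear predictor
    mu  <- exp(eta)                     # inverse link
    w   <- mu                           # IRLS weights (Poisson variance function)
    z   <- eta + (y - mu) / mu          # working dependent variable
    fit <- lm.wfit(x = X, y = z, w = w) # one plain weighted OLS step
    beta_new <- fit$coefficients
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

On data without missing values, this should coincide, up to the convergence tolerance, with a standard glm(..., family = poisson) fit on the same design.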

GLM is an iterative process in which a standard weighted OLS estimation is performed at each step; the only things that change across iterations are the working dependent variable and the weights attached to each observation. Since each step is a regular OLS, the techniques used to perform fast OLS estimations with multiple fixed-effects can be leveraged, which is exactly what the fixest package does.

So you could indeed do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running the GLM, because it simply makes no sense (the Frisch-Waugh-Lovell theorem does not apply here).
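
To show where the demeaning would have to happen, here is a hedged sketch of the same IRLS loop with a single fixed-effect: the working outcome and the regressor are within-transformed inside every iteration, using the current IRLS weights. This mimics the general logic (not the exact implementation) of feglm/fepois, and the helper names wdemean and irls_pois_fe are made up for the example:

# Weighted within-transformation: subtract the weighted group mean.
wdemean <- function(v, fe, w) {
  v - ave(v * w, fe, FUN = sum) / ave(w, fe, FUN = sum)
}

# Poisson IRLS with one fixed-effect: the demeaning happens INSIDE each
# iteration, on the working variable and with the current weights.
irls_pois_fe <- function(y, x, fe, tol = 1e-8, max_iter = 50) {
  eta  <- rep(log(mean(y)), length(y))   # crude starting linear predictor
  beta <- 0
  for (i in seq_len(max_iter)) {
    mu <- exp(eta)
    w  <- mu                             # IRLS weights
    z  <- eta + (y - mu) / mu            # working dependent variable
    z_dm <- wdemean(z, fe, w)            # demean inside the iteration...
    x_dm <- wdemean(x, fe, w)            # ...using the current weights
    beta_new <- sum(w * x_dm * z_dm) / sum(w * x_dm^2)  # weighted OLS, no intercept
    eta <- z - (z_dm - x_dm * beta_new)  # update linear predictor (incl. the FE part)
    if (abs(beta_new - beta) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

# With the trade data from the question (one-way FE version for simplicity),
# this should be close to coef(fepois(Euros ~ ln_dist | Origin, base))["ln_dist"]:
# irls_pois_fe(base$Euros, base$ln_dist, base$Origin)

The key point is that the within-transformation applies to the working variable z and uses the weights w, both of which change at every iteration; demeaning the raw data once, before the GLM loop, is not the same operation at all.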