Python PanelOLS different statistics with single categorical and multiple dummy columns


I am trying to run a balance check on a Pandas DataFrame using OLS with entity fixed effects. An example DataFrame is below:

county year treatment_vs_control age gender
Jefferson 2022 1 24 M
Jackson 2022 1 31 M
Jefferson 2022 0 28 F
Jackson 2022 1 24 null
Adams 2022 0 72 F

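First, I run the model with the gender field as-is.

from linearmodels.panel import PanelOLS

# Formula interface: gender is passed as a string (categorical) column
model_as_is = PanelOLS.from_formula(
    formula="treatment_vs_control ~ age + gender + EntityEffects",
    data=df
).fit()

model_as_is.summary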

I get an F-statistic of ~3.05 with a p-value of 0.0001.

Then I run the model with one-hot encoded dummy gender columns. The DataFrame now looks like this:

county year treatment_vs_control age gender_m gender_f
Jefferson 2022 1 24 1 0
Jackson 2022 1 31 1 0
Jefferson 2022 0 28 0 1
Jackson 2022 1 24 0 0
Adams 2022 0 72 0 1

My model now looks like:

# Array interface: the two dummy columns replace the single gender column
model_dummy = PanelOLS(
    dependent=df["treatment_vs_control"],
    exog=df[["age", "gender_m", "gender_f"]],
    entity_effects=True,
    time_effects=False,
).fit()

model_dummy.summary

My F-statistic is now ~2.61 with a p-value of 0.0002.

If I instead keep a single gender column but make it numeric rather than a string, I get yet a third set of statistics.

Why might this happen?
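
One way to investigate is to check whether the three encodings feed the same rows and the same columns into the model. The sketch below uses only pandas and rebuilds the sample data from the tables above. It relies on one assumption that is not confirmed by the question: that the formula interface drops rows with a missing `gender`, which `dropna` merely imitates here. Note also that `pd.get_dummies` names the columns `gender_F` and `gender_M`, not the lowercase names in the table.

```python
import pandas as pd

# Toy data copied from the question; None stands in for the "null" gender.
df = pd.DataFrame(
    {
        "county": ["Jefferson", "Jackson", "Jefferson", "Jackson", "Adams"],
        "year": [2022] * 5,
        "treatment_vs_control": [1, 1, 0, 1, 0],
        "age": [24, 31, 28, 24, 72],
        "gender": ["M", "M", "F", None, "F"],
    }
)

# Encoding 1: string column as-is. ASSUMPTION: the formula engine drops
# rows with a missing model variable; dropna() only imitates that here.
formula_sample = df.dropna(subset=["treatment_vs_control", "age", "gender"])

# Encoding 2: one-hot dummies. The null row is kept and becomes all zeros,
# so it silently acts as a third "baseline" category.
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
dummy_sample = pd.concat([df.drop(columns="gender"), dummies], axis=1)

# Encoding 3: a single numeric column. Unmapped (null) values become NaN,
# and the model fits one coefficient instead of two.
numeric_gender = df["gender"].map({"M": 1, "F": 0})

n_formula = len(formula_sample)
n_dummy = len(dummy_sample)
null_row_dummies = dummy_sample.loc[3, ["gender_F", "gender_M"]].tolist()
n_numeric_missing = int(numeric_gender.isna().sum())

print("rows with string gender (nulls dropped):", n_formula)
print("rows with dummy columns:", n_dummy)
print("dummy values for the null-gender row:", null_row_dummies)
print("missing values in the numeric column:", n_numeric_missing)
```

If the row counts or the number of gender-related regressors differ between encodings on the real data, the models are not estimating the same thing, and differing F-statistics would be expected. Comparing `nobs` and the parameter list in each model's summary is a quick way to confirm this.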
