I am trying to run a balance check on a pandas DataFrame using OLS with entity fixed effects. An example DataFrame is below:
county | year | treatment_vs_control | age | gender |
---|---|---|---|---|
Jefferson | 2022 | 1 | 24 | M |
Jackson | 2022 | 1 | 31 | M |
Jefferson | 2022 | 0 | 28 | F |
Jackson | 2022 | 1 | 24 | null |
Adams | 2022 | 0 | 72 | F |
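First, I run the model with the gender field as-is:
from linearmodels.panel import PanelOLS

# df is indexed by (county, year); PanelOLS expects an entity-time MultiIndex
model_as_is = PanelOLS.from_formula(
    formula="treatment_vs_control ~ age + gender + EntityEffects",
    data=df,
).fit()
model_as_is.summary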
I get an F-statistic of ~3.05 with a p-value of 0.0001.
Then I run the model with one-hot encoded dummy gender columns. The DataFrame now looks like this:
county | year | treatment_vs_control | age | gender_m | gender_f |
---|---|---|---|---|---|
Jefferson | 2022 | 1 | 24 | 1 | 0 |
Jackson | 2022 | 1 | 31 | 1 | 0 |
Jefferson | 2022 | 0 | 28 | 0 | 1 |
Jackson | 2022 | 1 | 24 | 0 | 0 |
Adams | 2022 | 0 | 72 | 0 | 1 |
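
For reference, here is a minimal sketch of how a table like the one above can be produced with `pd.get_dummies`. The variable names `dummies` and `df_dummy` are only illustrative, and this is not necessarily the exact encoding step I used. Note that with the default `dummy_na=False`, the row with a null gender gets 0 in both dummy columns, which matches the Jackson row above.

```python
import pandas as pd

# Example rows copied from the table in the question; None stands in for the null gender.
df = pd.DataFrame(
    {
        "county": ["Jefferson", "Jackson", "Jefferson", "Jackson", "Adams"],
        "year": [2022, 2022, 2022, 2022, 2022],
        "treatment_vs_control": [1, 1, 0, 1, 0],
        "age": [24, 31, 28, 24, 72],
        "gender": ["M", "M", "F", None, "F"],
    }
)

# One-hot encode gender. With the default dummy_na=False, a missing value
# yields 0 in every dummy column instead of getting a column of its own.
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
dummies.columns = dummies.columns.str.lower()  # gender_F -> gender_f, gender_M -> gender_m

# Replace the string column with the two dummy columns, in the table's order.
df_dummy = pd.concat([df.drop(columns="gender"), dummies[["gender_m", "gender_f"]]], axis=1)
print(df_dummy)
```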
My model now looks like:
model_dummy = PanelOLS(
    dependent=df["treatment_vs_control"],
    # use the two dummy columns from the table above
    exog=df[["age", "gender_m", "gender_f"]],
    entity_effects=True,
    time_effects=False,
).fit()
model_dummy.summary
My F-statistic is now ~2.61 with a p-value of 0.0002.
If I instead keep a single gender column but make it numeric rather than string-typed, I get yet a third set of statistics.
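
As a rough illustration of the numeric version, here is a sketch that assumes a hypothetical coding of M as 1 and F as 0 (the exact mapping is not important for the question). Unlike the dummy columns, the null gender stays missing under this coding.

```python
import pandas as pd

# Gender values from the example table; None stands in for the null entry.
gender = pd.Series(["M", "M", "F", None, "F"], name="gender")

# Hypothetical numeric coding (an assumption for illustration): M -> 1, F -> 0.
# Values absent from the mapping, including the null, become NaN.
gender_numeric = gender.map({"M": 1, "F": 0})

print(gender_numeric)
print("missing values:", gender_numeric.isna().sum())
```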
Why might this happen?