Can I correct the coefficient standard errors after oversampling my data?

I am trying to fit a fixed-effects linear regression to my data and interpret the coefficients. My dataset is imbalanced (~97% negative cases), which made it hard to fit the model and estimate a coefficient for every independent variable, so I used SMOTE to oversample the positive cases, roughly doubling the size of the dataset. I care much more about the coefficient values and their standard errors than about the predictive accuracy of the model; the question I am trying to answer is "what is the effect of x on y?" But because the SMOTE dataset is twice as large as the original, my standard errors are artificially small (overconfident). Is there a way to keep the SMOTE coefficient estimates while calculating standard errors based on the original data?

Next Door Engineer On

You need to correct for the oversampling, for example by recalibrating the predicted probabilities.
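As a rough illustration of what such a recalibration can look like, the sketch below applies a standard prior-shift correction: it rescales the odds by the ratio of the original class odds to the resampled class odds. This is an assumption about what is meant here, not necessarily the exact method referred to above. The function name `recalibrate` and its parameters are hypothetical, and the sketch assumes the model outputs are proper probabilities strictly between 0 and 1 and that resampling changed only the class proportions.

```python
def recalibrate(p_resampled, pos_share_original, pos_share_resampled):
    """Map a probability from a model trained on resampled data back to
    the original class balance (prior-shift correction on the odds)."""
    # Odds predicted by the model trained on the resampled data
    odds = p_resampled / (1.0 - p_resampled)
    # Ratio of original class odds to resampled class odds
    correction = (pos_share_original / (1.0 - pos_share_original)) / (
        pos_share_resampled / (1.0 - pos_share_resampled)
    )
    corrected_odds = odds * correction
    return corrected_odds / (1.0 + corrected_odds)


# Illustrative numbers: ~3% positives originally; doubling the dataset with
# synthetic positives gives roughly (0.03 + 1.0) / 2 = 0.515 positives.
p = recalibrate(0.5, pos_share_original=0.03, pos_share_resampled=0.515)
```

Note that this adjusts predictions only; it does not change the fitted coefficients or their standard errors.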

Alternatively, you can fit a weighted regression:

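import numpy as np
from sklearn.linear_model import LinearRegression

# original_data_flag: boolean array, assumed True for original rows and False for SMOTE rows.
# Each group is weighted by the inverse of its share of the rows.
weights = np.where(original_data_flag, 1/np.mean(original_data_flag), 1/np.mean(~original_data_flag))

lm = LinearRegression()
lm.fit(x, y, sample_weight=weights)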
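
Since scikit-learn's LinearRegression exposes the coefficients but not their standard errors, here is a minimal NumPy sketch of the same weighted fit that also returns classical weighted-least-squares standard errors. The helper name `wls_with_standard_errors` and the toy data are hypothetical, made up for illustration. One caveat: the degrees of freedom here still count every row, synthetic ones included, so the weighting alone does not shrink the effective sample size back to that of the original data.

```python
import numpy as np


def wls_with_standard_errors(X, y, weights):
    """Weighted least squares: returns (coefficients, standard errors).
    The first entry of each array corresponds to the intercept."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    XtW = X.T * weights                        # X' W, shape (p, n)
    XtWX_inv = np.linalg.inv(XtW @ X)          # (X' W X)^-1
    beta = XtWX_inv @ (XtW @ y)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]              # rows minus parameters
    sigma2 = np.sum(weights * resid**2) / dof  # weighted residual variance
    se = np.sqrt(np.diag(sigma2 * XtWX_inv))
    return beta, se


# Toy data: the first half of the rows plays the role of the original data,
# the second half stands in for SMOTE-generated rows.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=n)
original_data_flag = np.arange(n) < n // 2

weights = np.where(original_data_flag,
                   1 / np.mean(original_data_flag),
                   1 / np.mean(~original_data_flag))
beta, se = wls_with_standard_errors(x, y, weights)
```

Multiplying all weights by a constant leaves both the coefficients and these standard errors unchanged; only the relative weights between original and synthetic rows matter.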