How to remove features from a regression model using Bonferroni correction results?


I implemented a regression model using

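import statsmodels.formula.api as smf  # smf is assumed to be this module

# Adjacent string literals inside parentheses are joined into one formula
formula = (
    "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
    "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
    "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"
)

model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()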

After fitting the model, I ran a Bonferroni correction on its p-values using

import statsmodels.stats.multitest as smt  # smt is assumed to be this module

smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni',
                  is_sorted=False, returnsorted=False)

And I get the following result:

(array([ True, False,  True,  True,  True,  True,  True, False,  True,
     True,  True, False,  True,  True,  True,  True, False, False,
    False, False,  True, False,  True,  True,  True,  True,  True,
     True,  True, False,  True,  True, False,  True,  True, False,
     True,  True,  True,  True,  True,  True,  True,  True, False,
     True,  True,  True, False, False, False, False, False, False,
     True,  True,  True,  True,  True, False,  True, False,  True,
    False,  True,  True,  True,  True]),
 array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
    5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
    4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
    5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
    8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
    7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
    1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
    1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
    2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
    5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
    1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
    1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
    1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
    4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
    7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
 0.0007540287301109894,
 0.0007352941176470588)
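
For anyone reading this output: the tuple holds, in order, the reject flags, the corrected p-values, the Sidak-corrected alpha, and the Bonferroni-corrected alpha. The arithmetic behind it can be sketched with NumPy alone. This is a minimal illustration of the Bonferroni rule, not the statsmodels source; the function name `bonferroni` and the sample p-values are made up for the example.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Manual Bonferroni correction (illustrative sketch only)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size                            # number of tests
    alpha_bonf = alpha / m                    # per-test threshold
    alpha_sidak = 1 - (1 - alpha) ** (1 / m)  # Sidak threshold, for comparison
    corrected = np.minimum(pvals * m, 1.0)    # scaled p-values, capped at 1
    reject = pvals <= alpha_bonf              # True = still significant
    return reject, corrected, alpha_sidak, alpha_bonf

# Hypothetical raw p-values, for illustration only
raw = np.array([1e-6, 0.0004, 0.01, 0.5])
reject, corrected, alpha_sidak, alpha_bonf = bonferroni(raw)
print(reject, corrected, alpha_bonf)
```

With 68 tests and alpha=0.05 the per-test threshold is 0.05 / 68, which matches the last value in the output above.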

I want to use these arrays to remove the features in model_a that are flagged False and create a reduced data set, 'train_simplified', for a new model.

I'm using the following manual approach, but I want to know if there's a more efficient way to do it.

train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38, 
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)

Best answer

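You can pass the boolean array returned by multipletests to pandas .loc to keep only the columns flagged True.

.loc[] is primarily label based, but it also accepts a boolean array, which must have the same length as the axis it indexes. The array here has 68 entries, one per fitted coefficient in model_a, so the example below uses a stand-in frame with 68 columns.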

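import numpy as np
import pandas as pd

# Stand-in data: 5 rows, 68 columns (one per entry in the boolean array)
train = pd.DataFrame(np.random.rand(5, 68))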
          0         1         2         3  ...        63        64        65        66        67
0  0.637557  0.887213  0.472215  0.119594  ...  0.908266  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.672136  0.761620  0.237638  ...  0.649633  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.265202  0.243990  0.973011  ...  0.465598  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.822601  0.360191  0.127061  ...  0.070569  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.388170  0.643288  0.458253  ...  0.091206  0.494420  0.057559  0.549529  0.441531

[5 rows x 68 columns]
keep_columns = np.array([ # array from smt.multipletests
    True, False,  True,  True,  True,  True,  True, False,  True,
    True,  True, False,  True,  True,  True,  True, False, False,
    False, False,  True, False,  True,  True,  True,  True,  True,
    True,  True, False,  True,  True, False,  True,  True, False,
    True,  True,  True,  True,  True,  True,  True,  True, False,
    True,  True,  True, False, False, False, False, False, False,
    True,  True,  True,  True,  True, False,  True, False,  True,
    False,  True,  True,  True,  True])
np.sum(keep_columns) # 47 (keep 47 columns)

train_simplified = train.loc[:,keep_columns]

Output from train_simplified

          0         2         3         4  ...        62        64        65        66        67
0  0.637557  0.472215  0.119594  0.713245  ...  0.278646  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.761620  0.237638  0.728216  ...  0.746491  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.243990  0.973011  0.393098  ...  0.035942  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.360191  0.127061  0.522243  ...  0.162934  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.643288  0.458253  0.545617  ...  0.789618  0.494420  0.057559  0.549529  0.441531

[5 rows x 47 columns]
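
One caveat when applying this to the real data: with a formula that uses C(...), the flags line up with the fitted coefficients (an intercept plus one dummy per category level), not with the original columns of train. A possible way to get back to formula terms is sketched below. The coefficient names and flags are hypothetical, written in the style patsy generates; in practice they would come from model_a.pvalues.index and the first array returned by multipletests. Keeping a categorical term when any of its levels is significant is just one convention, assumed here for illustration.

```python
import re
import pandas as pd

# Hypothetical coefficient names; real ones would be model_a.pvalues.index
param_names = ["Intercept", "C(homeowner)[T.1]", "C(car_value)[T.b]",
               "C(car_value)[T.c]", "group_size", "car_age"]
# Hypothetical reject flags (first array returned by multipletests)
reject = pd.Series([True, False, False, True, True, False], index=param_names)

def term_of(name):
    """Strip a trailing '[...]' so each dummy maps back to its formula term."""
    return re.sub(r"\[.*\]$", "", name)

# Group coefficients by term; keep a term if any of its coefficients is True.
# The intercept is dropped because the formula adds it automatically.
by_term = reject.drop("Intercept").groupby(term_of).any()
kept_terms = by_term[by_term].index.tolist()

# Rebuild a reduced formula from the surviving terms
new_formula = "cost ~ " + " + ".join(kept_terms)
print(new_formula)
```

The reduced formula can then be passed to smf.ols with the original train frame, which avoids dropping columns by position.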