I implemented a regression model using:
import statsmodels.formula.api as smf
# Adjacent string literals are concatenated, so the formula can span several lines
formula = (
    "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
    "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
    "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"
)
model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()
After fitting the model, I ran a Bonferroni correction on its p-values using:
import statsmodels.stats.multitest as smt
smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni',
                  is_sorted=False, returnsorted=False)
And I get the following result:
(array([ True, False, True, True, True, True, True, False, True,
True, True, False, True, True, True, True, False, False,
False, False, True, False, True, True, True, True, True,
True, True, False, True, True, False, True, True, False,
True, True, True, True, True, True, True, True, False,
True, True, True, False, False, False, False, False, False,
True, True, True, True, True, False, True, False, True,
False, True, True, True, True]),
array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
0.0007540287301109894,
0.0007352941176470588)
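For reference, multipletests returns a 4-tuple: the boolean reject mask, the corrected p-values, the Šidák-corrected alpha, and the Bonferroni-corrected alpha. The last value above, 0.000735..., is 0.05 / 68, so the mask has 68 entries, one per fitted parameter. A minimal sketch of unpacking the tuple, using made-up p-values in place of model_a.pvalues (the variable names are my own, not from the post):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values standing in for model_a.pvalues
pvalues = np.array([0.0001, 0.02, 0.5, 0.004])

# Unpack the four return values instead of indexing into the tuple
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
    pvalues, alpha=0.05, method="bonferroni"
)

print(reject)           # boolean mask, one entry per test
print(pvals_corrected)  # raw p-values times the number of tests, capped at 1
print(alphac_bonf)      # alpha divided by the number of tests
```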
I want to use these arrays to drop the features in model_a that are flagged False and create a new DataFrame, 'train_simplified', for a simplified model.
I'm currently doing this manually, as shown below, but I'd like to know if there's a more efficient way to do it.
train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38,
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
You could use pandas loc to select only the features in model_a that are True.
[Output from train_simplified]