How to correctly use model explainer with unseen data?

I trained my classifier using a pipeline:

param_tuning = {
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7, 10],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.5, 0.7],
    'classifier__n_estimators': [100, 200, 500],
}

cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler())
    ]
)

preprocessor = FeatureUnion(
    transformer_list=[

        ('cat', cat_pipe),
        ('num', num_pipe)
    ]
)

xgb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier())
    ]
)

grid = GridSearchCV(xgb_pipe, param_tuning, cv=5, n_jobs=-1, scoring='accuracy')

xgb_model = grid.fit(X_train, y_train)

The training data contain categorical features, so after preprocessing the training data have shape (x, 100). I then try to explain the model's prediction on unseen data. Because I pass a single unseen example, the preprocessor produces an array of shape (1, 15), since a single observation does not contain every level of every categorical feature.

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns = xgb['classifier'].get_booster().feature_names)

And I got:

ValueError: Shape of passed values is (1, 15), indices imply (1, 100).

This happens because the model was trained on the whole preprocessed dataset, with shape (x, 100), but I pass the explainer a single observation with shape (1, 15). How do I correctly pass a single unseen observation to the explainer?

Best answer:

We never use .fit_transform() on unseen data; the correct way is to use the .transform() method of the preprocessor already fitted on your training data (here xgb['preprocessor']). Calling .fit_transform() re-learns the encoding from the new data alone, so a single row yields only the categories present in that row. Using .transform() ensures that the transformed unseen data have the same features as the transformed training data, and so they are compatible with the model built on the latter.

So, you should replace .fit_transform(df) here:

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns = xgb['classifier'].get_booster().feature_names)

with .transform(df):

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].transform(df), columns = xgb['classifier'].get_booster().feature_names)
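To see why the column count changes, here is a minimal sketch using a toy one-hot encoder written in plain Python. ToyOneHotEncoder, the `color`/`size` features, and the sample rows are all hypothetical and only mimic the fit/transform contract; this is not the category_encoders or scikit-learn implementation.

```python
class ToyOneHotEncoder:
    """Hypothetical stand-in for a one-hot encoder (illustration only)."""

    def fit(self, rows):
        # Learn one output column per (feature, value) pair seen in `rows`.
        self.columns_ = sorted({(k, v) for row in rows for k, v in row.items()})
        return self

    def transform(self, rows):
        # Encode against the columns learned in fit(); the width stays fixed.
        return [[1 if row.get(k) == v else 0 for k, v in self.columns_]
                for row in rows]

    def fit_transform(self, rows):
        # Re-learns the columns from `rows` before encoding them.
        return self.fit(rows).transform(rows)


train = [
    {"color": "red", "size": "S"},
    {"color": "green", "size": "M"},
    {"color": "blue", "size": "L"},
]
unseen = [{"color": "green", "size": "L"}]

encoder = ToyOneHotEncoder().fit(train)

# Correct: reuse the fitted encoder, so the width matches the training data.
good = encoder.transform(unseen)

# Wrong: fitting on the single row learns only the categories in that row.
# A fresh encoder is used here so the fitted one is not overwritten.
bad = ToyOneHotEncoder().fit_transform(unseen)

print(len(encoder.columns_), len(good[0]), len(bad[0]))
```

The same reasoning applies to the real preprocessor: calling .fit_transform() on it would also overwrite what it learned from the training data. Separately, if `xgb` in the snippets above is not already the fitted pipeline, note that with GridSearchCV's default refit=True the refitted pipeline is available as xgb_model.best_estimator_.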