I'm new to machine learning and programming in general, so please take it easy on me.

I'm working on the Titanic competition on Kaggle and tried using make_column_transformer() and make_pipeline() after train_test_split(). My code is below.

    preprocess = make_column_transformer(
        (StandardScaler(), ['Age','Fare']),
        (OneHotEncoder(), ['Pclass', 'SibSp','Parch', 'Family_size','Sex', 'Embarked', 'Initial', 'Fare_cat']))


    model = make_pipeline(preprocess, LogisticRegression())
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

And this works just fine. However, when I tried it on the test dataset for the submission

    y_train_submit = train_data.Survived.values
    X_train_submit = train_data[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked', 'Initial','Family_size', 'Fare_cat']]
    X_test_submit = test_data[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked', 'Initial','Family_size', 'Fare_cat']]

    model.fit(X_train_submit, y_train_submit)

    predictions_submit = model.predict(X_test_submit)

it gives this error on the predictions_submit line:

    ValueError: Found unknown categories [9] in column 2 during transform

After some experiments, I figured that the error comes from OneHotEncoder(). The columns and data types are all exactly the same, I did the exact same thing to both DataFrames, so why doesn't it work with the test dataset and how should I apply OneHotEncoder in this case?

There are 2 best solutions below

You should add the function make_column_transformer here; people cannot know what it does if you do not share it.

My feeling is that you made two OneHotEncoder objects, one for the training set and one for the test set, which is wrong. You should fit a single encoder on the training set and use it to transform both the training and test sets. You should also add the handle_unknown argument, in case the test set contains a value that does not appear in the training set.

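    from sklearn.preprocessing import OneHotEncoder

    # Fit one encoder on the training data only; unseen categories are ignored at transform time
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(X_train_submit)

    # Reuse the same fitted encoder for both sets
    X_train_submit = enc.transform(X_train_submit)
    X_test_submit = enc.transform(X_test_submit)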
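Since the question already uses a pipeline, the same idea can probably be applied there directly by passing handle_unknown='ignore' to the encoder inside make_column_transformer. The following is only a minimal sketch: the tiny train_df and test_df frames and their values are made up for illustration (they are not the real Titanic data), and only a few of the original columns are included.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-ins for the Titanic frames (values are invented).
train_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Parch": [0, 0, 1, 2, 0, 1],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})
y_train = [0, 1, 1, 1, 0, 0]

test_df = pd.DataFrame({
    "Age": [30.0, 40.0],
    "Fare": [8.05, 69.55],
    "Parch": [0, 9],  # 9 never appears in train_df
    "Sex": ["male", "female"],
})

# Same structure as the pipeline in the question, but the encoder
# is told to ignore categories it did not see during fit.
preprocess = make_column_transformer(
    (StandardScaler(), ["Age", "Fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Parch", "Sex"]),
)
model = make_pipeline(preprocess, LogisticRegression())

model.fit(train_df, y_train)
predictions = model.predict(test_df)  # no ValueError despite Parch == 9
```

Because the encoder lives inside the pipeline, it is fitted once in model.fit() and reused in model.predict(), so there is no need to call fit or transform on the encoder by hand.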


The error is telling you that one of your categorical features contains a category the encoder has never seen. When you fit the OneHotEncoder, it stored the unique values in each of those columns and created a dummy column for each one. It never saw a 9 in one of those columns in the training (or testing) data, but a 9 is present in the submission data. The column index in the message counts only the columns passed to the OneHotEncoder, so column 2 should correspond to 'Parch' in your list.
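To see which values are causing this, one option is to compare the unique values of each categorical column between the two frames. This is a rough sketch: unseen_categories is a hypothetical helper written for this answer, not part of pandas or scikit-learn, and the two small frames are invented examples.

```python
import pandas as pd


def unseen_categories(train_df, test_df, columns):
    """Return {column: sorted values found in test_df but not in train_df}."""
    unseen = {}
    for col in columns:
        train_values = set(train_df[col].dropna().unique().tolist())
        test_values = set(test_df[col].dropna().unique().tolist())
        extra = test_values - train_values
        if extra:
            unseen[col] = sorted(extra)
    return unseen


# Invented example data: Parch == 9 only occurs in the second frame.
train_df = pd.DataFrame({"Parch": [0, 1, 2], "Sex": ["male", "female", "male"]})
test_df = pd.DataFrame({"Parch": [0, 9, 1], "Sex": ["female", "male", "male"]})

result = unseen_categories(train_df, test_df, ["Parch", "Sex"])
print(result)
```

Running something like this on the real train_data and test_data, with the list of columns given to the OneHotEncoder, should show every column that has categories present only in the submission data.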

You can set handle_unknown='ignore' in the encoder to silently ignore unseen levels, encoding them as all-zeros.

handle_unknown : {‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
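
The behaviour described in that parameter description can be checked on a toy column. This is only an illustration with made-up values (0, 1 and 2 seen during fit, 9 unseen), not the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([[0], [1], [2]]))  # categories seen during fit: 0, 1, 2

# Transform one known value (1) and one unseen value (9).
encoded = enc.transform(np.array([[1], [9]])).toarray()
print(encoded)  # the row for 9 has no dummy column set

# Going back, the unseen value cannot be recovered.
decoded = enc.inverse_transform(encoded)
print(decoded)
```

The row for the unseen 9 comes out as all zeros, and inverse_transform reports it as None, which matches the description above.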