I'm new to machine learning and programming in general, so please take it easy on me.
I'm doing the Titanic competition on Kaggle and tried using make_column_transformer() and make_pipeline() after splitting the data with train_test_split(). My code is below.
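from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale the numeric columns and one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ['Age', 'Fare']),
    (OneHotEncoder(), ['Pclass', 'SibSp', 'Parch', 'Family_size', 'Sex', 'Embarked', 'Initial', 'Fare_cat']))
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)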
And this works just fine. However, when I tried it on the test dataset for the submission
y_train_submit = train_data.Survived.values
X_train_submit = train_data[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked', 'Initial','Family_size', 'Fare_cat']]
X_test_submit = test_data[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked', 'Initial','Family_size', 'Fare_cat']]
model.fit(X_train_submit, y_train_submit)
predictions_submit = model.predict(X_test_submit)
it gives this error on the predictions_submit line.
ValueError: Found unknown categories [9] in column 2 during transform
After some experiments, I figured that the error comes from OneHotEncoder(). The columns and data types are all exactly the same, I did the exact same thing to both DataFrames, so why doesn't it work with the test dataset and how should I apply OneHotEncoder in this case?
You should add the make_column_transformer function here; people cannot know what it does if you do not share it.
My feeling is that you made two OneHotEncoder objects, one for the training set and one for the test set, which is wrong. You should fit the encoder on the training set only and use that same fitted encoder to transform both the training and test sets. You should also set the handle_unknown argument (for example, handle_unknown='ignore') in case the test set contains a value that does not appear in the training set.
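Here is a minimal sketch of the handle_unknown suggestion. The data is a made-up toy frame, not the real Titanic data, and it assumes the unseen value 9 sits in Parch, which is the column at index 2 of the list passed to OneHotEncoder in the question if that index is zero-based. With handle_unknown='ignore', a category that was not seen during fit is encoded as all zeros instead of raising the ValueError.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration: Parch == 9 appears only in the test frame.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Parch": [0, 1, 0, 2],
    "Survived": [0, 1, 1, 0],
})
test = pd.DataFrame({"Age": [30.0, 40.0], "Parch": [0, 9]})

# Show which test values the encoder will never have seen during fit.
unseen = set(test["Parch"]) - set(train["Parch"])
print("Unseen Parch values:", unseen)

preprocess = make_column_transformer(
    (StandardScaler(), ["Age"]),
    # Unseen categories become an all-zero block instead of raising an error.
    (OneHotEncoder(handle_unknown="ignore"), ["Parch"]),
)
model = make_pipeline(preprocess, LogisticRegression())

# Fit once on the training data, then reuse the same fitted pipeline on the test data.
model.fit(train[["Age", "Parch"]], train["Survived"])
predictions = model.predict(test[["Age", "Parch"]])
print(predictions)
```

The trade-off is that the model gets no information from an unseen value, so a row with Parch == 9 is treated as if it belonged to none of the known Parch categories. If that matters, another option is to group rare values into one bucket before encoding.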