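import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")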
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns = 'Survived')
# Pipeline
titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])
def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)
# Here, I'm preparing the test data via the same pipeline
titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)
The code that raises the error in the title:
final_model.predict(titanic_test_clean)
Printing useful info that may give hints about the problem:
titanic_clean[0] -> array([-0.56573646, -0.50244517, 1., 0., 0., 1., 0.]) # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333, 1., 0., 1., 0.]) # 6 items
From the output above, I assume the problem is a mismatch in the number of columns produced by OneHotEncoder. I suspected that the number of categorical values differed between the train and test sets, but they are actually the same.
the link to the dataset -> https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv
The error you're seeing is indeed caused by OneHotEncoder. However, I want to point out a more crucial point: it is not good practice to wrap your pipeline in a function and build a new one for each dataset. Usually we assign the pipeline to a variable, call fit_transform on the training data, and then call transform (not fit_transform) on the test data. This approach ensures that the same transformations are applied to both datasets, thereby maintaining a consistent feature set. The OneHotEncoder inside the ColumnTransformer will learn the categories from the training data and apply the same encoding to the test data, resolving the feature mismatch issue.
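
A minimal sketch of that pattern, assuming small made-up frames in place of train.csv and test.csv. The values, the `train`/`test` frames, and the `preprocessor` name are illustrative assumptions, not taken from your data. A single ColumnTransformer is fitted on the training frame and then reused on the test frame; for comparison, a fresh copy is refitted on the test frame, which mirrors what your function does.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ["Age", "Fare"]
cat_attribs = ["Sex", "Embarked"]

# Made-up stand-ins for train.csv / test.csv (illustrative values only).
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Sex": pd.Series(["male", "female", "female", "female", "male"], dtype=object),
    # One missing value, assumed here to show how an extra category can appear.
    "Embarked": pd.Series(["S", "C", "Q", np.nan, "S"], dtype=object),
})
test = pd.DataFrame({
    "Age": [34.5, 47.0, np.nan],
    "Fare": [7.83, 7.00, 9.69],
    "Sex": pd.Series(["male", "female", "male"], dtype=object),
    "Embarked": pd.Series(["Q", "S", "C"], dtype=object),  # no missing values
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# One preprocessor object, assigned to a variable and reused.
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(drop="if_binary"), cat_attribs),
])

# Learn medians, scaling, and categories from the training data only.
X_train = preprocessor.fit_transform(train)
# Apply the already-fitted transformations to the test data (no refit).
X_test = preprocessor.transform(test)

# For comparison: an unfitted copy refitted on the test data learns its
# own categories, so its column count can differ from the training output.
X_test_refit = clone(preprocessor).fit_transform(test)

print("train:", X_train.shape)
print("test (transform):", X_test.shape)
print("test (refit):", X_test_refit.shape)
```

In this sketch the training Embarked column contains a missing value, which recent scikit-learn versions encode as a category of its own. That may be what produces the extra column in your training output, but I have not checked the actual CSV files, so treat it as a guess. Either way, a model fitted on X_train can then call predict on X_test, because both come from the same fitted preprocessor and have the same number of columns.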