titanic = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns = 'Survived')

# Pipeline

titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy='median')),
        ("std_scaler", StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ("enc", OneHotEncoder(drop='if_binary'))
    ])

def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])

titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)

# Here, I'm preparing the test data via the same pipeline

titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)

The line that raises the error mentioned in the title:

final_model.predict(titanic_test_clean)

Printing useful info that may give hints about the problem:

titanic_clean[0] -> array([-0.56573646, -0.50244517,  1.        ,  0.        ,  0.        ,
        1.        ,  0.        ]) # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333,  1.        ,  0.        ,  1.        ,
        0.        ]) # 6 items

From the info above, I assume the problem is the mismatched number of OneHotEncoder output columns. I suspected that the sets of categorical values differed between the train and test sets, but they are actually the same.
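The mismatch can be reproduced in isolation: fitting a fresh OneHotEncoder on each split encodes only what that split contains, and the encoder treats NaN as its own category, so a missing value in one split changes the column count. A minimal sketch with toy data (not the actual CSVs):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy "train" split: Embarked contains a missing value,
# which OneHotEncoder encodes as an extra category.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan]})
test = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Fitting a separate encoder on each split yields different widths.
enc_train = OneHotEncoder().fit(train)
enc_test = OneHotEncoder().fit(test)
print(enc_train.transform(train).shape[1])  # 4 columns (S, C, Q, NaN)
print(enc_test.transform(test).shape[1])    # 3 columns (S, C, Q)
```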

The link to the dataset -> https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv

1 answer below.

DataJanitor (accepted answer):

The error you're seeing is indeed caused by OneHotEncoder.

However, there is a more fundamental problem: it is not good practice to wrap your pipeline in a function that builds a fresh ColumnTransformer on every call. Usually we assign the pipeline to a variable, call fit_transform on the training data, and then call only transform (not fit_transform) on the test data:

# Define the pipelines for numerical and categorical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])

# Combine pipelines in a ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, titanic_num),
    ("cat", cat_pipeline, titanic_cat)
])

# Fit and transform the training data
titanic_clean = full_pipeline.fit_transform(titanic)

# Transform the test data using the same transformations
titanic_test_clean = full_pipeline.transform(titanic_test)

# Model fitting and prediction
final_model.fit(titanic_clean, titanic_train_labels)
predictions = final_model.predict(titanic_test_clean)

This approach ensures that the same transformations are applied to both datasets, thereby maintaining a consistent feature set. The OneHotEncoder inside the ColumnTransformer will learn the categories from the training data and apply the same encoding to the test data, resolving the feature mismatch issue.
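The fit-once, transform-everywhere pattern can be verified on a small hypothetical stand-in for the Titanic columns: the encoder fitted on the training split produces the same number of output columns for both splits.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical small splits standing in for the Titanic data.
train = pd.DataFrame({"Sex": ["male", "female", "male"],
                      "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["S", "S"]})

enc = OneHotEncoder(drop="if_binary")
X_train = enc.fit_transform(train)  # learn categories from train only
X_test = enc.transform(test)        # reuse the same categories

# Same width for both: 1 column (Sex, binary dropped) + 3 (Embarked) = 4
print(X_train.shape[1], X_test.shape[1])
```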