titanic = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")
titanic_train_labels = titanic['Survived'].copy()
titanic = titanic.drop(columns = 'Survived')

# Pipeline

titanic_num = ['Age', 'Fare']
titanic_cat = ['Sex', 'Embarked']

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy='median')),
        ("std_scaler", StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ("enc", OneHotEncoder(drop='if_binary'))
    ])

def full_pipeline(num_attribs, cat_attribs):
    return ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])

titanic_prepared = full_pipeline(titanic_num, titanic_cat)
titanic_clean = titanic_prepared.fit_transform(titanic)

# Here, I'm preparing the test data via the same pipeline

titanic_test_num = titanic_num
titanic_test_cat = titanic_cat
titanic_test_prepared = full_pipeline(titanic_test_num, titanic_test_cat)
titanic_test_clean = titanic_test_prepared.fit_transform(titanic_test)
final_model.fit(titanic_clean, titanic_train_labels)

The line that raises the error mentioned in the title:

final_model.predict(titanic_test_clean)

Printing useful info that may give hints about the problem:

titanic_clean[0] -> array([-0.56573646, -0.50244517,  1.        ,  0.        ,  0.        ,
        1.        ,  0.        ]) # 7 items
titanic_test_clean[0] -> array([ 0.38623105, -0.49741333,  1.        ,  0.        ,  1.        ,
        0.        ]) # 6 items

From the info above, I assume the problem is the mismatched number of OneHotEncoder output columns. I suspected that the sets of categorical values differed between the train and test sets, but they are actually the same.
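The mismatch can be reproduced in isolation: fitting a fresh OneHotEncoder on each split encodes only what that split contains, and the encoder treats NaN as its own category, so a missing value in one split changes the column count. A minimal sketch with toy data (not the actual CSVs):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy "train" split: Embarked contains a missing value,
# which OneHotEncoder encodes as an extra category.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan]})
test = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Fitting a separate encoder on each split yields different widths.
enc_train = OneHotEncoder().fit(train)
enc_test = OneHotEncoder().fit(test)
print(enc_train.transform(train).shape[1])  # 4 columns (S, C, Q, NaN)
print(enc_test.transform(test).shape[1])    # 3 columns (S, C, Q)
```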

The link to the dataset -> https://github.com/minsuk-heo/kaggle-titanic/blob/master/input/test.csv

1 answer below.

DataJanitor (accepted answer):

The error you're seeing is indeed caused by OneHotEncoder.

However, there is a more fundamental problem: it is not good practice to wrap your pipeline in a function that builds a fresh ColumnTransformer on every call. Usually we assign the pipeline to a variable, call fit_transform on the training data, and then call only transform (not fit_transform) on the test data:

# Define the pipelines for numerical and categorical attributes
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("enc", OneHotEncoder(drop='if_binary'))
])

# Combine pipelines in a ColumnTransformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, titanic_num),
    ("cat", cat_pipeline, titanic_cat)
])

# Fit and transform the training data
titanic_clean = full_pipeline.fit_transform(titanic)

# Transform the test data using the same transformations
titanic_test_clean = full_pipeline.transform(titanic_test)

# Model fitting and prediction
final_model.fit(titanic_clean, titanic_train_labels)
predictions = final_model.predict(titanic_test_clean)

This approach ensures that the same transformations are applied to both datasets, thereby maintaining a consistent feature set. The OneHotEncoder inside the ColumnTransformer will learn the categories from the training data and apply the same encoding to the test data, resolving the feature mismatch issue.
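The fit-once, transform-everywhere pattern can be verified on a small hypothetical stand-in for the Titanic columns: the encoder fitted on the training split produces the same number of output columns for both splits.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical small splits standing in for the Titanic data.
train = pd.DataFrame({"Sex": ["male", "female", "male"],
                      "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["S", "S"]})

enc = OneHotEncoder(drop="if_binary")
X_train = enc.fit_transform(train)  # learn categories from train only
X_test = enc.transform(test)        # reuse the same categories

# Same width for both: 1 column (Sex, binary dropped) + 3 (Embarked) = 4
print(X_train.shape[1], X_test.shape[1])
```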