I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to ensure that the ColumnTransformer can hstack correctly when putting together the resulting matrix.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                   ['b', 'How you been Tom', 'hot coffee', 2],
                   ['c', 'Hi you', 'I want some coffee', 3]],
                  columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('tf', tf_transformer)])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)
I get AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I've seen this question and suspect CountVectorizer() is the culprit, but I'm not sure how to solve it (the previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer, but unfortunately it is not supported at my company.
IMO, the first thing to consider is that CountVectorizer() requires 1D input. Your example does not work because the imputation returns a 2D numpy array, which means that you need some custom handling to make it work.

You should also consider that when a CountVectorizer() instance (which, again, requires 1D input) is used as a transformer in a ColumnTransformer(), its column has to be specified as a scalar string or int rather than as a list, because a list makes the ColumnTransformer pass a 2D array to the transformer. This is useful for interpreting the solution described below.
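As a minimal sketch of what such a solution could look like (not necessarily the exact original snippet), the text pipeline below imputes, wraps the result in a DataFrame, and then vectorizes each column separately through an inner ColumnTransformer that selects columns with scalar positional indices. The names to_frame, make_imputer and text_ct are hypothetical helpers introduced here for illustration. The sketch also assumes sparse_threshold=0 on the outer ColumnTransformer in place of sparse=False on OneHotEncoder, since that parameter was renamed in later scikit-learn releases.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

df = pd.DataFrame(
    [['a', 'Hi Tom', 'It is hot', 1],
     ['b', 'How you been Tom', 'hot coffee', 2],
     ['c', 'Hi you', 'I want some coffee', 3]],
    columns=['col_for_ohe', 'col_for_countvectorizer_1',
             'col_for_countvectorizer_2', 'num_col'])


def tf_text(X, vectorizer_tf):
    # X is 1D here because each column is selected with a scalar index below.
    return vectorizer_tf.fit_transform(X).toarray()


def to_frame(X):
    # Hypothetical helper: wrap the imputer's 2D array in a DataFrame.
    return pd.DataFrame(X)


def make_imputer():
    # Hypothetical helper: same imputer settings as in the question.
    return SimpleImputer(strategy='constant', fill_value='missing')


# One vectorizer per text column; 0 and 1 are scalar positional indices.
text_ct = ColumnTransformer(transformers=[
    ('tf1', FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()}), 0),
    ('tf2', FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()}), 1),
])

tf_transformer_pipe = Pipeline(steps=[
    ('imputer', make_imputer()),
    ('dt', FunctionTransformer(to_frame)),
    ('ct', text_ct),
])

ohe_transformer_pipe = Pipeline(steps=[
    ('imputer', make_imputer()),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# sparse_threshold=0 makes the stacked result a dense array (assumption, see above).
transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe,
     ['col_for_countvectorizer_1', 'col_for_countvectorizer_2']),
], remainder='passthrough', sparse_threshold=0)

transformed = transformer.fit_transform(df)
print(transformed)

# ColumnTransformer fits clones of its transformers, so the fitted inner
# transformer has to be fetched from the fitted outer one.
ct = transformer.named_transformers_['cat_tf'].named_steps['ct']
```

One caveat of this sketch: tf_text calls fit_transform every time it runs, so the vocabulary would be refitted on any new data passed to transform.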
Namely, I'm adding a transformer that simply converts the array returned by the SimpleImputer instance into a DataFrame. Then, and most importantly, since it does not seem possible to apply the vectorization to the 2D input that comes out of the previous two steps ('imputer' and 'dt'), I'm adding a further ColumnTransformer which splits the vectorization into two parallel steps (one vectorization per column). Notice that at this point the columns are referenced positionally, as the column names may have changed. Of course, this is a custom solution, but it may at least provide some hints.

Given that you don't actually have missing values, you can see that it actually works by comparing it with the output from:
print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)
and noticing that the columns from the fourth to the last but one of the previous output (namely those affected by the application of 'cat_tf') coincide with the terms in these two vocabularies.

Here are a couple of posts that focus on the usage of CountVectorizer in a ColumnTransformer instance, though they do not consider imputing the dataset beforehand.
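For the comparison described above, a reference can also be built without any pipeline: since the example data has no missing values, vectorizing each text column directly should give the same counts as the 'cat_tf' block. This is only a sketch of that check; the names text_df, blocks, vocabularies and reference are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text_df = pd.DataFrame({
    'col_for_countvectorizer_1': ['Hi Tom', 'How you been Tom', 'Hi you'],
    'col_for_countvectorizer_2': ['It is hot', 'hot coffee', 'I want some coffee'],
})

blocks = []
vocabularies = {}
for name in text_df.columns:
    # Each column is passed as a 1D Series, which is what CountVectorizer expects.
    vectorizer = CountVectorizer()
    blocks.append(vectorizer.fit_transform(text_df[name]).toarray())
    vocabularies[name] = sorted(vectorizer.vocabulary_)

# Side-by-side counts for both text columns, in column order.
reference = np.hstack(blocks)
print(vocabularies)
print(reference)
```

The reference array has one column per vocabulary term, which is the block expected between the one-hot columns and the passthrough num_col in the pipeline output.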