Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Question

Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'

1k Views Asked by DJL At 09 April 2022 at 06:18

I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack correctly when putting together the resulting matrix.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                    ['b', 'How you been Tom', 'hot coffee', 2],
                    ['c', 'Hi you', 'I want some coffee', 3]],
                   columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('tf', tf_transformer)])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)

I get AttributeError: 'numpy.ndarray' object has no attribute 'lower.' I've seen this question and suspect CountVectorizer() is the culprit but not sure how to solve it (previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer but unfortunately it is not supported in my company.

Original Q&A

There are 3 best solutions below

UNPLUGGED 最後の考え On 09 April 2022 at 06:20

I think you should really look back over your basics again. Your question tells me you don’t understand the function well enough to implement it effectively. Ask again when you’ve done enough research on your own to not embarrass yourself.

Aishwarya Patange On 09 April 2022 at 06:40

In CountVectorizer, pass lower_case=False.

**amiola** · Accepted Answer · 2022-04-09T18:03:02.733000

Imo, the first consideration to be done is that CountVectorizer() requires 1D input; your example is not working because the imputation is returning a 2D numpy array which means that you'll need to add a customized treatment to make it work.

Then you should also consider that when using a CountVectorizer() instance (which - again - requires 1D input) as transformer in a ColumnTransformer() that's how you should pass transformers' columns:

columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. [...]

This would be useful in interpreting the snippet I'll post as a possible solution.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable
from sklearn.base import BaseEstimator, TransformerMixin

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                ['b', 'How you been Tom', 'hot coffee', 2],
                ['c', 'Hi you', 'I want some coffee', 3]],
               columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

class DimTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, *_):
        return self
    def transform(self, X, *_):
        return pd.DataFrame(X)

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
             ('dt', DimTransformer()),
             ('ct', ColumnTransformer([
                 ('tf1', tf_transformer, 0), 
                 ('tf2', tf_transformer, 1)
             ]))    
])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)

Namely, I'm adding a transformer that simply transforms the array returned by the SimpleImputer instance in a DataFrame. Then - and most importantly - since it seems not possible to apply the vectorization on the 2D input that comes out of the previous two steps ('imputer' and 'dt') I'm adding a further ColumnTransformer which splits the vectorization in two parallel steps (a vectorization per column). Notice that at this point columns are referenced positionally as column names have possibly changed. Of course, that's a custom solution, but at least may provide some hints.

Given that you don't actually have missing values, you can see that it actually works by comparing it with the output from:

dt = DimTransformer().fit_transform(df)
ct = ColumnTransformer([
    ('tf1', tf_transformer, 1), 
    ('tf2', tf_transformer, 2)
])
ct.fit_transform(dt)

print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_) print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)

and noticing that columns from fourth to the last but one of the previous output (namely those affected by the application of 'cat_tf') do coincide with the ones just below.

Here are a couple of posts with focus on the usage of CountVectorizer in a ColumnTransformer instance, though they did not consider imputing the dataset beforehand.

Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'

There are 3 best solutions below

Related Questions in PYTHON

Related Questions in SCIKIT-LEARN

Related Questions in PIPELINE

Related Questions in SPARSE-MATRIX

Related Questions in COUNTVECTORIZER

Trending Questions

Popular # Hahtags

Popular Questions