I'm wondering how best to define parameters for DataFrameMapper transforms in a pipeline using sklearn-pandas.
Here is a reproducible example notebook using titanic data.
I'm setting it up as:
# use sklearn-pandas to do some preprocessing
full_mapper = DataFrameMapper([
    ('Name', Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])),
    ('Ticket', Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])),
    ('Sex', LabelBinarizer()),
    (['Age', 'Fare'], None),  # I tried to use Impute() but got an error
])
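An aside on that Impute() error: sklearn's class is actually called Imputer (plain Impute() would raise a NameError), and Imputer expects 2-D input, which DataFrameMapper only passes for list selectors such as ['Age', 'Fare']. A minimal sketch that might work, assuming the pre-0.20 sklearn.preprocessing.Imputer:
from sklearn.preprocessing import Imputer

full_mapper = DataFrameMapper([
    ('Name', Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])),
    ('Ticket', Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])),
    ('Sex', LabelBinarizer()),
    # list selector -> DataFrameMapper passes a 2-D array, as Imputer expects
    (['Age', 'Fare'], Imputer(strategy='median')),
])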
I'd also like to cross-validate the params of the CountVectorizer() and TfidfTransformer() that I'm using on the 'Name' and 'Ticket' fields.
However, defining my pipeline as:
# build full pipeline
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])
And then my params as:
# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
}
I'm not sure how to include options for 'name_vect', 'name_tfidf', etc. in the above.
I could not really find an example similar to what I'm trying to do in the sklearn-pandas docs.
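As far as I can tell, the mapper stores its per-column transformers in a plain features list rather than as named steps, so (unlike with Pipeline or FeatureUnion) there is no 'mapper__Name__name_vect__analyzer' path to target. A quick way to check what is actually exposed:
# list the parameter paths the mapper exposes to grid search
# (behaviour may vary across sklearn-pandas versions)
print(sorted(full_mapper.get_params(deep=True).keys()))
# I'd expect only the mapper's own constructor params (e.g. 'features'),
# with no nested per-column paths like 'Name__name_vect__analyzer'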
Note: I'm just using the titanic data for reproducibility; really I'm just trying to get the plumbing working.
UPDATE (trying to adapt the approach from here).
If I do:
# make pipelines for individual variables
name_to_tfidf = Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])

# data frame mapper
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf),
    ('Ticket', ticket_to_tfidf),
    ('Sex', LabelBinarizer()),
    (['Age', 'Fare'], None),  # I tried to use Impute() but got an error
])
# build full pipeline
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])
# determine full param search space
from copy import deepcopy  # needed to copy the per-column pipelines

full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # now set the params for the datamapper part of the pipeline
    'mapper__features': [[
        ('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer='char_wb')),  # how can I set up a list to search over in here?
        ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer='char'))  # how can I set up a list to search over in here?
    ]]
}
# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

# do the fit
gs_clf.fit(df, df['Survived'])

print("Best score: %0.3f" % gs_clf.best_score_)
print("Best parameters set:")
best_parameters = gs_clf.best_estimator_.get_params()
for param_name in sorted(full_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Then I get:
> Best score: 0.746
> Best parameters set:
>     clf__alpha: 0.01
>     clf__loss: 'modified_huber'
>     clf__penalty: 'l1'
>     mapper__features: [('Name', Pipeline(steps=[('name_vect', CountVectorizer(analyzer='char_wb', ...)),
>                                                 ('name_tfidf', TfidfTransformer(...))])),
>                        ('Ticket', Pipeline(steps=[('ticket_vect', CountVectorizer(analyzer='char', ...)),
>                                                   ('ticket_tfidf', TfidfTransformer(...))]))]
>     (all other CountVectorizer/TfidfTransformer params at their defaults)
So it looks like I am able to set the params here. However, if I pass a list in like:
# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # now set the params for the datamapper part of the pipeline
    'mapper__features': [[
        ('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer=['char', 'char_wb'])),
        ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer=['char', 'char_wb']))
    ]]
}
I get an error such as:
C:\Users\Andrew\Miniconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self=CountVectorizer(analyzer=['char', 'char_wb'], bi...)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None))
265 return lambda doc: self._word_ngrams(
266 tokenize(preprocess(self.decode(doc))), stop_words)
267
268 else:
269 raise ValueError('%s is not a valid tokenization scheme/analyzer' %
--> 270 self.analyzer)
self.analyzer = ['char', 'char_wb']
271
272 def _validate_vocabulary(self):
273 vocabulary = self.vocabulary
274 if vocabulary is not None:
ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer
So I'm unsure how to set the params of the DataFrameMapper transformations to lists of options for the CV to search over: set_params simply assigns the list verbatim as the analyzer, and nothing in the grid search ever expands it. The closest I can get is to enumerate complete candidate feature lists at the grid level, as sketched below.
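Here is that workaround sketched out. Since a single, fully specified feature list in 'mapper__features' does work (as the run above shows), the alternatives have to live at the grid level, one complete candidate list per analyzer combination (make_features is just a hypothetical helper, not part of sklearn-pandas):
from copy import deepcopy
from itertools import product

# hypothetical helper: build one complete candidate feature list
# for a given pair of analyzers
def make_features(name_analyzer, ticket_analyzer):
    return [
        ('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer=name_analyzer)),
        ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer=ticket_analyzer)),
        ('Sex', LabelBinarizer()),
        (['Age', 'Fare'], None),
    ]

full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # one entry per analyzer combination; the grid search tries each whole list
    'mapper__features': [make_features(n, t)
                         for n, t in product(['char', 'char_wb'], repeat=2)],
}
GridSearchCV then swaps in each candidate feature list whole, so nothing inside set_params ever has to hold a list.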
Surely there must be a cleaner way, though. I agree that at this stage it might be better to go pandas > numpy > FeatureUnion...
That's just one of the drawbacks I also experienced with the sklearn-pandas package. However, I found that writing your own transformer classes gives you full control over what's happening in your pipelines and even in feature unions.
You can customize each sklearn transformer to select only certain pandas columns and, with some tweaks, even output the transformation as a pandas DataFrame.
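A minimal sketch of the idea (ColumnSelector is just an illustrative name, not from sklearn-pandas):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select one column (1-D) or a list of columns (2-D) from a DataFrame."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

union = FeatureUnion([
    ('name', Pipeline([('select', ColumnSelector('Name')),
                       ('vect', CountVectorizer()),
                       ('tfidf', TfidfTransformer())])),
    ('ticket', Pipeline([('select', ColumnSelector('Ticket')),
                         ('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer())])),
])
# every parameter now sits on a standard double-underscore path
# that GridSearchCV can reach, e.g.:
# {'union__name__vect__analyzer': ['char', 'char_wb']}
Because FeatureUnion and Pipeline both expose their steps by name, the grid search can reach every nested parameter directly.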
See my blog for a comprehensive tour: https://wkirgsn.github.io/2018/02/15/pandas-pipelines/