I am going through the "Sample pipeline for text feature extraction and evaluation" example from the scikit-learn
documentation. There, they show the following pipeline:
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline(
        [
            ("vect", CountVectorizer()),    # raw text -> token counts
            ("tfidf", TfidfTransformer()),  # token counts -> tf-idf weights
            ("clf", SGDClassifier()),       # linear classifier on tf-idf features
        ]
    )
which they later proceed to use with GridSearchCV. In the example they fit the CountVectorizer on the training dataset and then extract the features. What I am looking to do is to fit the CountVectorizer on a bigger corpus and then apply it to the training data to obtain the feature vectors. Is there a straightforward way of doing so while maintaining the sklearn.pipeline.Pipeline API, i.e., without subclassing sklearn.pipeline.Pipeline and significantly changing its methods?
I want to maintain the sklearn.pipeline.Pipeline API because I am looking to make use of GridSearchCV, and having it structured in this manner will be quite convenient and clean.
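Note: if you have a fixed list of keywords, you can pass it directly as the vocabulary (the vocabulary parameter of CountVectorizer). If instead you want to learn the vocabulary from a bigger corpus, possibly with some feature selection, fit a CountVectorizer on that corpus first and then use the resulting vocabulary on your training dataset.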
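A minimal sketch of that idea, assuming the bigger corpus fits in memory. The names big_corpus, train_texts and train_labels and their contents are made up for illustration. The vocabulary is learned once on the bigger corpus and handed to the CountVectorizer inside the pipeline, so fitting the pipeline does not relearn it:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical data: a larger unlabeled corpus and a small labeled training set.
big_corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "investors buy shares and bonds",
]
train_texts = ["the cat chased the dog", "markets and shares fell"]
train_labels = [0, 1]

# Learn the term -> column index mapping on the bigger corpus only.
vocab = CountVectorizer().fit(big_corpus).vocabulary_

pipeline = Pipeline(
    [
        # A fixed vocabulary means fit() keeps this mapping instead of learning one.
        ("vect", CountVectorizer(vocabulary=vocab)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]
)
pipeline.fit(train_texts, train_labels)
```

Words in the training data that are not in the vocabulary (here "chased" and "dog") are simply ignored. Note that only the vocabulary comes from the bigger corpus in this sketch; the idf weights in TfidfTransformer are still estimated from the training data.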
The documentation shows how to use GridSearchCV with a Pipeline: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
Set the parameters according to your needs and pass them to GridSearchCV.
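As a rough illustration, a pipeline whose CountVectorizer uses a fixed vocabulary can be passed to GridSearchCV like any other. The data and the parameter grid below are made up for the example and are not taken from the documentation page:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical bigger corpus used only to learn the vocabulary.
big_corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "investors buy shares and bonds",
]
# Hypothetical labeled training data (0 = animals, 1 = finance).
train_texts = [
    "the cat sat in the park",
    "dogs chase the cat",
    "cats sat on the mat",
    "dogs in the park",
    "stock markets fell",
    "investors buy bonds",
    "shares fell sharply today",
    "investors buy stock and shares",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

vocab = CountVectorizer().fit(big_corpus).vocabulary_

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(vocabulary=vocab)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]
)

# Step parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(train_texts, train_labels)
print(search.best_params_)
```

With a fixed vocabulary, CountVectorizer parameters that control which terms are kept (min_df, max_df, max_features) are ignored, so there is little point in putting them in the grid.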