sklearn.pipeline.Pipeline: Fitting CountVectorizer in different corpus than training text


I am going through the "Sample pipeline for text feature extraction and evaluation" example from the scikit-learn documentation. It shows the following pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ]
)

which they later use with GridSearchCV. In the example, the CountVectorizer is fitted on the training dataset and then used to extract the features. What I am looking to do is fit the CountVectorizer on a bigger corpus and then apply it to the training data to obtain the feature vectors. Is there a straightforward way of doing so while maintaining the sklearn.pipeline.Pipeline API, i.e., without subclassing sklearn.pipeline.Pipeline and significantly changing its methods?

I want to maintain the sklearn.pipeline.Pipeline API as I am looking to make use of GridSearchCV and having it structured in this manner will be quite convenient and clean.


There is 1 best solution below

from sklearn.feature_extraction.text import CountVectorizer

# Suppose corpus is your big corpus.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# First fit the vectorizer on the big corpus to learn its vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
vocabulary = vectorizer.get_feature_names_out().tolist()

# Now vectorize the new dataset using the vocabulary learned above.
new_train_corpus = ["how are you doing", "I am fine", "I am reading first document"]
new_vect = CountVectorizer(vocabulary=vocabulary)  # reuse the vocabulary from the big corpus
new_vect.fit_transform(new_train_corpus)

# Words outside the fixed vocabulary are ignored; the vectorizer uses only this vocabulary.
print(new_vect.get_feature_names_out().tolist())
# Output:
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
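
To make the "new words are ignored" behaviour concrete, here is a small sketch that looks at the counts produced for the new sentences. The vocabulary list is typed out by hand here so the snippet runs on its own; in practice it would come from the vectorizer fitted on the big corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vocabulary as learned from the four-document example corpus (nine terms),
# written out by hand so this snippet is self-contained.
vocabulary = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

new_train_corpus = ["how are you doing", "I am fine", "I am reading first document"]
new_vect = CountVectorizer(vocabulary=vocabulary)
counts = new_vect.fit_transform(new_train_corpus).toarray()

print(counts.shape)        # (3, 9): one column per vocabulary term
print(counts[2].tolist())  # [0, 1, 1, 0, 0, 0, 0, 0, 0]: only 'document' and 'first' are counted
print(counts[:2].sum())    # 0: the first two sentences contain no in-vocabulary words
```

Sentences made up entirely of out-of-vocabulary words become all-zero rows, so it is worth checking how much of the training text the big-corpus vocabulary actually covers.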

Note: if you already have a fixed list of keywords, you can pass it directly as the vocabulary. Otherwise, fit a vectorizer on the bigger corpus first (optionally with some feature selection), and then use the resulting vocabulary on your training dataset.

The documentation shows how to use GridSearchCV with a Pipeline: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(vocabulary=vocabulary)),  # pass the big-corpus vocabulary here
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),  # or any other estimator of your choice
    ]
)

Set the parameter grid according to your needs and pass it to GridSearchCV:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline, parameters)