Retain original document element index of argument passed through sklearn's CountVectorizer() in order to access corresponding part of speech tag

58 Views Asked by OLGJ At 29 November 2022 at 08:39

I have a data frame with sentences and the respective part of speech tag for each word (Below is an extract of the data I'm working with (data taken from SNLI corpus). For each sentence in my collection I would like to extract unigrams and the corresponding pos-tag of that word.

For instance if I've the following:

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')

doc = {'sent' : ['Two women are embracing while holding to go packages .'], 'tags' : ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

Then I would get the following unigrams output:

array(['embracing', 'holding', 'packages', 'women'], dtype=object)

But I don't know how to retain the part of speech tag after this. I tried to do a lookup version with the unigrams, but as they may differ from the words in the sentence (if you for instance do sentence.split(' ')) you don't necessarily get the same tokens. Any suggestions of how I can extract unigrams and retain the corresponding part-of-speech tag?

Original Q&A

There are 1 best solutions below

Kyle F. Hartzenberg On 29 November 2022 at 12:25

After reviewing the source code for the sklearn CountVectorizer class, particularly the fit function, I don't believe the class has any way of tracking the original document element indexes relative to the extracted unigram features: where the unigram features do not necessarily have the same tokens. Other than the simple solution provided below, you might have to rely on some other method/library to achieve your desired results. If there is a particular case that fails, I'd suggest adding that to your question as it might help people generate solutions to your problem.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words = 'english')

doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()
sentence_tags = []

for unigram in sentence_unigrams:
    for i in range(len(sent_token_list)):
        if sent_token_list[i] == unigram:
            sentence_tags.append(tags_token_list[i])

print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']

Retain original document element index of argument passed through sklearn's CountVectorizer() in order to access corresponding part of speech tag

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in SCIKIT-LEARN

Related Questions in NLP

Related Questions in STANFORD-NLP

Related Questions in COUNTVECTORIZER

Trending Questions

Popular # Hahtags

Popular Questions