I have a data frame with sentences and the respective part-of-speech tag for each word (below is an extract of the data I'm working with, taken from the SNLI corpus). For each sentence in my collection, I would like to extract the unigrams and the corresponding POS tag of each word.

For instance, if I have the following:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')

doc = {'sent' : ['Two women are embracing while holding to go packages .'], 'tags' : ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

Then I would get the following unigram output:

array(['embracing', 'holding', 'packages', 'women'], dtype=object)

But I don't know how to retain the part-of-speech tags after this. I tried looking the unigrams up in the sentence, but the vectorizer's tokens may differ from the words you get by splitting the sentence (for instance with sentence.split(' ')), so you don't necessarily get the same tokens. Any suggestions on how I can extract unigrams and retain the corresponding part-of-speech tags?

Kyle F. Hartzenberg On

After reviewing the source code of the sklearn CountVectorizer class, particularly the fit function, I don't believe the class has any way of tracking the positions of the original document tokens relative to the extracted unigram features, and the unigram features do not necessarily match the tokens in the original text. Other than the simple solution below, you might have to rely on some other method or library to achieve your desired result. If there is a particular case that fails, I'd suggest adding it to your question, as it might help people come up with solutions.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')

doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

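# Split on whitespace so tokens and tags line up one-to-one
sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()
sentence_tags = []

for unigram in sentence_unigrams:
    for token, tag in zip(sent_token_list, tags_token_list):
        # CountVectorizer lowercases by default, so compare in lowercase
        if token.lower() == unigram:
            sentence_tags.append(tag)
            break  # first match only, so a repeated word adds a single tag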

print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']
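
If the lookup fails because the vectorizer's tokens differ from the whitespace-split words (lowercasing, stripped punctuation, hyphenated words split in two), one possible workaround is to go the other way: walk the whitespace tokens together with their tags and apply the vectorizer's token rule to each one. The sketch below is only an illustration under some assumptions. It re-implements the default CountVectorizer behaviour (lowercasing plus the default token pattern) in plain Python instead of calling sklearn, tag_unigrams is a hypothetical helper name, and demo_stop_words is a small hand-written stand-in for the 'english' stop-word list.

```python
import re

# Same pattern as CountVectorizer's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tag_unigrams(sentence, tags, stop_words=frozenset()):
    """Map each unigram to the tag(s) of the whitespace token(s) it came from."""
    tokens = sentence.split()
    tag_list = tags.split()
    if len(tokens) != len(tag_list):
        raise ValueError("sentence and tags must have the same number of tokens")

    unigram_tags = {}
    for token, tag in zip(tokens, tag_list):
        # A single whitespace token can yield zero pieces ('.') or several ('to-go').
        for piece in TOKEN_PATTERN.findall(token.lower()):
            if piece in stop_words:
                continue
            seen = unigram_tags.setdefault(piece, [])
            if tag not in seen:
                seen.append(tag)

    # Sort by unigram to mirror the order of get_feature_names_out().
    return dict(sorted(unigram_tags.items()))


# Hypothetical stand-in for the 'english' stop-word list, just for this sentence.
demo_stop_words = {"two", "are", "while", "to", "go"}

result = tag_unigrams(
    "Two women are embracing while holding to go packages .",
    "NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT",
    demo_stop_words,
)
print(result)
```

Each unigram maps to a list because the same word can occur more than once in a sentence with different tags. If you change token_pattern, lowercase, or stop_words on your vectorizer, the helper would need the same changes to stay in sync.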