I am checking which words the spaCy Spanish lemmatizer works on, using the .has_vector attribute. My DataFrame has two columns: one with the output of a function that indicates which words can be lemmatized, and one with the corresponding phrase.
I would like to know how I can extract all the words with a False output, so that I can correct them and then lemmatize them.
So I created the function:
    def lemmatizer(text):
        doc = nlp(text)
        return ' '.join([str(word.has_vector) for word in doc])
And applied it to the reviews column of the DataFrame:
    df["Vectors"] = df.reviews.apply(lemmatizer)
And put the two columns in another DataFrame:
    df2 = pd.DataFrame(df[['Vectors', 'reviews']])
The output is
    index  Vectors                 reviews
    1      True True True False    'La pelicula es aburridora'
Two ways to do this:
If you want to use has_vector:
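A minimal sketch of the has_vector route, not necessarily the code that was originally posted here. The helper name words_without_vector and the model name es_core_news_md are assumptions; any spaCy pipeline that ships word vectors should behave the same way.

```python
def words_without_vector(doc):
    """Return the text of every token in `doc` that has no word vector.

    `doc` is expected to be a spaCy Doc, or any iterable of objects
    exposing `.text` and `.has_vector`.
    """
    return [token.text for token in doc if not token.has_vector]


def demo():
    # Assumption: the es_core_news_md pipeline is installed
    # (python -m spacy download es_core_news_md).
    import spacy

    nlp = spacy.load("es_core_news_md")
    print(words_without_vector(nlp("La pelicula es aburridora")))
```

Returning the words themselves, rather than a string of True/False values, gives you directly the list you want to correct.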
Alternatively you can use the is_oov attribute:
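A comparable sketch for the is_oov route. Again, the helper name oov_words and the model name es_core_news_md are assumptions made for illustration.

```python
def oov_words(doc):
    """Return the text of every token in `doc` flagged as out-of-vocabulary.

    `doc` is expected to be a spaCy Doc, or any iterable of objects
    exposing `.text` and `.is_oov`.
    """
    return [token.text for token in doc if token.is_oov]


def demo():
    # Assumption: the es_core_news_md pipeline is installed.
    import spacy

    nlp = spacy.load("es_core_news_md")
    print(oov_words(nlp("La pelicula es aburridora")))
```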
Then as you already did:
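A sketch of how the per-row extraction could be applied to the DataFrame from the question. The function names and the target column name "missing" are hypothetical; the "reviews" column comes from the question, and `nlp` is assumed to be an already loaded spaCy pipeline.

```python
def missing_words(text, nlp):
    """Tokenize `text` with `nlp` and return the words that have no vector."""
    return [token.text for token in nlp(text) if not token.has_vector]


def add_missing_column(df, nlp, source="reviews", target="missing"):
    """Add a column that lists, for each row, the words without a vector.

    `df` is expected to be a pandas DataFrame; the column is added in place
    and the same object is returned for convenience.
    """
    df[target] = df[source].apply(lambda text: missing_words(text, nlp))
    return df
```

With this in place, df = add_missing_column(df, nlp) adds the new column, and df[df["missing"].map(len) > 0] keeps only the rows that still contain words to correct.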
Which will return:
Note:
When working with either approach, keep in mind that the result is model dependent: smaller models usually do not ship the data these attributes rely on, so they always return a default value.
That means when you run the exact same code with, e.g., en_core_web_sm, the output changes. This is because has_vector has a default value of False and is not set by the model, and is_oov has a default value of True and is not set by the model either. So with such a model every word gets the same default answer, no matter whether it is actually known.
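One way to guard against this is to check whether the loaded pipeline carries a vector table at all before trusting either attribute. This is only a sketch: the helper name has_word_vectors is hypothetical, and it assumes the shape of nlp.vocab.vectors is (rows, dimensions), as spaCy documents it.

```python
def has_word_vectors(nlp):
    """Return True if the pipeline's vocabulary has a non-empty vector table."""
    n_rows, width = nlp.vocab.vectors.shape
    return n_rows > 0 and width > 0


def demo():
    # Assumption: both Spanish pipelines below are installed.
    import spacy

    for name in ("es_core_news_sm", "es_core_news_md"):
        print(name, has_word_vectors(spacy.load(name)))
```

If the check fails, switching to a medium or large model is probably the simplest fix.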