I'm using Stanza's lemmatizer because it works well with Spanish texts, but it returns a whole dictionary with the ID and other attributes I don't care about for the time being. I checked the "processors" in the pipeline, but I can't find an example where I get just the sentence with the lemmatized text instead of the dictionary.
This is what I have:
import stanza

stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=False)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
stNLP('me hubiera gustado mas “sincronia” con la primaria')
Output:
[
[
{
"id": 1,
"text": "me",
"lemma": "yo",
"upos": "PRON",
"xpos": "pp1cs000",
"feats": "Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs",
"start_char": 0,
"end_char": 2
},
....
Of course, when I lemmatize my document it returns a lot of text I don't need at the moment. How can I obtain just the lemma? I'm aware I could extract the lemma from the dictionary, but the pipeline already takes a long time as it is, and I want to avoid giving the function extra work.
Thank you in advance.
From what I've seen, Stanza's pipeline returns a Document object that only prints like a nested list of dictionaries. The document holds a list of sentences, each sentence holds a list of words, and each word exposes attributes such as id, text, lemma, and so on.
It is easy to extract the lemmas by navigating this nested structure. Here's how I've done it.
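A minimal sketch of that navigation, assuming the pipeline from the question. The helper names (lemmas_per_sentence, lemmatized_text, demo) are my own and not part of Stanza; the only Stanza pieces used are doc.sentences, sentence.words and word.lemma. The lemmas are already computed by the time the pipeline returns, so this step is only attribute access.

```python
def lemmas_per_sentence(doc):
    """Return one list of lemmas per sentence of a Stanza Document."""
    return [[word.lemma for word in sentence.words] for sentence in doc.sentences]


def lemmatized_text(doc):
    """Join the lemmas back into plain text, one sentence per line."""
    return "\n".join(" ".join(lemmas) for lemmas in lemmas_per_sentence(doc))


def demo():
    """Run the helpers on the sentence from the question (needs the 'es' models)."""
    import stanza  # imported here so the helpers above work without Stanza installed

    # Same pipeline settings as in the question.
    nlp = stanza.Pipeline(lang='es', processors='tokenize,mwt,pos,lemma', use_gpu=True)
    doc = nlp('me hubiera gustado mas “sincronia” con la primaria')
    print(lemmatized_text(doc))
```

Calling demo() prints only the lemmas, separated by spaces. If you need the lemmas grouped by sentence rather than as text, use lemmas_per_sentence(doc) directly.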
Note: At the time of writing, I was using stanza==1.7.0.