How to make Stanza's lemmatizer return just the lemma instead of a dictionary?

154 Views Asked by trashparticle At 05 December 2023 at 23:30

I'm using Stanza's lemmatizer because it works well with Spanish texts, but it returns a whole dictionary with the ID and other attributes I don't care about for the time being. I checked the "processors" in the pipeline, but I can't find an example where I just get the sentence with the lemmatized text instead of the dictionary.

This is what I have:

import stanza

stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=False)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
stNLP('me hubiera gustado mas “sincronia” con la primaria')

Output:

[
  [
    {
      "id": 1,
      "text": "me",
      "lemma": "yo",
      "upos": "PRON",
      "xpos": "pp1cs000",
      "feats": "Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs",
      "start_char": 0,
      "end_char": 2
    },
....

Of course, when I lemmatize my whole document this returns a lot of text I don't need at the moment. How can I obtain just the lemma? I'm aware I could extract the lemma from the dictionary, but the pipeline already takes a long time as it is, and I want to avoid giving the function extra work.

Thank you in advance.


There is 1 best solution below

Gianluca Calò On 06 December 2023 at 00:09 BEST ANSWER

I'm not entirely sure yet, but from what I've seen, Stanza's pipeline returns a nested Document object rather than a plain dictionary, and the output you see is its printed representation. The document holds sentences, each sentence holds tokens, and each token holds one or more words that expose attributes such as id, text, lemma, and so on.

It is easy to extract the lemmas by navigating this nested structure. Here's how I've done it.

import stanza

stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=False)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
doc = stNLP('me hubiera gustado mas “sincronia” con la primaria')
# Collect the lemma of every word; a multi-word token can expand into several words.
lemmas = [word.lemma for t in doc.iter_tokens() for word in t.words]

Note: As of the time of writing, the version of Stanza being used is stanza==1.7.0
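If what you want is the sentence itself with each word replaced by its lemma, the same attribute access can be wrapped in a small helper. This is only a sketch: `lemmas_by_sentence` is a name I made up, it assumes the object passed in exposes `.sentences`, `.words`, `.text` and `.lemma` the way a Stanza document does, and the fallback to the surface form when a lemma is missing is my own choice rather than anything Stanza does for you.

```python
from typing import Any, List


def lemmas_by_sentence(doc: Any) -> List[str]:
    """Return one space-joined string of lemmas per sentence.

    `doc` is assumed to look like a Stanza document: it has `.sentences`,
    each sentence has `.words`, and each word has `.text` and `.lemma`.
    """
    lines = []
    for sentence in doc.sentences:
        # Assumption: if a word has no lemma, keep its original text instead.
        lemmas = [word.lemma or word.text for word in sentence.words]
        lines.append(" ".join(lemmas))
    return lines
```

You would call it as `lemmas_by_sentence(stNLP(text))`. Reading the attributes is plain Python attribute access on a result the pipeline has already computed, so it should add very little on top of the time the pipeline itself takes.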
