How to speed up the Stanza lemmatizer by excluding redundant words

Given:

I have a small sample document with a limited number of words:

d ='''
I go to school by the school bus everyday with all of my best friends. 
There are several students who also take the buses to school. Buses are quite cheap in my city.
The city which I live in has an enormous number of brilliant schools with smart students.
We have a nice math teacher in my school whose name is Jane Doe.
She also teaches several other topics in our school, including physics, chemistry and sometimes literature as a substitute teacher.
Other classes don't appreciate her efforts as much as my class. She must be nominated as the best school's teacher.
My school is located far from my apartment. This is why, I am taking the bus to school everyday.
'''

Goal:

My real-world documents are much larger (4000 ~ 8000 words), so I would like to speed up the Stanza lemmatizer, probably by skipping the lemmatization of repeated words, i.e., words that have already occurred earlier in the document. I do not want to apply set() to the result list to keep only unique lemmas; rather, I want to avoid lemmatizing a word again once it has already been lemmatized.

For instance, for the given sample raw document d, there are several redundant words which could be ignored in the process:

Word                 Lemma
--------------------------------------------------
school               school
school               school <<<<< Redundant
bus                  bus
everyday             everyday
friends              friend
students             student
buses                bus
school               school
Buses                bus <<<<< Redundant
cheap                cheap
city                 city
city                 city <<<<< Redundant
live                 live
enormous             enormous
number               number
brilliant            brilliant
schools              school
smart                smart
students             student
nice                 nice
math                 math
teacher              teacher
school               school <<<<< Redundant
Jane                 jane
Doe                  doe
teaches              teach
topics               topic
school               school <<<<< Redundant
including            include
physics              physics
chemistry            chemistry
literature           literature
substitute           substitute
teacher              teacher <<<<< Redundant
classes              class
appreciate           appreciate
efforts              effort
class                class
nominated            nominate
school               school <<<<< Redundant
teacher              teacher
school               school <<<<< Redundant
located              locate
apartment            apartment
bus                  bus
school               school <<<<< Redundant
everyday             everyday <<<<< Redundant
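
To make the goal concrete, here is a tiny, purely illustrative sketch of the caching behaviour I have in mind (the toy_pairs list and the variable names are made up, not Stanza output): the first occurrence of a surface form "pays" for lemmatization, every later occurrence reuses the cached lemma, and duplicates are still kept in the output.

toy_pairs = [('school', 'school'), ('school', 'school'), ('Buses', 'bus'), ('bus', 'bus')]
lemma_cache = {}   # lowercased surface form -> lemma
out = []
for text, lemma in toy_pairs:
    key = text.lower()
    if key not in lemma_cache:      # first occurrence: lemmatization work would happen here
        lemma_cache[key] = lemma.lower()
    out.append(lemma_cache[key])    # repeated occurrence: cache hit, no new work
print(out)                          # ['school', 'school', 'bus', 'bus']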

My [inefficient] solution:

import stanza
from stanza.pipeline.core import DownloadMethod
import nltk
nltk_modules = ['punkt',
                'averaged_perceptron_tagger',
                'stopwords',
                'wordnet',
                'omw-1.4',
               ]
nltk.download(nltk_modules, quiet=True, raise_on_error=True,)
STOPWORDS = nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids())

nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,pos', tokenize_no_ssplit=True, download_method=DownloadMethod.REUSE_RESOURCES)
doc = nlp(d)
%timeit -n 10000 [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS]
10.5 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

My [alternative] solution, a little faster but still not efficient for large documents (4000 ~ 8000 words):

def get_lm():
  words_list = list()    # surface forms seen so far (list membership test is O(n))
  lemmas_list = list()   # collected lemmas; duplicates are kept on purpose
  for vsnt in doc.sentences:
    for vw in vsnt.words:
      wlm = (vw.lemma or '').lower()   # guard against tokens without a lemma
      wtxt = vw.text.lower()
      if wtxt in words_list and wlm in lemmas_list:
        # word was already processed before: keep its lemma again
        lemmas_list.append(wlm)
      elif wtxt not in words_list and wlm and len(wlm) > 2 and wlm not in STOPWORDS:
        lemmas_list.append(wlm)
      words_list.append(wtxt)
  return lemmas_list
%timeit -n 10000 get_lm()
7.85 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
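
The only further micro-optimisations I see on my side stay in post-processing and do not touch the nlp(d) call, which is where the real time goes: using a dict as a seen-word cache (O(1) lookups instead of scanning Python lists) and turning STOPWORDS into a set. Something like the sketch below (untimed; the names STOPSET and get_lm_cached are just illustrative):

STOPSET = set(STOPWORDS)   # set membership is O(1) on average, list membership is O(n)

def get_lm_cached():
  lemma_cache = {}          # lowercased surface form -> lemma
  lemmas_list = []
  for sent in doc.sentences:
    for word in sent.words:
      wtxt = word.text.lower()
      if wtxt in lemma_cache:                      # word already handled: reuse its lemma
        lemmas_list.append(lemma_cache[wtxt])
        continue
      wlm = (word.lemma or '').lower()
      if len(wlm) > 2 and wlm not in STOPSET:
        lemma_cache[wtxt] = wlm
        lemmas_list.append(wlm)
  return lemmas_list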

My ideal result for this sample document, from either solution, should look like this; note that repeated lemmas are kept:

lm = [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS] # solution 1
# lm = get_lm() # solution 2
print(len(lm), lm)
47 ['school', 'school', 'bus', 'everyday', 'friend', 'student', 'bus', 'school', 'bus', 'cheap', 'city', 'city', 'live', 'enormous', 'number', 'brilliant', 'school', 'smart', 'student', 'nice', 'math', 'teacher', 'school', 'jane', 'doe', 'teach', 'topic', 'school', 'include', 'physics', 'chemistry', 'literature', 'substitute', 'teacher', 'class', 'appreciate', 'effort', 'class', 'nominate', 'school', 'teacher', 'school', 'locate', 'apartment', 'bus', 'school', 'everyday']

Is there a better or more efficient approach to this problem for large corpora or documents?

Cheers,
