Optimise nested loops to populate dictionary

I have a list of sentences, e.g. sentences = ["Mary likes Facebook", "Chris likes Whatsapp"]

I want to create a list of dictionaries containing the entities, and their types, extracted from all of these sentences. For example:

[
{'entity': 'Mary', 'type':'PERS'},
{'entity': 'Facebook', 'type':'ORG'},
{'entity': 'Chris', 'type':'PERS'},
{'entity': 'Whatsapp', 'type':'ORG'}
]

At the moment I'm using nested for loops to achieve this using Flair:

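# Assumes each item in `sentences` is a Flair Sentence that has already been tagged.
entity_list = []
for sent in sentences:
    for entity in sent.get_spans("ner"):
        entity_list.append(
            {
                "entity": entity.text,
                "type": entity.tag
            }
        )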

Is there a way to optimise the above and reduce the time complexity?

There are 2 best solutions below

I don't think you can avoid the nested loops or reduce the time complexity, but you could use multiprocessing to cut wall-clock time by parallelizing the NER tagging of the sentences:

from flair.data import Sentence
from flair.models import SequenceTagger
from multiprocessing import Pool

# Global variable for the tagger
tagger = None

def init_worker():
    """Initialize the worker."""
    global tagger
    tagger = SequenceTagger.load('ner')

def tag_sentence(sentence):
    """Perform NER tagging on a single sentence."""
    sentence = Sentence(sentence)
    tagger.predict(sentence)
    return [
        {"entity": entity.text, "type": entity.tag}
        for entity in sentence.get_spans("ner")
    ]

def main() -> None:
    sentences = ["Mary likes Facebook", "Chris likes Whatsapp"]
    with Pool(initializer=init_worker) as p:
        results = p.map(tag_sentence, sentences)
    entity_list = [item for sublist in results for item in sublist]
    print(entity_list)

if __name__ == "__main__":
    main()

There is no way to reduce the time complexity. With N sentences and M entities in each sentence there are N × M entities to visit, so the nested loop is necessary.

You could however "hide" the loops by using a list comprehension:

entity_list = [
    {'entity': entity.text, 'type': entity.tag}
    for sentence in sentences
    for entity in sentence.get_spans('ner')
]
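
To check that the comprehension is the same traversal as the explicit loops, here is a minimal sketch that runs both forms on stand-in objects rather than on tagged Flair sentences. FakeSpan and FakeSentence are hypothetical helpers invented for this illustration; they only mimic the text, tag and get_spans names used above, so the snippet runs without Flair or a model download.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a tagged span (not a Flair class)."""
    text: str
    tag: str


@dataclass
class FakeSentence:
    """Hypothetical stand-in for an already-tagged sentence."""
    spans: list = field(default_factory=list)

    def get_spans(self, label_type):
        # The real method filters by label type; this stub ignores it.
        return self.spans


sentences = [
    FakeSentence([FakeSpan("Mary", "PERS"), FakeSpan("Facebook", "ORG")]),
    FakeSentence([FakeSpan("Chris", "PERS"), FakeSpan("Whatsapp", "ORG")]),
]

# Explicit nested loops, as in the question.
loop_result = []
for sentence in sentences:
    for entity in sentence.get_spans("ner"):
        loop_result.append({"entity": entity.text, "type": entity.tag})

# The same traversal written as a list comprehension.
comprehension_result = [
    {"entity": entity.text, "type": entity.tag}
    for sentence in sentences
    for entity in sentence.get_spans("ner")
]

# Both forms visit each entity exactly once, so the amount of work is identical.
total_entities = sum(len(s.get_spans("ner")) for s in sentences)
print(loop_result == comprehension_result, len(loop_result) == total_entities)
```

Both forms do one unit of work per entity, so the comprehension changes the notation, not the complexity.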

Further reading: Are list-comprehensions and functional functions faster than "for loops"?