How to efficiently load 1M+ patterns into a spaCy Matcher at runtime?

I've gone back and forth through the docs, Stack Overflow questions, blogs, and all sorts of other material, and I'm stuck.

I have written a custom matcher that pulls about 3M entries out of my database and turns them into token-matcher patterns for a spaCy Matcher component. I've created a JSONL file with all of the patterns. The problem is that I'm not sure how to pre-load the pipeline with my patterns at runtime so that I can just call the matcher behind an endpoint in a deployed service. Is that even possible with this many patterns? Is there a way to save the state to disk and then reload it at runtime?
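For context, each line of the JSONL file looks roughly like this (the "id" and "patterns" keys are what my component reads; the token patterns themselves are just illustrative):

{"id": "COMPANY_1", "patterns": [[{"LOWER": "openai"}]]}
{"id": "COMPANY_2", "patterns": [[{"LOWER": "acme"}, {"LOWER": "corp"}]]}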

I created my own stateful Language factory component, shown below, with to_disk and from_disk methods as specified in the documentation.

import srsly
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import ensure_path


@Language.factory("dbmatcher", default_config={"data_path": "./company_patterns.jsonl"})
def create_dbmatcher_component(nlp: Language, name: str, data_path: str):
    """Adds dbmatcher to the spaCy pipeline."""
    patterns = load_jsonl(data_path)
    return DBMatcher(nlp, patterns)

class DBMatcher:
    def __init__(self, nlp, patterns):
        self.matcher = Matcher(nlp.vocab)
        self.patterns = patterns
        for pattern in patterns:
            self.matcher.add(pattern["id"], pattern["patterns"], greedy='LONGEST')
        #Token.set_extension("db_match",default=None)
        if not Doc.has_extension("matches"):
            Doc.set_extension("matches", default=[])

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end]
            doc._.matches.append((span, match_id))
        return doc

    def to_disk(self, path, exclude=tuple()):
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        srsly.write_json(path / "data.json", self.patterns)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(path / "data.json")
        return self

    def initialize(self, get_examples=None, nlp=None, data={}):
        self.data = data
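(load_jsonl isn't shown above; assume something like this thin wrapper around srsly.read_jsonl, which yields one parsed dict per line:)

def load_jsonl(data_path):
    # One pattern entry per line: {"id": ..., "patterns": [[...], ...]}
    return list(srsly.read_jsonl(data_path))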

How do I keep this from happening?

I first create a pipeline with the component and write the entire pipeline to disk, like so:

"""Writes spacy pipeline with db matcher to disk for fast instantiation later."""
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("dbmatcher", config={"data_path": "./company_patterns.jsonl"}, last=True)
nlp.to_disk("company_matcher")

The idea is that by writing it to disk and then calling spacy.load(...), the pipeline has already been configured with all of the patterns saved to my matching component, so at runtime it can load efficiently. But it's still taking forever.
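For what it's worth, this is roughly how I'd time the load step in isolation (just a timing sketch, not part of the service code):

import time
import spacy

from matcher_pipeline import DBMatcher  # registers the "dbmatcher" factory

start = time.perf_counter()
nlp = spacy.load("company_matcher")
print(f"spacy.load took {time.perf_counter() - start:.1f}s")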

Why is this happening? Is there any way to efficiently create a matcher that will run in production with over 1M patterns added to it?

Here is my code for loading and running the pipeline; it never gets past spacy.load because it seems to get stuck loading the patterns:

import spacy

from matcher_pipeline import DBMatcher

nlp = spacy.load("company_matcher")
doc = nlp("This is a test with OpenAI")
print(doc._.matches)