spaCy Matcher is not matching a specific pattern


Given the sentence:

txt = "Os edifícios multifamiliares devem ser providos de proteção contra descargas atmosféricas, atendendo ao estabelecido na ABNT NBR 5419 e demais Normas Brasileiras aplicáveis, nos casos previstos na legislação vigente."

import spacy

nlp = spacy.load("pt_core_news_md")
doc = nlp(txt)

And this dictionary with patterns and labels:

patterns = [
    {"label": "COMPONENTE", "pattern": [
        [{"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}, {"POS": "ADJ"}],
        [{"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "ADJ"}],
        [{"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}],  # <<< the problem
        [{"POS": "NOUN", "DEP": "nsubj"}, {"POS": "ADJ"}, {"POS": "ADJ"}],
        [{"POS": "NOUN", "DEP": "nsubj"}],
        [{"POS": "NOUN"}, {"POS": "ADJ"}]
    ]}
]

I am trying to use this function to search across the sentence without reusing tokens already consumed by an earlier match:

from spacy.matcher import Matcher
from spacy.tokens import Span

def buscar_padroes_sequencialmente(doc, patterns):
    resultados = []
    tokens_processados = set()

    for pat in patterns:
        label = pat["label"]
        matcher = Matcher(doc.vocab)
        
        for i, padrao_atual in enumerate(pat["pattern"]):
            matcher.add(f"{label}", [padrao_atual])

        for padrao_id, inicio, fim in matcher(doc):
            rótulo = matcher.vocab.strings[padrao_id]

            # Skip this match if any of its tokens was already consumed
            if any(token.i in tokens_processados for token in doc[inicio:fim]):
                continue

            # Mark the matched tokens as consumed
            tokens_processados.update(token.i for token in doc[inicio:fim])

            # Convert the match to a labelled Span
            span = Span(doc, inicio, fim, label=rótulo)
            resultados.append((rótulo, span))

    return resultados

When I use the function and print the results:

resultados = buscar_padroes_sequencialmente(doc, patterns)

print("Frase:", txt)
for i, (rotulo, span) in enumerate(resultados, start=1):
    
    pos_tokens = [token.pos_ for token in span]

    print(f"OSemantic {i}:", span.text, f'({rotulo})')
    print("POStoken:", pos_tokens)
    print()

I was expecting several results, and specifically this one:

OSemantic 4: proteção contra descargas atmosféricas (COMPONENTE)
POStoken: ['NOUN', 'ADP', 'NOUN', 'ADJ']

But I got this:

OSemantic 4: proteção contra descargas (COMPONENTE)
POStoken: ['NOUN', 'ADP', 'NOUN']

So the function never returns the last token ("atmosféricas") as part of the span, even though it should match the pattern

[{"POS": "NOUN"},{"POS": "ADP"},{"POS": "NOUN"},{"POS": "ADJ"}]

Changing the order of the patterns makes no difference. Only if I remove the pattern [{"POS": "NOUN"},{"POS": "ADP"},{"POS": "NOUN"}] does it find the others.
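What seems to be happening: for overlapping candidates that start at the same token, the Matcher also returns the shorter match, and since the filtering loop in buscar_padroes_sequencialmente consumes tokens in iteration order, whichever match it sees first blocks the longer one. A pure-Python sketch of that consuming loop (the match tuples and token indices are illustrative, not taken from the real doc):

```python
# Simulated Matcher output: two overlapping candidate spans starting at
# the same token, with the three-token match yielded first.
matches = [
    ("COMPONENTE", 7, 10),  # "proteção contra descargas"
    ("COMPONENTE", 7, 11),  # "proteção contra descargas atmosféricas"
]

consumed = set()
kept = []
for label, start, end in matches:
    # Same filter as in buscar_padroes_sequencialmente: skip a match if
    # any of its tokens was already consumed by an earlier match.
    if any(i in consumed for i in range(start, end)):
        continue
    consumed.update(range(start, end))
    kept.append((label, start, end))

print(kept)  # [('COMPONENTE', 7, 10)] – the longer span was blocked
```

Because the shorter span is processed first, the four-token span is discarded as an overlap, which is exactly the symptom above.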


There is 1 answer below.

Answered by Douglas:

I found the answer in the Matcher.add documentation: there is an optional keyword argument greedy, which accepts either "FIRST" or "LONGEST".

Oddly, with the "FIRST" option it keeps the first match found in the sentence, which does not solve my problem, since I wanted matches selected according to the order of the patterns.

With the "LONGEST" option, however, it always keeps the longest match, which solves my problem: the longest span in the phrase wins.

Here's the corrected line:

matcher.add(f"{label}", [padrao_atual], greedy="LONGEST")
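For reference, the selection that greedy="LONGEST" performs can be approximated in plain Python: sort the candidate spans by length (longest first) before consuming tokens. This is a sketch of the idea, not spaCy's actual implementation:

```python
def filter_longest(matches):
    """Keep non-overlapping (label, start, end) spans, preferring
    longer ones – an approximation of greedy="LONGEST"."""
    consumed, kept = set(), []
    # Sort key (start - end, start): longest spans first,
    # ties broken by earlier start position.
    for label, start, end in sorted(matches, key=lambda m: (m[1] - m[2], m[1])):
        if any(i in consumed for i in range(start, end)):
            continue
        consumed.update(range(start, end))
        kept.append((label, start, end))
    return kept

overlapping = [
    ("COMPONENTE", 7, 10),  # "proteção contra descargas"
    ("COMPONENTE", 7, 11),  # "proteção contra descargas atmosféricas"
]
print(filter_longest(overlapping))  # [('COMPONENTE', 7, 11)]
```

Sorting by length before filtering is why the full four-token span "proteção contra descargas atmosféricas" now survives instead of the three-token one.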