Custom segmentation and overriding segmentation rules in spaCy

I want to split a large corpus (.txt) into sentences using a custom delimiter, {SENT}, with spaCy 3.1. For example, "A first part.{SENT}A second part." should yield exactly two sentences.

My main issue is that I want to disable the sentence segmentation that comes with the pretrained spaCy models (e.g. en_core_web_lg) while keeping all the other components (tokenisation, the syntactic parser, NER, etc.). I always use the large models (I have read that segmentation may behave differently depending on the model used).
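
For reference, in the v3 trained pipelines the sentence boundaries come from the parser; a minimal check of what the model ships with (assuming en_core_web_lg 3.1 is installed):

import spacy

nlp = spacy.load("en_core_web_lg")
print(nlp.pipe_names)  # ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print(nlp.disabled)    # ['senter'] - a statistical sentencizer ships with the model but is disabled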

Is there a way to override the existing rules and use only {SENT} as a delimiter while keeping the rest of the pipeline? If I add the custom segmentation component to the pipeline before the parser with nlp.add_pipe("segm", before="parser"), will the parser re-split the sentences according to its own predicted boundaries?

I already tried the following, with no luck:

from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # mark the token right after each delimiter as a sentence start
        if token.text == '{SENT}':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe('segm', before='parser')

Solutions I have tried so far that did not work:

  1. passing spaCy a list of "sentences" produced by split("{SENT}") and re.split("{SENT}")
  2. The answer proposed here
1 Answer

You also have to set token.is_sent_start = False for all the remaining tokens if you don't want the parser to potentially add additional sentence boundaries.
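
Putting both pieces together, here is a minimal sketch of that idea (the component name "segm" and the {SENT} marker come from the question; the example text, and the assumption that the delimiter is surrounded by whitespace, are mine):

import spacy
from spacy.language import Language
from spacy.symbols import ORTH

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        if token.text == "{SENT}":
            # start a new sentence right after every delimiter
            doc[token.i + 1].is_sent_start = True
        else:
            # claim all other boundaries explicitly so the parser
            # cannot insert sentence starts of its own
            doc[token.i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_lg")
# the default tokenizer splits "{SENT}" into "{", "SENT", "}",
# so keep the delimiter together as a single token
nlp.tokenizer.add_special_case("{SENT}", [{ORTH: "{SENT}"}])
nlp.add_pipe("segm", before="parser")

doc = nlp("The first part. {SENT} The second part.")
print([sent.text for sent in doc.sents])

Note that the {SENT} tokens themselves stay in the Doc (attached to the end of the preceding sentence); if you don't want them in your output, strip them from the raw text beforehand or filter them out when iterating over the sentences.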