I want to split a large corpus (.txt) into sentences with spaCy 3.1, using a custom delimiter, {SENT}.
My main issue is that I want to "disable" the sentence segmentation that comes with the pretrained spaCy models, e.g. en_core_web_lg,
but keep all the other components (tokenisation, syntactic parsing, NER, etc.). I always use the large models (I read that segmentation may behave differently depending on the model used).
Is there a way to override the existing rules and use only {SENT} as a delimiter while keeping the rest of the pipeline?
If I add the custom segmentation component to the pipeline before the parser with nlp.add_pipe('segm', before='parser'),
will the parser re-split the sentences based on the boundaries the model predicts?
I already tried the following, with no luck:
import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # mark the token right after each {SENT} marker as a sentence start
        if token.text == '{SENT}':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe('segm', before='parser')
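
To check the result I print the sentence boundaries after the parser has run (a minimal check, continuing from the snippet above; the sample text is a stand-in for my corpus):

doc = nlp("This is one thing, this is another {SENT} and this is a third.")
# sentence boundaries as they stand after the parser has run
print([sent.text for sent in doc.sents])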
Solutions I've tried so far that didn't work:
- providing spaCy with a list of "sentences" obtained with split("{SENT}") and re.split("{SENT}"), roughly as sketched after this list
- the answer proposed here
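
The first bullet looked roughly like this (a sketch; the sample text stands in for the corpus file):

import re
import spacy

nlp = spacy.load("en_core_web_lg")

# stand-in for the contents of the corpus file
text = "This is one thing {SENT} and this is another."

# this yields one separate Doc per chunk rather than
# sentence boundaries inside a single Doc
chunks = [c.strip() for c in re.split(r"\{SENT\}", text)]
docs = list(nlp.pipe(chunks))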
You have to also set token.is_sent_start = False for the remaining tokens if you don't want the parser to potentially add additional sentence boundaries.
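
Putting that together, a minimal sketch of the component with both flags set (assuming {SENT} should survive tokenisation as a single token; the default English tokenizer would likely split the braces off, so a tokenizer special case is added here):

import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # True right after a {SENT} marker, False everywhere else,
        # so the parser cannot add boundaries of its own
        doc[token.i + 1].is_sent_start = token.text == "{SENT}"
    return doc

nlp = spacy.load("en_core_web_lg")
# keep {SENT} intact as a single token
nlp.tokenizer.add_special_case("{SENT}", [{ORTH: "{SENT}"}])
nlp.add_pipe("segm", before="parser")

doc = nlp("This is one thing {SENT} and this is another.")
print([sent.text for sent in doc.sents])

Note that the {SENT} markers stay in the Doc, at the end of each preceding sentence; stripping them out would need a separate preprocessing step.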