I want to split a large corpus (.txt) into sentences with spaCy 3.1, using a custom delimiter, {SENT}.
My main issue is that I want to "disable" the sentence segmentation that comes with the pretrained spaCy models, e.g. en_core_web_lg,
but keep all the other components (tokenisation, syntactic parsing, NER, etc.). I always use the large models (I read that segmentation may behave differently depending on the model used).
Is there a way to override the existing rules and use only {SENT} as a delimiter while keeping the rest of the pipeline?
If I add the custom segmentation component to the pipeline before the parser with nlp.add_pipe('segm', before='parser'),
will the parser re-split the sentences based on the boundaries the model predicts?
I already tried the following, with no luck:
import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # mark the token right after each {SENT} marker as a sentence start
        if token.text == '{SENT}':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe('segm', before='parser')
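
To check the result I print the sentence boundaries after the parser has run (a minimal check, continuing from the snippet above; the sample text is a stand-in for my corpus):

doc = nlp("This is one thing, this is another {SENT} and this is a third.")
# sentence boundaries as they stand after the parser has run
print([sent.text for sent in doc.sents])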
Solutions I've tried so far that didn't work:
- providing spaCy with a list of "sentences" obtained with split("{SENT}") and re.split("{SENT}"), roughly as sketched after this list
- the answer proposed here
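
The first bullet looked roughly like this (a sketch; the sample text stands in for the corpus file):

import re
import spacy

nlp = spacy.load("en_core_web_lg")

# stand-in for the contents of the corpus file
text = "This is one thing {SENT} and this is another."

# this yields one separate Doc per chunk rather than
# sentence boundaries inside a single Doc
chunks = [c.strip() for c in re.split(r"\{SENT\}", text)]
docs = list(nlp.pipe(chunks))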
You have to also set token.is_sent_start = False for the remaining tokens if you don't want the parser to potentially add additional sentence boundaries.
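
Putting that together, a minimal sketch of the component with both flags set (assuming {SENT} should survive tokenisation as a single token; the default English tokenizer would likely split the braces off, so a tokenizer special case is added here):

import spacy
from spacy.attrs import ORTH
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # True right after a {SENT} marker, False everywhere else,
        # so the parser cannot add boundaries of its own
        doc[token.i + 1].is_sent_start = token.text == "{SENT}"
    return doc

nlp = spacy.load("en_core_web_lg")
# keep {SENT} intact as a single token
nlp.tokenizer.add_special_case("{SENT}", [{ORTH: "{SENT}"}])
nlp.add_pipe("segm", before="parser")

doc = nlp("This is one thing {SENT} and this is another.")
print([sent.text for sent in doc.sents])

Note that the {SENT} markers stay in the Doc, at the end of each preceding sentence; stripping them out would need a separate preprocessing step.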