I want to split a large corpus (.txt) into sentences using only a custom delimiter, i.e. {S}. I am working with spaCy 3.1.
Taking the following text as an example, which should be considered as a single sentence:
{S} — Quel âge as -tu? demanda Angel. — Je ne sais pas, — Sais -tu faire la soupe ?{S}
spaCy returns:
{S}
—
Quel âge as
-tu?
demanda Angel.
— Je ne sais pas, —
Sais
-tu faire la soupe ?
I have already tried the following, with no luck:
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        if token.text == '{S}':
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe('segm', first=True)
as well as a rule to treat {S} as a single token:

from spacy.symbols import ORTH

special_case = [{ORTH: "{S}"}]
nlp.tokenizer.add_special_case("{S}", special_case)
You want to use
token.is_sent_start = True
to add sentence boundaries, so something more like:
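A minimal runnable sketch of that idea; it assumes a blank French pipeline (`spacy.blank("fr")`) in place of your `nlp`, and sets `is_sent_start` explicitly for every token so no other component introduces extra boundaries:

```python
import spacy
from spacy.language import Language
from spacy.symbols import ORTH

@Language.component("segm")
def set_custom_segmentation(doc):
    for token in doc[:-1]:
        # The token right after a {S} marker starts a new sentence;
        # every other candidate boundary is explicitly suppressed.
        doc[token.i + 1].is_sent_start = token.text == "{S}"
    return doc

# Assumption: a blank pipeline is enough here, since the sentence
# boundaries come entirely from the {S} markers, not from a parser.
nlp = spacy.blank("fr")
nlp.tokenizer.add_special_case("{S}", [{ORTH: "{S}"}])
nlp.add_pipe("segm", first=True)

doc = nlp("{S} — Quel âge as -tu? demanda Angel. — Je ne sais pas, — Sais -tu faire la soupe ?{S}")
sents = [sent.text for sent in doc.sents]
```

With this, everything between the two {S} markers comes back as one sentence instead of being split at the dashes and question marks.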