Goal: First split one token into two tokens, then use a SpanRuler to label the two re-tokenized tokens as a single span with one label.

Problem: The labeled span's text is the original text (the single token) rather than the two tokens joined by a separating space (i.e. the text after re-tokenization).

What I did:

  1. I added a custom token-splitting component as the first pipeline stage. It correctly splits the single token into two tokens.

  2. I then detect the two split tokens using a SpanRuler. Note that the SpanRuler matches when the pattern is two separate tokens (i.e. pattern=['abc', 'efg']), and correctly matches nothing when the pattern is the original single token (pattern='abcefg').

Note that the custom retokenizer respects spaCy's non-destructive retokenization.
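For context on why this matters: because tokenization is non-destructive, doc.text always remains the original input string, and a span's .text is assembled from each token's text plus the trailing whitespace that token had in the original input, not joined with artificial spaces. A rough pure-Python sketch of that assembly (the Tok class and span_text function here are hypothetical illustrations, not spaCy's actual implementation):

```python
from dataclasses import dataclass

# Hypothetical stand-in for a spaCy token: its text plus the trailing
# whitespace it had in the original string (spaCy exposes this as
# token.whitespace_).
@dataclass
class Tok:
    text: str
    whitespace: str

def span_text(tokens):
    # Join each token's text with its original trailing whitespace,
    # then drop any whitespace trailing the final token.
    return "".join(t.text + t.whitespace for t in tokens).rstrip()

# Splitting 'abcefg' into 'abc' and 'efg' adds no whitespace between them,
# because none existed in the original input:
print(span_text([Tok("abc", ""), Tok("efg", "")]))   # -> 'abcefg'
# Tokens that were space-separated in the input keep their space:
print(span_text([Tok("abc", " "), Tok("efg", "")]))  # -> 'abc efg'
```

Under this reading, 'abcefg' (rather than 'abc efg') would be the expected span text after the split, since the split cannot invent whitespace that was never in the input.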

Thanks for any help.

Minimal Reproducible Example:

import spacy
from spacy.language import Language

@Language.component('splitter')
def splitter(doc):
    with doc.retokenize() as retokenizer:
        retokenizer.split(doc[0], ['abc', 'efg'], heads=[doc[0], doc[0]])
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('splitter', first=True)
sp_ruler = nlp.add_pipe('span_ruler')
sp_ruler.add_patterns([{'label': 'testing', 'pattern': [{'TEXT': 'abc'}, {'TEXT': 'efg'}]}])
    
doc = nlp('abcefg')

print([(tok.text, i) for i, tok in enumerate(doc)])
print([(type(span), span.text, span.label_) for span in doc.spans["ruler"]])
print(len(doc.spans['ruler']))

Actual Output:

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abcefg', 'testing')]
> 1

Expected output:

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abc efg', 'testing')]  # notice the space in the text, expected due to custom re-tokenization
> 1
