Entities containing an underscore character are split into multiple entities by TokensAnnotation in CoreNLP

I am observing that CoreNLP 3.9.2 has started splitting enti_ties into multiple tokens ('enti', '_', 'ties') while tokenizing.

I have tried the tokenize.whitespace option, which solves this problem. But I think it will also stop the tokenizer from splitting contractions such as "can't" and "don't".
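For reference, a minimal sketch of that whitespace setting, using only java.util.Properties. The class name WhitespaceTokenizerConfig and the annotator list are assumptions for illustration; the resulting Properties object would be handed to the StanfordCoreNLP constructor when CoreNLP is on the classpath.

```java
import java.util.Properties;

final class WhitespaceTokenizerConfig {

    // Builds pipeline properties that make the tokenizer split on whitespace only.
    static Properties build() {
        Properties props = new Properties();
        // Assumed annotator list; adjust to whatever the real pipeline needs.
        props.setProperty("annotators", "tokenize,ssplit");
        // With this on, "enti_ties" stays whole, but so do contractions like "can't".
        props.setProperty("tokenize.whitespace", "true");
        return props;
    }

    public static void main(String[] args) {
        // With CoreNLP available: new StanfordCoreNLP(build())
        System.out.println(build().getProperty("tokenize.whitespace"));
    }
}
```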

There is 1 best solution below

One thing you can do is replace the underscores (_) with periods (.) before running the pipeline; the parser (and, I believe, the tokenizer) will then treat the string as one entity.

E.g. enti_ties becomes enti.ties, and the latter is retained as one entity.

This doesn't entirely resolve the problem, but serves as a workaround in a pinch.
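The substitution can be sketched in plain Java, as a pre- and post-processing step around the pipeline. This is only an illustration: the class UnderscoreMasker and its methods are hypothetical names, no CoreNLP calls are made, and whether the tokenizer really keeps "enti.ties" whole should be checked against the CoreNLP version in use. The sketch also assumes the text contains no genuine dotted word that collides with a masked one (e.g. both "a_b" and "a.b").

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnderscoreMasker {

    // Alphanumeric runs joined by single underscores, e.g. "enti_ties" or "a_b_c".
    private static final Pattern UNDERSCORED_WORD =
            Pattern.compile("[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+");

    // Remembers masked form -> original form so tokens can be restored later.
    private final Map<String, String> originals = new HashMap<>();

    // Replaces inner underscores with periods; the text length is unchanged,
    // so character offsets reported for the masked text still fit the original.
    String mask(String text) {
        Matcher m = UNDERSCORED_WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            String masked = m.group().replace('_', '.');
            originals.put(masked, m.group());
            out.append(masked);
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Maps a token back to its underscored form if it was masked; otherwise returns it as is.
    String restore(String token) {
        return originals.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        UnderscoreMasker masker = new UnderscoreMasker();
        // Feed the masked text to the pipeline, then call restore() on each token's text.
        System.out.println(masker.mask("Some enti_ties, can't split."));
        System.out.println(masker.restore("enti.ties"));
    }
}
```

Because only the masked text goes through the default tokenizer, contractions such as "can't" are still handled as usual, which is the behaviour that whitespace-only tokenization gives up.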