Entities containing an underscore character are split into multiple entities by TokensAnnotation in CoreNLP

149 Views Asked by Ishant Wankhede At 25 July 2019 at 13:33

I am observing that CoreNLP 3.9.2 has started splitting entities such as enti_ties into multiple ones ('enti', '_', 'ties') while tokenizing.

I have tried the tokenize.whitespace option, which solves this problem. But I think it will also stop the tokenizer from splitting tokens such as "can't" and "don't".

abstrakkt On 16 January 2020 at 04:10 BEST ANSWER

One thing you can do is replace each underscore (_) with a period (.); the parser (and the tokenizer, I believe) will then interpret the result as one entity.

E.g. enti_ties > enti.ties, where the latter is retained as one entity.

This doesn't entirely resolve the problem, but serves as a workaround in a pinch.
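
As a rough illustration, the replacement could be done as a preprocessing step before the text is handed to CoreNLP. This is only a sketch: the class name `UnderscoreMask` and the method `mask` are made up for this example and are not part of CoreNLP. It also assumes that only underscores sitting between two letters or digits should be replaced.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: rewrites the input text
// before it is passed to the pipeline.
final class UnderscoreMask {
    // Matches an underscore that sits between two ASCII letters or digits.
    // Leading, trailing, and standalone underscores are left alone.
    private static final Pattern INNER_UNDERSCORE =
            Pattern.compile("(?<=[A-Za-z0-9])_(?=[A-Za-z0-9])");

    private UnderscoreMask() {}

    // Replaces each word-internal underscore with a period,
    // e.g. "enti_ties" becomes "enti.ties".
    static String mask(String text) {
        return INNER_UNDERSCORE.matcher(text).replaceAll(".");
    }
}
```

Calling `UnderscoreMask.mask("enti_ties")` gives `enti.ties`, which you would then annotate instead of the raw text. Because one character is swapped for one character, the character offsets of the tokens still line up with the original string, so the original spelling can be read back from the unmodified text if you need it. Whether the period form actually stays in one piece may depend on the CoreNLP version and tokenizer options, so test it on your own data.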