How to identify n-grams before tokenization in Stanford CoreNLP?

365 Views Asked by Trinadh Gupta At 19 December 2016 at 22:08

I am using the CoreNLP annotation pipeline with default settings, all the way from tokenization through NER tagging. I noticed that the tokenizer splits, say, "Vice President" into two separate tokens, {Vice, President}, so the NER tags come out as {O, TITLE} rather than a single token {Vice President} tagged {TITLE}. How can I get the tokenizer to treat "Vice President" as a single token, so that the NER tagger identifies titles correctly?

There is 1 best solution below

Gabor Angeli On 22 December 2016 at 06:16

What properties are you using to get TITLE as an NER tag? This is not one of the standard tags, and if you're using the TokensRegexNER annotator (e.g., for the kbp annotator) multi-word titles like 'vice president' should be picked up. It works on corenlp.run at least.

In general, it's not the tokenizer's job to collapse NER spans into a single mention. The tokenizer should separate 'vice' and 'president' into different tokens, both of which should be marked TITLE by an appropriate NER annotator. You may be interested in the entitymention annotator, which groups contiguous NER tags into NER mentions -- this would give you 'vice president' as a single mention, rather than two tokens both marked as TITLE. These mentions can be retrieved using the mentions annotation on a sentence CoreMap, or using the List<String> mentions(String nerTag) or List<String> mentions() functions in the simple API.
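
To make the idea of grouping contiguous NER tags concrete, here is a minimal, self-contained sketch in plain Java. It does not call CoreNLP and is not how the library implements its annotator; the class name `MentionGrouper`, the method `groupMentions`, and the hand-assigned tags are all hypothetical, chosen only to illustrate how two tokens that are both tagged TITLE can be reported as one mention while the tokenizer still keeps them separate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration only: this is NOT CoreNLP's entity-mention code.
class MentionGrouper {

    // Tag carried by tokens that are outside any entity.
    static final String OUTSIDE = "O";

    /**
     * Joins each run of adjacent tokens that share the same non-"O" tag
     * into a single space-separated mention string.
     */
    static List<String> groupMentions(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = OUTSIDE;
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(OUTSIDE) && tag.equals(currentTag)) {
                // Same entity type as the previous token: extend the open mention.
                current.append(' ').append(words.get(i));
            } else {
                // Tag changed: close the open mention, if there is one.
                if (current.length() > 0) {
                    mentions.add(current.toString());
                    current.setLength(0);
                }
                // Open a new mention unless this token is outside any entity.
                if (!tag.equals(OUTSIDE)) {
                    current.append(words.get(i));
                }
            }
            currentTag = tag;
        }
        // Flush a mention that runs to the end of the sentence.
        if (current.length() > 0) {
            mentions.add(current.toString());
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Hypothetical tokens and tags, as an NER annotator might assign them.
        List<String> words = Arrays.asList("Vice", "President", "Biden", "spoke");
        List<String> tags = Arrays.asList("TITLE", "TITLE", "PERSON", "O");
        System.out.println(groupMentions(words, tags)); // prints [Vice President, Biden]
    }
}
```

One limitation of this simplified version is that two distinct entities of the same type that happen to be adjacent would be merged into one mention.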
