N-gram vectorization: what should I do with a new token that does not exist in the corpus?

196 Views Asked by Ph0en1x At 20 October 2016 at 13:38

I'm building a custom n-gram vectorizer for a bag-of-words model. I'm curious: what should I do if, while vectorizing a short text, I find a new token that does not exist in the corpus vocabulary? Should it just be skipped, or is there a better option?


There is 1 best solution below

Aaron On 21 October 2016 at 00:14 BEST ANSWER

You can either skip it or add a special token to the vocabulary for unknown words: every previously unseen word is replaced with "UNK", which you then count just like any other word. To deal with the problem of not having any UNKs in the training data, you can also replace every word that occurs only once in the corpus with UNK.
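As a rough illustration, the UNK approach could be sketched in plain Python as follows. This is only a minimal sketch under assumptions: whitespace tokenization, and the names `UNK`, `fit`, `transform`, and `min_count` are hypothetical choices made for this example, not part of any library. Words seen fewer than `min_count` times in the corpus are mapped to UNK during fitting, and unseen words are mapped to UNK when a new text is vectorized.

```python
from collections import Counter

UNK = "UNK"  # hypothetical placeholder for rare or unseen words


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalize(text, known):
    """Split on whitespace and replace every word outside `known` with UNK."""
    return [tok if tok in known else UNK for tok in text.split()]


def fit(corpus, n=1, min_count=2):
    """Build the set of known words and the n-gram feature index.

    Words occurring fewer than `min_count` times are treated as UNK,
    so UNK n-grams get counts from the training data as well.
    """
    counts = Counter(tok for doc in corpus for tok in doc.split())
    known = {tok for tok, c in counts.items() if c >= min_count}
    feats = sorted({g for doc in corpus for g in ngrams(normalize(doc, known), n)})
    index = {g: i for i, g in enumerate(feats)}
    return known, index


def transform(text, known, index, n=1):
    """Count n-grams of `text`; unseen words are counted as UNK."""
    vec = [0] * len(index)
    for g in ngrams(normalize(text, known), n):
        # For n > 1 an n-gram can still be new after UNK replacement;
        # this sketch simply skips it (the other option from the answer).
        if g in index:
            vec[index[g]] += 1
    return vec


corpus = ["the cat sat", "the dog sat", "the cat ran"]
known, index = fit(corpus)           # "dog" and "ran" occur once -> UNK
print(index)                         # {'UNK': 0, 'cat': 1, 'sat': 2, 'the': 3}
print(transform("the bird sat", known, index))  # [1, 0, 1, 1]
```

Here "bird" never appears in the corpus, so it lands in the UNK column instead of being dropped. Whether that is better than skipping depends on whether the presence of unknown words is itself a useful signal for the model.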
