Does anyone have suggestions on how to approach w2v (using TensorFlow, not gensim) with a corpus that contains both compound and non-compound nouns, specifically around animal names (in English)? For example "red panda", "flying fox", and "elephant seal" – while the corpus also contains "panda", "fox", "elephant" and "seal", so I'd want those to remain separate tokens.
Any ideas?
Word2vec only knows the tokens you pass it. If you want some multiword entities to be a single learned word-vector, you need to preprocess your text to combine them, for example changing the original two tokens `['red', 'panda']` (as they appear in a larger context) into `['red_panda']` instead.

There are many potential ways to do this. Gensim includes a `Phrases` tool which can use statistical measures of how often pairs of tokens appear together, versus independently, to merge such pairs (aka 'bigrams') into single tokens, with tunable thresholds. (This technique was described alongside the original word2vec publication.) Running this pairing-up step multiple times can even construct trigrams, quadgrams, etc. – and the resulting text does often work better for various classification/info-retrieval goals.

But the results are often unaesthetic: no matter how well it's tuned, there will be some multigrams that don't make sense to a human reviewer, and others you'd wish it had combined that it isn't sophisticated enough to catch. So you wouldn't really show the results to end-users/lay-users, just use them as the raw text for other evaluable processes.
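As a minimal sketch of that `Phrases` workflow – the sentences and the `min_count`/`threshold` values here are illustrative toys, and in practice you'd tune them against a realistically large corpus:

```python
from gensim.models.phrases import Phrases

# Pre-tokenized corpus; statistical pair-merging needs far more data than this.
sentences = [
    ['the', 'red', 'panda', 'climbed', 'the', 'tree'],
    ['a', 'red', 'panda', 'is', 'not', 'a', 'giant', 'panda'],
    ['the', 'flying', 'fox', 'is', 'a', 'bat', 'not', 'a', 'fox'],
]

# Learn which adjacent token pairs co-occur often enough to merge.
# Thresholds set very low only so this toy corpus triggers merges.
bigram = Phrases(sentences, min_count=1, threshold=1)

# Applying the model rewrites qualifying pairs as underscore-joined tokens,
# e.g. ['the', 'red_panda', 'climbed', 'the', 'tree'].
merged = [bigram[s] for s in sentences]

# A second pass over the merged output can then build trigrams, etc.
trigram = Phrases(merged, min_count=1, threshold=1)
merged_again = [trigram[s] for s in merged]
```

The merged token lists can then be fed to whatever word2vec implementation you're training with, including a TensorFlow one – the pair-merging is pure preprocessing, independent of the training library.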
There are many other systems for 'entity recognition' or 'noun-phrase detection' that might work, depending on your aims. If your domain is sufficiently well-understood that you can list all the desired combos – e.g. all common two-word animal names – a mechanistic replace might be sufficient (see the sketch below). Deeper neural-network language models, with their better understanding of grammar and context, can also coalesce multi-word entities into single tokens, at much higher training/inference resource cost.
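A hedged sketch of that mechanistic replace, assuming you can enumerate the compounds you care about – the `COMPOUND_ANIMALS` table and `merge_known_compounds` helper are hypothetical names, not from any library:

```python
# Known two-word compounds mapped to their single-token replacements.
COMPOUND_ANIMALS = {
    ('red', 'panda'): 'red_panda',
    ('flying', 'fox'): 'flying_fox',
    ('elephant', 'seal'): 'elephant_seal',
}

def merge_known_compounds(tokens):
    """Scan a token list, joining any listed adjacent pair into one token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUND_ANIMALS:
            out.append(COMPOUND_ANIMALS[pair])
            i += 2  # consume both tokens of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out

# Standalone 'panda', 'fox', etc. pass through untouched, so they keep
# their own separate tokens (and thus their own word-vectors).
print(merge_known_compounds(['the', 'red', 'panda', 'met', 'a', 'panda']))
# ['the', 'red_panda', 'met', 'a', 'panda']
```

This gives you exact control over which compounds become single tokens, at the cost of having to maintain the list yourself.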