Pre-trained vectors, NLP, word2vec, word embeddings for a particular topic?

Is there any pre-trained vector set for a particular topic only? For example "java": I want only Java-related vectors in the file. I mean, if I give the input "inheritance", then cosine similarity should show me "polymorphism" and other related terms only. I am using GoogleNews-vectors-negative300.bin and GloVe vectors, but I'm still not getting related words.

Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so the vector for 'Java' may be dominated by stories about the Java island of Indonesia as much as the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word senses you need.
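You can check this directly: load the same GoogleNews-vectors-negative300.bin file from the question with gensim's KeyedVectors (a minimal sketch, assuming gensim 4.x) and look at the neighbors of 'Java' to see which sense dominates:

```python
from gensim.models import KeyedVectors

# Load the pre-trained vectors mentioned in the question (word2vec binary format)
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Expect a mix of geography/news neighbors, not just programming concepts
print(kv.most_similar("Java", topn=10))
```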
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
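For example, a minimal gensim training sketch (gensim 4.x; the two-sentence corpus below is a toy placeholder for your real domain texts):

```python
from gensim.models import Word2Vec

# Toy stand-in for a real domain corpus: each item is one tokenized
# text drawn from Java books, docs, Stack Overflow posts, etc.
corpus = [
    ["inheritance", "lets", "a", "subclass", "reuse", "superclass", "behavior"],
    ["polymorphism", "builds", "on", "inheritance", "via", "method", "overriding"],
    # ... thousands more tokenized texts in practice
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # gensim 4.x name ('size' in older 3.x releases)
    window=5,
    min_count=1,      # 1 only so this toy corpus works; use >=5 on real data
    workers=4,
    epochs=10,
)

# Nearest neighbors reflect how *this* corpus uses the word
print(model.wv.most_similar("inheritance", topn=5))
```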
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' – which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word phrases that should be combined; (2) statistical analysis of word-pairs that occur together so often they should be combined; or (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm, which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and will still produce some combinations that a person would consider nonsense, while missing others that a person would consider natural.)
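A sketch of that statistical approach, using gensim 4.x's Phrases class; the min_count and threshold values are placeholders you'd need to tune on real data:

```python
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

# 'corpus' is a list of token lists, as in the training sketch above
phrases = Phrases(
    corpus,
    min_count=5,       # ignore rare word-pairs
    threshold=10.0,    # higher => fewer, more confident combinations
    connector_words=ENGLISH_CONNECTOR_WORDS,
)

# Frequent pairs such as ('input', 'inheritance') are rewritten as the
# single token 'input_inheritance' before Word2Vec training
phrased_corpus = [phrases[sentence] for sentence in corpus]
```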
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code, you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
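If the token embeddings you download from there come in the standard word2vec text format, they can be loaded with gensim the same way as above; the filename here is just a placeholder for whatever file you actually fetch:

```python
from gensim.models import KeyedVectors

# 'token_vecs.txt' is a placeholder name for token embeddings downloaded
# from https://code2vec.org/, assumed to be in plain word2vec text format
token_vecs = KeyedVectors.load_word2vec_format("token_vecs.txt", binary=False)

print(token_vecs.most_similar("inheritance", topn=10))
```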