I want to compare the performance of ELMo and word2vec as word embeddings for a CNN model by classifying 4,000 tweets into five class labels, but the results show that ELMo performs worse than word2vec.
I used ELMoForManyLangs for ELMo, and word2vec pretrained on 1 million tweets.
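For reference, here is a minimal sketch of how I produce the two embedding inputs with these libraries (paths, dimensions and the max length are placeholders, not my exact code):

```python
import numpy as np
from elmoformanylangs import Embedder          # ELMo: contextual, 1024-dim vectors
from gensim.models import KeyedVectors          # word2vec: static vectors

MAX_LEN = 40                                    # assumed max tokens per tweet

def pad(matrix, max_len, dim):
    """Pad/truncate a (seq_len, dim) matrix to (max_len, dim)."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    out[:min(len(matrix), max_len)] = matrix[:max_len]
    return out

# ELMo: sents2elmo takes tokenised sentences and returns one (seq_len, 1024)
# array per sentence (output_layer=-1 averages the three ELMo layers).
elmo = Embedder('/path/to/elmoformanylangs_model')      # placeholder path
def elmo_features(tokenised_tweets):
    reps = elmo.sents2elmo(tokenised_tweets, output_layer=-1)
    return np.stack([pad(r, MAX_LEN, 1024) for r in reps])

# word2vec: look up each token's static vector; tweets with no known tokens
# fall back to a single zero vector.
w2v = KeyedVectors.load_word2vec_format('/path/to/tweet_w2v.bin', binary=True)  # placeholder path
def w2v_features(tokenised_tweets):
    dim = w2v.vector_size
    mats = [np.array([w2v[t] for t in tweet if t in w2v] or [np.zeros(dim)])
            for tweet in tokenised_tweets]
    return np.stack([pad(m, MAX_LEN, dim) for m in mats])

# Both functions yield (n_tweets, MAX_LEN, dim) tensors that feed the same
# 1D-CNN classifier, so only the embeddings differ between the two runs.
```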
The results show that both models are overfitting, but why would ELMo be worse than word2vec?
From the elmoformanylangs project you've linked, it looks like your generic ELMo model was trained "on a set of 20-million-words data randomly sampled from the raw text released by the shared task (wikidump + common crawl)". Given that many tweets are longer than 20 words, your 1-million-tweet training set for word2vec may well be a larger corpus than the one used for the ELMo model. And, coming from actual tweets, it likely also reflects the words and word senses used in tweets better than generic wikidump/common-crawl text.

Given that, I'm not sure why you'd have expected the ELMo approach to necessarily be better.
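(If you want to confirm the size comparison, a quick token count over your word2vec training corpus will do; `tokenised_tweets` below is assumed to be whatever list of token lists you trained word2vec on:)

```python
# Back-of-envelope check: at ~20 tokens per tweet, 1M tweets already rival the
# ~20M words the generic ELMo model was trained on.
total_tokens = sum(len(tweet) for tweet in tokenised_tweets)
print(f"word2vec corpus: {total_tokens:,} tokens (ELMo model: ~20,000,000 words)")
```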
But also, as you've noted, the fact that your classifier performs worse with more training is highly indicative of extreme overfitting. You may want to fix that before attempting to reason any further about the relative merits of the different approaches. (When both classifiers are massively broken, exactly why one's brokenness is a bit better than the other's should be a fairly moot point. After they're both fixed to do as well as they can, then the remaining difference may be interesting to choose between, or understand deeply.)
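The usual first levers against that kind of overfitting are dropout, weight regularization, and early stopping on a held-out validation split. A minimal sketch, assuming a Keras 1D-CNN over pre-computed embedding matrices (layer sizes and hyperparameters here are illustrative, not tuned):

```python
from tensorflow.keras import layers, models, regularizers, callbacks

def build_cnn(max_len=40, dim=1024, n_classes=5):
    # Simple 1D-CNN over pre-computed embedding matrices (ELMo or word2vec).
    model = models.Sequential([
        layers.Input(shape=(max_len, dim)),
        layers.Conv1D(128, 5, activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),  # weight decay
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.5),                                      # drop half the features
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Stop training when validation loss stops improving, and keep the best weights.
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=30, batch_size=32, callbacks=[early_stop])
```

With 4,000 tweets and five classes, the validation curves after changes like these should tell you much more about the real ELMo-vs-word2vec difference than the current overfit runs do.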