Why does my ELMo-CNN model give worse performance than Word2vec?


I want to compare the performance of ELMo and word2vec as word embeddings with a CNN model by classifying 4,000 tweets into five class labels, but the results show that ELMo performs worse than word2vec.

I used ELMoformanylangs for ELMo and trained word2vec on 1 million tweets.
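For reference, this is a minimal sketch of how I produce the two kinds of embeddings, assuming gensim 4.x and the ELMoformanylangs `Embedder` API; the model path and the toy tokenized tweets are placeholders for my actual data:

```python
from gensim.models import Word2Vec
from elmoformanylangs import Embedder

# tokenized_tweets: list of token lists, e.g. [["this", "is", "a", "tweet"], ...]
tokenized_tweets = [["this", "is", "a", "tweet"], ["another", "example", "tweet"]]

# word2vec: trained from scratch on the tweet corpus (here, a toy corpus).
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5,
               min_count=1, workers=4)
w2v_vectors = [[w2v.wv[token] for token in tweet] for tweet in tokenized_tweets]

# ELMo: contextual vectors from the pretrained ELMoformanylangs model.
# '/path/to/elmo_model' is a placeholder for the downloaded model directory.
elmo = Embedder('/path/to/elmo_model')
elmo_vectors = elmo.sents2elmo(tokenized_tweets)  # one (n_tokens, 1024) array per tweet
```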

[Figure: loss curves of the word2vec-CNN model]

[Figure: loss curves of the ELMo-CNN model]

The loss curves show that both models are overfitting, but why would ELMo perform worse than word2vec?

1 Answer

Answer by gojomo:

From the elmoformanylangs project you've linked, it looks like your generic ELMo model was trained "on a set of 20-million-words data randomly sampled from the raw text released by the shared task (wikidump + common crawl)".

Given that many tweets are longer than 20 words, your 1-million-tweet training set for word2vec may amount to more training data than was used for the ELMo model. And, coming from actual tweets, it may also reflect the words and word senses used in tweets better than generic wikidump/common-crawl text.

Given that, I'm not sure why you'd have expected the ELMo approach to necessarily be better.

But also, as you've noted, the fact that your classifier performs worse with more training is highly indicative of extreme overfitting. You may want to fix that before attempting to reason any further about the relative merits of the different approaches. (When both classifiers are massively broken, exactly why one's brokenness is a bit better than the other's should be a fairly moot point. After they're both fixed to do as well as they can, the remaining difference may be interesting to choose between, or to understand deeply.)
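As one hypothetical illustration of the kind of fix meant here, this is a sketch of common overfitting countermeasures in a Keras text CNN: dropout, L2 regularization, and early stopping. The layer sizes, `max_len`, and variable names are assumptions, not your actual architecture; plug in your own embedding dimension and data.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

num_classes = 5          # five tweet labels, as in the question
embedding_dim = 100      # assumed; 1024 for ELMo vectors
max_len = 50             # assumed maximum tweet length

model = models.Sequential([
    layers.Input(shape=(max_len, embedding_dim)),   # pre-computed embeddings as input
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                            # dropout to reduce overfitting
    layers.Dense(32, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # assumes integer class labels
              metrics=["accuracy"])

# Early stopping halts training when validation loss stops improving,
# which directly targets the divergence visible in the loss curves.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=30, batch_size=32, callbacks=[early_stop])
```

Once both models train with a stable validation loss, the comparison between the ELMo and word2vec inputs becomes much more meaningful.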