I'm new to machine learning. I want to calculate the similarity between two documents in different languages (e.g. a Vietnamese document and an English document).
I know that for comparing multilingual words we can use transvec with word2vec. Is something similar possible with doc2vec, and how would I solve this problem with it? (I currently train doc2vec with gensim.)
The `Doc2Vec` model in Gensim is oblivious to languages. It just applies the very-word2vec-like 'Paragraph Vector' algorithm to learn vectors for runs-of-tokens (documents) that are helpful in predicting words, either alone (pure DBOW mode) or in combination with nearby-word-to-nearby-word info (the DM modes). So, whether it'd work on a multilingual corpus for any particular purpose – such as detecting when two documents in different languages cover similar topics – will depend entirely on how you train the model, and especially on the kinds of documents and word-to-word correlations it sees in its training set.
While I've not run the experiments, from my understanding of the algorithm, I would expect it might work only if the training corpus itself provides multilingual hints – for example, documents that mix both languages, or parallel texts covering the same topics.
A model trained on only monolingual examples, even a very large one, could tend to work great on English-to-English doc comparisons – putting all English docs in one giant region of the vector space – and also work great on Vietnamese-to-Vietnamese doc comparisons – putting all Vietnamese docs in an arbitrarily different giant region of the vector space. But even an English doc and a Vietnamese doc about the same thing could have very different vectors, because nothing in the training data ever hinted that their words covered the same things.
Ultimately, though, you'd need to experiment to see how well it would work, and how much you could help it to work, by ensuring it has useful multilingual hints of cross-language topics.
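Whatever the training setup, the comparison itself is typically just cosine similarity between the two documents' vectors. A small self-contained sketch, with made-up 3-dimensional vectors standing in for what `model.infer_vector(...)` would return:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical doc-vectors, as might come from two calls to infer_vector()
vec_en = np.array([0.2, 0.9, -0.1])
vec_vi = np.array([0.25, 0.8, -0.05])

print(round(cosine_similarity(vec_en, vec_vi), 3))  # ≈ 0.995
```

A high value here only means the vectors point in similar directions – whether that reflects similar *topics* across languages depends entirely on the training concerns above.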
Update re your question about `transvec`: I wasn't aware of that library; it looks neat, and seems it may be a variant of, and possibly better than, the `TranslationMatrix` model in Gensim. Both `transvec` and Gensim's `TranslationMatrix` are general tools for learning mappings between separate vector spaces, once you provide a set of known correlated anchors. As such, they could allow an alternative approach to your goal:
1. Create a `Doc2Vec` model using only English documents, create a separate `Doc2Vec` model using only Vietnamese documents, and ensure they're both individually sensible – trained on enough data and giving reasonable results in ad hoc or rigorous evaluations.
2. Then, using some good 'gold standard' set of English & Vietnamese document pairs that "should" have the same doc-vector – for example, because they are good translations of each other – use `transvec` to learn how to translate one model's vectors (either from the original training set or from later inferences) into the other model's space, for direct comparison against that space's vectors.

As a vague rule of thumb, I believe you'd want many more anchor pairs than there are dimensions in the model. (That is: a mere 100 1-to-1 examples is unlikely to be sufficient to learn a good mapping between two 300-dimensional spaces – there's too much extra slack/variance on each end – but a thousand or several thousand examples might work well.)
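The core idea behind such mapping tools can be sketched in plain numpy: given anchor pairs, learn the linear map `W` that best carries one space's vectors into the other, then project new vectors through it. Everything below is synthetic stand-in data (I'm not showing the actual `transvec` or `TranslationMatrix` API), just to illustrate the technique and the anchors-vs-dimensions rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10
n_anchors = 200  # many more anchor pairs than dimensions, per the rule of thumb

# Synthetic stand-ins: pretend vi_vecs are Vietnamese doc-vectors, and the
# English space is an unknown linear transform of it, plus a little noise.
vi_vecs = rng.normal(size=(n_anchors, dim))
true_map = rng.normal(size=(dim, dim))
en_vecs = vi_vecs @ true_map + rng.normal(scale=0.01, size=(n_anchors, dim))

# Learn W minimizing ||vi_vecs @ W - en_vecs|| by least squares – the kind
# of fit TranslationMatrix-style tools perform internally.
W, *_ = np.linalg.lstsq(vi_vecs, en_vecs, rcond=None)

# A new Vietnamese doc-vector can now be projected into the English space
# and compared there (e.g. by cosine similarity) against English doc-vectors.
new_vi = rng.normal(size=dim)
projected = new_vi @ W
```

With 200 anchors for a 10-dimensional space, the recovered `W` lands very close to the true transform; with only a handful of anchors it would not – which is the intuition behind wanting far more anchor pairs than dimensions.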
But of course, the real answer will come through experiments on your data, for your goals.