Doc2Vec most similar vectors don't match an input vector


I've got a dataset of job postings with about 40,000 records. I extracted skills from the descriptions using NER, with a dictionary of about 30,000 skills. Every skill is represented as a unique identifier.

The distribution of the number of skills per posting looks like this:

mean 15.12 | std 11.22 | min 1.00 | 25% 7.00 | 50% 13.00 | 75% 20.00 |

I've trained a word2vec model using only skill ids, and it works reasonably well. I can find the most similar skills to a given one, and the results look okay.

But when it comes to the doc2vec model, I'm not satisfied with the results.

I have about 3,200 unique job titles; most of them have only a few entries, and quite a few of them are from the same field ('front end developer', 'senior javascript developer', 'front end engineer'). I deliberately went for a variety of job titles, which I use as tags in doc2vec.TaggedDocument(). My goal is to see a number of relevant job titles when I input a vector of skills into docvecs.most_similar().

After training a model (I've tried different numbers of epochs (100, 500, 1000) and vector sizes (40 and 100)), it sometimes works correctly, but most of the time it doesn't. For example, for a skill set like [numpy, postgresql, pandas, xgboost, python, pytorch], the most similar job title I get has a skill set like [family court, acting, advising, social work].

Could the problem be the size of my dataset? Or the size of the docs (I consider my texts short)? I also suspect I'm misunderstanding something about the doc2vec mechanism and simply overlooking it. I'd also like to ask whether you know any other, perhaps more advanced, ways to get relevant job titles from a skill set, and to compare two skill-set vectors to tell whether they are close or far apart.

UPD:

Job titles from my data are 'tags' and skills are 'words'. Each text has a single tag. There are 40,000 documents with 3,200 repeating tags. 7,881 unique skill ids appear in the documents. The average number of skill words per document is 15.

My data example:

         job_titles                                             skills
1  business manager                 12 13 873 4811 482 2384 48 293 48
2    java developer      48 2838 291 37 484 192 92 485 17 23 299 23...
3    data scientist      383 48 587 475 2394 5716 293 585 1923 494 3

The example of my code:

import gensim

def tagged_document(df):
    # tag each document with its job title
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [row['job_titles']])


data_for_training = list(tagged_document(job_data[['job_titles', 'skills']]))

model_d2v = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)
model_d2v.build_vocab(data_for_training)
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)

# the skill set contains related skills which represent a front end developer
skillset_ids = '12 34 556 453 1934'.split()
new_vector = model_d2v.infer_vector(skillset_ids, epochs=100)
model_d2v.docvecs.most_similar(positive=[new_vector], topn=30)

I've been experimenting recently and noticed that the model performs a little better if I filter out documents with fewer than 10 skills. Still, some irrelevant job titles come up.
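As for comparing two skill-set vectors: once both sets have been turned into vectors (e.g. via infer_vector), closeness reduces to cosine similarity. A minimal numpy sketch; the vectors below are made-up stand-ins, not real inferred vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = same direction, 0 = orthogonal, negative = opposed."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-in vectors; in practice these would come from model_d2v.infer_vector(...)
vec_frontend_1 = np.array([0.9, 0.1, 0.3])
vec_frontend_2 = np.array([0.8, 0.2, 0.4])
vec_social_work = np.array([-0.5, 0.9, -0.2])

print(cosine_similarity(vec_frontend_1, vec_frontend_2))   # high, near 1.0
print(cosine_similarity(vec_frontend_1, vec_social_work))  # much lower
```

This is also what most_similar() computes internally, so a plain threshold on cosine similarity is a reasonable first cut for "close vs. far" between two skill sets.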


BEST ANSWER

Without seeing your code (or at least a sketch of its major choices), it's hard to tell whether you might be making shooting-self-in-foot mistakes, like perhaps the common "managing alpha myself by following crummy online examples" issue: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?

(That your smallest number of tested epochs is 100 seems suspicious; 10-20 epochs are common values in published work, when both the size of the dataset and size of each doc are plentiful, though more passes can sometimes help with thinner data.)

Similarly, it's not completely clear from your description what your training docs are like. For example:

  • Are the tags titles and the words skills?
  • Does each text have a single tag?
  • If there are 3,200 unique tags and 30,000 unique words, is that just 3,200 TaggedDocuments, or more with repeating titles?
  • What's the average number of skill-words per TaggedDocument?

Also, if you are using word-vectors (for skills) as query vectors, you have to be sure to use a training mode that actually trains those. Some Doc2Vec modes, such as plain PV-DBOW (dm=0), don't train word-vectors at all; they will still exist, but only as randomly-initialized junk. (Either adding the non-default dbow_words=1 to add skip-gram word-training, or switching to PV-DM dm=1 mode, will ensure word-vectors are co-trained and in a comparable coordinate space.)
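Concretely, those two options show up as constructor parameters. A configuration sketch, not a full training script (hyperparameter values here are illustrative, not recommendations):

```python
import gensim

# Option 1: PV-DBOW, plus skip-gram word training via dbow_words=1,
# so skill word-vectors land in the same coordinate space as the doc-vectors.
model_dbow = gensim.models.doc2vec.Doc2Vec(
    dm=0, dbow_words=1, vector_size=50, min_count=2, epochs=20)

# Option 2: PV-DM, which co-trains word-vectors by design.
model_dm = gensim.models.doc2vec.Doc2Vec(
    dm=1, vector_size=50, min_count=2, epochs=20)
```

Either way, after training you can sanity-check the word-vectors directly (e.g. most_similar() on a well-known skill id) before trusting them as query vectors against the doc-vectors.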