Our users will give a two- to three-sentence description of their profession.
Example user A (profile description): I am a data scientist living in Berlin, I like Japanese food and I am also interested in arts.
Then they also give a description about what kind of person they are looking for.
Example user B (looking for description): I am looking for a data scientist, sales guy and an architect for my new home.
We want to match these on the basis that user A is a data scientist and user B is looking for a data scientist.
At first we required the user to hand-select the tags they want to be matched on. An example of the kind of tags we provided:
Environmental Services
Events Services
Executive Office
Facilities Services
Human Resources
Information Services
Management Consulting
Outsourcing/Offshoring
Professional Training & Coaching
Security & Investigations
Staffing & Recruiting
Supermarkets
Wholesale
Energy & Mining
Mining & Metals
Oil & Energy
Utilities
Manufacturing
Automotive
Aviation & Aerospace
Chemicals
Defense & Space
Electrical & Electronic Manufacturing
Food Production
Industrial Automation
Machinery
Japanese Food
...
This system kinda works but we have a lot of tags and want to create more 'distant' relations.
So we need:
- to know which parts are important; could we use POS tagging for this, to extract 'data science', 'Japanese food', etc.?
- and then compare the vectors of each part; e.g. 'data science' and 'statistics' is a good match, and 'Japanese food' and 'Asian food' is a good match.
- and set a threshold.
- and this should result in a more convenient way of matching, right?
It's essential to first clarify what "importance" means in this context. From the given example, it appears that matching on job title is the goal, but there could be other criteria such as location or interests. To extract the relevant phrases or entities from the text, you could use Part-of-Speech (POS) tagging, Named Entity Recognition (NER), or even relation extraction (as the OpenIE package does).
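To make the POS-based extraction concrete, here is a minimal sketch that groups runs of adjectives and nouns into candidate phrases. It assumes the tokens have already been tagged with Universal POS tags by a tagger such as spaCy or NLTK; the tags below are written by hand for illustration, and `noun_phrases` is a hypothetical helper, not part of any library.

```python
# Minimal sketch: turn POS-tagged tokens into candidate noun phrases.
# Assumption: tags are Universal POS tags produced by some tagger;
# the tagged sentence below is hand-annotated for illustration.

PHRASE_TAGS = {"ADJ", "NOUN", "PROPN"}


def _flush(run, phrases):
    """Store a finished run, dropping trailing adjectives so it ends in a noun."""
    while run and run[-1][1] == "ADJ":
        run = run[:-1]
    if run:
        phrases.append(" ".join(token for token, _ in run).lower())


def noun_phrases(tagged_tokens):
    """Return maximal runs of ADJ/NOUN/PROPN tokens as lowercase phrases."""
    phrases, current = [], []
    for token, tag in tagged_tokens:
        if tag in PHRASE_TAGS:
            current.append((token, tag))
        else:
            _flush(current, phrases)
            current = []
    _flush(current, phrases)
    return phrases


# Hand-tagged version of: "I am a data scientist living in Berlin, I like Japanese food"
TAGGED_PROFILE = [
    ("I", "PRON"), ("am", "AUX"), ("a", "DET"),
    ("data", "NOUN"), ("scientist", "NOUN"),
    ("living", "VERB"), ("in", "ADP"), ("Berlin", "PROPN"), (",", "PUNCT"),
    ("I", "PRON"), ("like", "VERB"),
    ("Japanese", "ADJ"), ("food", "NOUN"),
]

if __name__ == "__main__":
    print(noun_phrases(TAGGED_PROFILE))
```

Note that this also picks up "berlin", which is why deciding up front which kinds of phrases count as important (job title, location, interests) matters.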
The next step is to match users on the extracted phrases or entities. A common similarity measure for this is cosine similarity, but it operates on vectors, so you first need to convert each phrase into a vector representation. Word2Vec (W2V) or GloVe embeddings are a good starting point; for a multi-word phrase, a simple baseline is to average its word vectors. You may also explore contextualized models such as BERT or RoBERTa, which typically give stronger representations than static word embeddings.
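As a rough illustration of comparing phrases by cosine similarity, the sketch below averages word vectors into a phrase vector and compares two phrases. The three-dimensional vectors in `TOY_VECTORS` are made up for the example; in practice they would come from Word2Vec, GloVe, or a contextual model, and the helper names are hypothetical.

```python
import math

# Assumption: toy, hand-made 3-d vectors standing in for real embeddings.
TOY_VECTORS = {
    "data": [0.9, 0.1, 0.0],
    "science": [0.8, 0.2, 0.1],
    "statistics": [0.8, 0.1, 0.2],
    "japanese": [0.0, 0.9, 0.3],
    "asian": [0.05, 0.85, 0.35],
    "food": [0.1, 0.8, 0.4],
}


def phrase_vector(phrase, word_vectors):
    """Average the vectors of the words in the phrase that have an embedding."""
    words = [w for w in phrase.lower().split() if w in word_vectors]
    if not words:
        return None  # nothing known about this phrase
    dim = len(word_vectors[words[0]])
    return [sum(word_vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def phrase_similarity(a, b, word_vectors=TOY_VECTORS):
    """Similarity of two phrases, or 0.0 if either has no known words."""
    va, vb = phrase_vector(a, word_vectors), phrase_vector(b, word_vectors)
    if va is None or vb is None:
        return 0.0
    return cosine_similarity(va, vb)


if __name__ == "__main__":
    for a, b in [("data science", "statistics"),
                 ("japanese food", "asian food"),
                 ("data science", "japanese food")]:
        print(f"{a!r} vs {b!r}: {phrase_similarity(a, b):.3f}")
```

With these toy vectors the related pairs score much higher than the unrelated pair, which is the behaviour you would hope to see from real embeddings.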
For the threshold, a trial-and-error approach works well. Begin with a predefined similarity threshold, then adjust it based on the quality of the matches you observe in testing, ideally against a small set of pairs you have labeled by hand as good or bad matches. Iterating this way fine-tunes the matching process.
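One way to make the trial and error systematic is to score a handful of hand-labeled pairs and pick the threshold with the best F1. The sketch below assumes you already have similarity scores for such pairs; the scores and labels in `LABELED_PAIRS` are invented for illustration, and F1 is only one possible selection criterion.

```python
# Assumption: (similarity score, is_good_match) pairs labeled by hand;
# these numbers are invented for illustration.
LABELED_PAIRS = [
    (0.92, True), (0.81, True), (0.74, True), (0.66, False),
    (0.58, True), (0.41, False), (0.30, False),
]
CANDIDATE_THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def f1_at_threshold(scored_pairs, threshold):
    """F1 score when every pair with score >= threshold is called a match."""
    tp = sum(1 for score, good in scored_pairs if score >= threshold and good)
    fp = sum(1 for score, good in scored_pairs if score >= threshold and not good)
    fn = sum(1 for score, good in scored_pairs if score < threshold and good)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scored_pairs, t))


if __name__ == "__main__":
    for t in CANDIDATE_THRESHOLDS:
        print(f"threshold {t:.1f}: F1 = {f1_at_threshold(LABELED_PAIRS, t):.3f}")
    print("best:", best_threshold(LABELED_PAIRS, CANDIDATE_THRESHOLDS))
```

If false matches are more costly than missed ones for your users, you could select on precision at a minimum recall instead of F1.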