Grouping profile strings that contain the same words in a different order (Python)

I have a dataframe containing a column of profile types, which looks like this:

0                                    Android Java
1                  Software Development Developer
2                            Full-stack Developer
3                      JavaScript Frontend Design
4                          Android iOS JavaScript
5                             Ruby JavaScript PHP

I've used NLP to fuzzy match similar profiles, which returned the following similarity dataframe:

    left_side                    right_side                   similarity
7   JavaScript Frontend Design   Design JavaScript Frontend   0.849943
8   JavaScript Frontend Design   Frontend Design JavaScript   0.814599
9   JavaScript Frontend Design   JavaScript Frontend          0.808010
10  JavaScript Frontend Design   Frontend JavaScript Design   0.802881
12  Android iOS JavaScript       Android iOS Java             0.925126
15  Machine Learning Engineer    Machine Learning Developer   0.839165
21  Android Developer Developer  Android Developer            0.872646
25  Design Marketing Testing     Design Marketing             0.817195
28  Quality Assurance            Quality Assurance Developer  0.948010

While this has helped, taking me from 478 unique profiles to 461, what I want to focus on are pairs of profiles like this:

Frontend Design JavaScript  Design Frontend JavaScript

The only tool I've seen that looks like it addresses this problem is difflib. My question is: what other techniques are available to go through these profiles and standardize the ones that consist of the same words, just out of order, to one standard string? The desired output would be to take any string containing "Design", "Frontend" and "JavaScript" and replace it with "Design Frontend JavaScript".

Right now, I'm merging my original dataframe with the similarity dataframe to replace every occurrence of a right_side profile string with its left_side counterpart. The problem is that this also replaces the right_side below ("Java Python Data Science") with the left_side below ("JavaScript Python Data Science"), even though they are different profiles.

53  JavaScript Python Data Science  Java Python Data Science

Any help would be greatly appreciated!!!

EDIT*** I have the following written to replace all words occurring in both words_to_keep and the clean_talentpool['profile'] column, but this doesn't seem to be working? Would someone kindly point out what I'm not seeing? I would really appreciate it!

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
    for word in words_to_keep:
        if (sorted(word.replace(" ", ""))) == sorted(
            row.replace(" ", "")
        ) and word != row:
            row.replace(row, word)
    return row

clean_talentpool["profile"] = clean_talentpool["profile"].apply(
    lambda x: standardize_word_order(x)
)

There is 1 best solution below

In your case, I wouldn't focus on the strings but on their characters. Basically, if two strings are composed of the same characters (permuted), they match.

a = "Frontend Design JavaScript"
b = "JavaScript Frontend Design"

sorted(a) == sorted(b)
# evaluates to True

You may consider removing spaces and doing other preprocessing, such as lowercasing, so that differences like "JavaScript" vs "Javascript" don't prevent a match.

if sorted(a.lower().replace(" ", "")) == sorted(b.lower().replace(" ", "")):
    # they are the same, do something
    print("same profile")
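One caveat with comparing sorted characters: two strings built from the same letters match even when the words themselves differ. If that is a concern, a word-level key may be safer. The following is only a sketch; `word_key` is a hypothetical helper name, and the "big data" / "dig bata" pair is a made-up example chosen to show the difference.

```python
def word_key(profile: str) -> tuple:
    """Order-insensitive key: the lowercased words of the profile, sorted."""
    return tuple(sorted(profile.lower().split()))


a = "Frontend Design JavaScript"
b = "Javascript Frontend Design"

# Same words in a different order (and different casing) share a key.
same_words = word_key(a) == word_key(b)

# Made-up pair: identical letters, different words.
c = "big data"
d = "dig bata"
same_chars = sorted(c.replace(" ", "")) == sorted(d.replace(" ", ""))
same_words_cd = word_key(c) == word_key(d)

print(same_words, same_chars, same_words_cd)
```

Here the character comparison treats `c` and `d` as equal, while the word-level key keeps them apart.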

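According to your example, an implementation may be the following. Note that str.replace returns a new string rather than modifying row in place, so the result has to be returned (or assigned) instead of being discarded: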

def standardize_word_order(row):
    words_to_keep = [
        "javascript frontend design",
        "android ios javascript",
        "android developer developer",
        "android developer",
        "quality assurance",
        "quality assurance engineer",
        "architecture developer",
        "big data architecture developer",
        "data architecture developer",
        "software architecture developer",
        "javascript python data science",
        "frontend php javascript",
        "javascript android ios",
        "frontend design javascript",
        "java python data science",
        "javascript frontend android",
        ".net javascript frontend",
    ]
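    # Lowercase and strip spaces once; lower() is harmless if the column
    # is already lowercase like the entries in words_to_keep.
    key = sorted(row.lower().replace(" ", ""))
    for word in words_to_keep:
        # Same characters in any order: return the standard spelling.
        if sorted(word.replace(" ", "")) == key:
            return word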
    return row

# Apply the function to each profile string, not to the whole Series.
clean_talentpool["profile"] = clean_talentpool["profile"].apply(standardize_word_order)
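If maintaining a hand-written list of standard strings becomes tedious, another option is to rewrite every profile into one canonical word order, for instance alphabetical. This is a minimal sketch, not the approach used above: `canonical_profile` is a hypothetical helper, and the sample strings are taken from the question.

```python
def canonical_profile(profile: str) -> str:
    """Return the profile's words in case-insensitive alphabetical order."""
    return " ".join(sorted(profile.split(), key=str.lower))


profiles = [
    "Frontend Design JavaScript",
    "Design JavaScript Frontend",
    "JavaScript Frontend Design",
    "Java Python Data Science",
    "JavaScript Python Data Science",
]

standardized = [canonical_profile(p) for p in profiles]
for before, after in zip(profiles, standardized):
    print(f"{before!r} -> {after!r}")

# With a dataframe, the same helper could be applied per value, e.g.:
# clean_talentpool["profile"] = clean_talentpool["profile"].map(canonical_profile)
```

The three "Design / Frontend / JavaScript" variants collapse to "Design Frontend JavaScript", while the Java and JavaScript profiles stay distinct because whole words are compared. The trade-off is that multi-word terms such as "Data Science" get reordered too, so the result is a grouping key more than a display string.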