Using stringdist in R with a big dataset (1.8 million rows)?


I'm working with a dataset (df) which contains a column called job, where people just enter their job position.

The problem is that the data is typed in manually, so it contains a lot of misspellings. To do some calculations grouped by job, I'm trying to create a column called group that puts jobs with similar strings together. For example:

Job          Jobgroup
Bartender    Bartender
Barttender   Bartender
Batendere    Bartender
Engineer     Engineer
Enginer      Engineer

The jobgroup will be created based on a string distance method (the jw method, to be specific). I tried two approaches, both of which give me roughly the desired results. The first is running a loop as follows:

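library(stringdist)
# Assumes df$group was initialised to "nogroup" for every row.
for (i in seq_len(nrow(df))) {
  # An unassigned row starts a new group named after its own job string
  # (assumed intent; the original condition was left incomplete).
  if (df$group[i] == "nogroup") df$group[i] <- df$job[i]
  # i:nrow(df) visits rows i..n; seq(i:nrow(df)) would restart at 1.
  for (j in i:nrow(df)) {
    if (df$group[j] == "nogroup") {
      # Jaro-Winkler distance below 0.10 counts as "same job".
      if (stringdist(df$job[i], df$job[j], method = "jw") < 0.10) {
        df$group[j] <- df$group[i]
      }
    }
  }
}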

The second is hierarchical clustering on string distances with the hclust() function. The first step there is to create a distance matrix, which won't work with 1.8 million rows. The problem is that my dataset contains around 1.8 million rows, so neither of the two approaches above finishes even after hours.

So I'm looking for any ideas, suggestions or experience that could help me.


There is 1 best solution below


Comparing each job position with every other job position would be too slow without parallelization or optimized software such as Elasticsearch.
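To put a rough number on that, here is a back-of-the-envelope sketch. It assumes all 1.8 million rows are compared pairwise and that each distance is stored as an 8-byte double; the variable names are only illustrative.

```r
n <- 1.8e6                      # number of rows in the question

# Number of distinct pairs that a full distance matrix has to hold.
n_pairs <- n * (n - 1) / 2

# Memory needed if every distance is stored as an 8-byte double.
bytes <- n_pairs * 8

n_pairs       # roughly 1.62e12 pairwise distances
bytes / 1e12  # roughly 13 TB
```

If many rows share the same spelling, the real number of distinct strings may be far smaller than 1.8 million, so it is worth checking length(unique(df$job)) before anything else.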

Maybe you could try one of the following three approaches:

  1. If the number of groups is less than 100, you can define the groups by hand and compute the distance between the groups and each job position.

  2. As the job positions are more or less clustered in a space (the same assumption you made when you decided to use hclust), you can count the occurrences of each letter in each job position and compare these counts to get an approximation of the groups, which may be accurate enough.

  3. If you mix the first two, you can start by defining one or two job positions, calculate the distance between these and each job position, and so find the other group members. By repeatedly defining new groups for the job positions that are not yet assigned, you can find the groups iteratively.
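The first approach could look roughly like this. It is a minimal sketch, not a tuned solution: the example vector jobs stands in for df$job, the hand-made groups vector is hypothetical, and the 0.10 threshold is taken from the loop in the question. amatch() from stringdist returns the index of the closest entry in a table, or NA if nothing is within maxDist.

```r
library(stringdist)

# Hypothetical example data; in practice this would be df$job.
jobs <- c("Bartender", "Barttender", "Batendere", "Engineer", "Enginer")

# Hand-defined group labels (an assumption made for this sketch).
groups <- c("Bartender", "Engineer")

# Only distinct spellings need to be matched, not every row.
uniq <- unique(jobs)

# Index of the closest group for each spelling; NA when no group
# is within a Jaro-Winkler distance of 0.10.
idx <- amatch(uniq, groups, method = "jw", maxDist = 0.10)

# Map every original row back to its group through a named lookup.
lookup   <- setNames(groups[idx], uniq)
jobgroup <- unname(lookup[jobs])
```

The cost is (number of distinct spellings) x (number of groups) distances instead of all pairs. Spellings that end up as NA are the ones that need a new group or a looser threshold.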
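The iterative idea in the third approach can be sketched as follows. Assumptions for this sketch: jobs stands in for df$job and contains no NA values, the first unassigned spelling is used as the seed of each new group (the answer does not say how to pick it), and the 0.10 jw threshold comes from the question.

```r
library(stringdist)

# Hypothetical example data; in practice this would be df$job (no NAs).
jobs <- c("Bartender", "Barttender", "Batendere", "Engineer", "Enginer")

# Work on distinct spellings only.
uniq  <- unique(jobs)
group <- rep(NA_character_, length(uniq))

while (anyNA(group)) {
  open <- which(is.na(group))        # spellings without a group yet
  seed <- uniq[open[1]]              # next unassigned spelling becomes a seed
  # One vectorised call: distance from the seed to every open spelling.
  d <- stringdist(seed, uniq[open], method = "jw")
  # The seed itself has distance 0, so each pass assigns at least one entry.
  group[open[d < 0.10]] <- seed
}

# Map every original row back to its group.
jobgroup <- unname(setNames(group, uniq)[jobs])
```

Each pass is linear in the number of still-unassigned spellings, and the number of passes equals the number of groups found. The result depends on which spelling happens to become the seed, so ordering uniq by frequency first might give more sensible group names.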