I previously asked a question here about how to use R to automatically "spellcheck" a big list of department names before I export a file and send it off. (The same data can be used as a reproducible example.)
The solution of using Fuzzy Join worked perfectly, and 99% of the time it's exactly what I need. Here's an example of when it works great:
As you can see, it needs to look like Hematology/Oncology and it previously looked like Hematology Oncology. Fuzzy Join figured it out great.
The problem comes when one of the inputs is just too far off and Fuzzy Join can't figure it out (I apologize, this example wasn't in the reproducible data; it's from my real data):
In this example, Fuzzy join just couldn't figure it out and suggested "Sleep lab" when someone wrote "IS".
Due to the nature of my real data, there's going to be a lot of this. So my question is:
Either WHEN Fuzzy join does the joining in this code:
library(dplyr)
library(fuzzyjoin)

final_df <- stringdist_join(df, df2,
                            by = "ManagementGroup",
                            mode = "left",
                            ignore_case = FALSE,
                            method = "jw",
                            max_dist = 99,
                            distance_col = "dist") %>%
  group_by(ManagementGroup.x) %>%
  # Keep only the closest match (smallest dist) for each input name.
  slice_min(order_by = dist, n = 1) %>%
  distinct()
or afterwards, before I export it using:
write_csv(final_df, "finaldf.csv")
Can I have R warn me that there were matches over a certain threshold of "dist", filter those rows out of the results, and put them into a separate data frame? A plain warning would be enough, but ideally it would be an audible warning using "beepr" or something similar.
My end goal is that R will automatically handle 99% of the cases and I might have to manually fix one or two that were just too badly misspelled, but I'll be warned that I need to do that.
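
To make the desired behavior concrete, here is a minimal sketch of the kind of post-join check described above. It is only an illustration under stated assumptions: the toy `final_df` stands in for the real join result, the `dist` values in it are made up, and `dist_cutoff`, `needs_review`, and `final_clean` are hypothetical names, not anything from the existing code.

```r
library(dplyr)

# Toy stand-in for the joined result; the dist values are invented for illustration.
final_df <- tibble(
  ManagementGroup.x = c("Hematology Oncology", "IS"),
  ManagementGroup.y = c("Hematology/Oncology", "Sleep lab"),
  dist = c(0.05, 0.60)
)

# Hypothetical cutoff; it would need tuning against the real data.
dist_cutoff <- 0.2

# Rows with no match, or whose best match is still too far away, are set aside.
needs_review <- final_df %>% filter(is.na(dist) | dist > dist_cutoff)

# Everything else is treated as a trustworthy match and kept for export.
final_clean <- final_df %>% filter(!is.na(dist) & dist <= dist_cutoff)

if (nrow(needs_review) > 0) {
  warning(nrow(needs_review), " row(s) had dist > ", dist_cutoff,
          " and were moved to needs_review", call. = FALSE)
  # Optional audible alert, only attempted if beepr is installed.
  if (requireNamespace("beepr", quietly = TRUE)) beepr::beep()
}
```

With something like this, `final_clean` would be the data frame passed to `write_csv()`, and `needs_review` would hold the handful of rows to fix by hand. Since Jaro-Winkler distances fall between 0 and 1, the cutoff would need to sit somewhere in that range.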