I have two tables with ~70,000 rows each. Both tables have a column "title". I need to compare the two tables and find their intersection by the title column. I tried JOIN and UNION, but the titles can differ slightly. For example, one table may contain "New-York, USA" while the other has "New York Usa". I googled it; this is called "fuzzy string searching".
I already started with PHP and similar_text, but it's very slow... I think I should use something else for this task, maybe R. I have already pushed the data into BigQuery, but BigQuery only supports REGEXP for searching in a WHERE clause, or at least I can't understand how it should be used.
Can R solve my speed problem?
Thanks!
Example of dataset1:
new-york, usa|100|5000
dataset2:
newyork usa|50|1000
nnNew-York |10|500
Example of desired output:
New-York, Usa|160|6500
In other words, I need to create a new table that combines the data from both tables.
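For illustration, here is a minimal sketch of the merge (in Python rather than PHP or R, purely to show the idea): titles are normalized, fuzzy-matched with the standard-library `difflib`, and the numeric columns of matches are summed. The 0.7 similarity threshold is an assumption you would tune against your real data:

```python
import re
from difflib import SequenceMatcher

def normalize(title):
    # Lowercase and drop everything except letters and digits,
    # so "New-York, USA" and "New York Usa" normalize identically.
    return re.sub(r"[^a-z0-9]", "", title.lower())

def similar(a, b, threshold=0.7):
    # Ratio of matching characters after normalization;
    # 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Toy rows from the question: (title, col1, col2)
dataset1 = [("new-york, usa", 100, 5000)]
dataset2 = [("newyork usa", 50, 1000), ("nnNew-York ", 10, 500)]

merged = []
for t1, a1, b1 in dataset1:
    a, b = a1, b1
    for t2, a2, b2 in dataset2:
        if similar(t1, t2):
            a += a2
            b += b2
    merged.append((t1, a, b))

# merged -> [("new-york, usa", 160, 6500)]
```

This is still an all-pairs scan, so it only demonstrates the matching logic, not a solution to the speed problem.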
UPDATED
Thanks for your answers. I tried R and agrep; it works, but very slowly: 2,000 rows in 40 minutes, and I have 190,000 rows in total. Is that normal?
The answer to your question is "Levenshtein distance". However, with 70,000 rows in each table, a naive approach requires approximately 70,000 * 70,000 comparisons, about 4.9 billion. That is a lot.
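For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal Python sketch of the standard dynamic-programming recurrence (shown in Python only for illustration; R and MySQL have their own implementations):

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming over the edit-distance table,
    # keeping only the previous row to stay O(len(b)) in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "new-york, usa" -> "newyork usa" needs two deletions ('-' and ','),
# so the distance is 2.
```

Each call is O(len(a) * len(b)), which is exactly why running it on all pairs is so expensive.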
Doing the work in R may be your best approach, because R will keep all the data in memory and probably be more efficient than an implementation in MySQL.
There are ways to short-circuit the searching. One method, for instance, is to divide each string into n-grams (trigrams are typical) and use these to reduce the search space. After all, "New York City" and "Dallas" have no letters in common, so no comparison really needs to be done.
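A minimal Python sketch of this trigram "blocking" idea, using the toy titles from the question (the inverted-index structure here is just one way to organize it):

```python
from collections import defaultdict

def trigrams(s):
    # Set of all 3-character substrings of the lowercased title.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

titles1 = ["new-york, usa"]
titles2 = ["newyork usa", "nnNew-York ", "dallas"]

# Index dataset2 titles by their trigrams.
index = defaultdict(set)
for t in titles2:
    for g in trigrams(t):
        index[g].add(t)

# For each dataset1 title, only titles sharing at least one trigram
# become candidates for the expensive Levenshtein comparison.
for t1 in titles1:
    candidates = set()
    for g in trigrams(t1):
        candidates |= index[g]
    # "dallas" shares no trigram with "new-york, usa",
    # so it never reaches the distance computation at all.
```

The full distance is then computed only within each candidate set, which is usually a tiny fraction of the 4.9 billion raw pairs.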
There are probably routines in R to handle trigrams. They are pretty easy to implement in MySQL, but not built in.