Fuzzy-Score batch of records against existing table


I'm trying to create a mechanism by which I can score each record in a data set based on its similarity to existing records in my database. My objective is to identify duplicates, or potential/likely duplicates, in my new list by giving each record a score that indicates its similarity to a record that already exists in my table.

The fields I will be using to identify likely matches are CompanyName, AddressLine1, Postcode, TelephoneNumber.

I've done some research and found a number of methods that all work well on small numbers of records (including various edit-distance methods, trigram comparisons, and Double Metaphone scoring), but I'm struggling with how to implement these, or similar methods, on larger batches of records. The algorithms I've found are all scalar functions, which may be the root of my problem. Unfortunately, my expertise in this area is limited, so I don't really know where to go from here.

To give an idea on numbers: I have c.120,000 records in my table and around 25,000 that need to be scored.

I'd be grateful for any pointers for further research, and for example methods showing how I can approach this task and come up with a solution that doesn't take hours to execute!

Please let me know if more detail is required here.


I've done this several times so I can give you some pointers.

The key point is that a full comparison, checking every pair of records field by field using e.g. Levenshtein distance, is computationally expensive. Therefore you need to avoid doing it as much as possible.

For example, matching 25,000 new records against 120,000 existing ones means 120,000 × 25,000 = 3 billion comparisons. Even at 1ms per comparison, that's 3 million seconds, roughly 35 days, so still going to take a month.

So the first thing to do is look at your actual requirements. Since you haven't given any, I'll make them up.

Requirements

Report all pairs where at least two of the given fields are a close match.

Close match is defined by:

  • Telephone: Similarity is the number of digits in common at the same positions. Anything less than the area code plus the first four digits is a fail (see the sketch after this list).

  • Postcode: Similarity is the number of leading characters in common. Fewer than 3 is a fail.

  • Address1: Similarity is the number of words in common.

  • Company name: Similarity is Levenshtein distance.
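
To make the telephone metric concrete, here is a minimal T-SQL sketch; the function name is hypothetical, and it assumes both numbers have already been normalised to digits only:

```sql
-- Hypothetical scalar function: counts digits that agree at the same
-- position in two digits-only telephone numbers.
CREATE FUNCTION dbo.PhoneSimilarity (@a VARCHAR(20), @b VARCHAR(20))
RETURNS INT
AS
BEGIN
    DECLARE @i INT = 1, @score INT = 0;
    WHILE @i <= LEN(@a) AND @i <= LEN(@b)
    BEGIN
        IF SUBSTRING(@a, @i, 1) = SUBSTRING(@b, @i, 1)
            SET @score += 1;
        SET @i += 1;
    END;
    RETURN @score;
END;
```

Note that row-by-row scalar functions like this are exactly what you can't afford to run across every pair; they belong in the narrow phase only, as described below.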

General Approach

So the trick is to avoid as much work as possible. Break each matching test (where possible) into a cheap "broad phase" and a full "narrow phase": the broad phase discards as many possibilities as it can without doing too much work, and only those which pass it go on to the narrow phase.

  • The most expensive check is the company name, so only do that if at least one other field matches.

  • Postcode matching is based on leading characters, so we can use an index for this.

  • Telephone numbers can be pre-processed to strip the extension, if any, and extract the area code plus number. Once that's done, an index can be used (see the sketch below).
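
For instance, a minimal sketch of that normalisation in T-SQL (the table and column names are assumptions; stripping extensions such as "x123" would need extra string handling, omitted here):

```sql
-- Digits-only phone column, persisted and indexed. Apply the same to
-- the incoming batch table.
ALTER TABLE dbo.ExistingRecords ADD PhoneNormalised AS
    REPLACE(REPLACE(REPLACE(REPLACE(
        TelephoneNumber, ' ', ''), '-', ''), '(', ''), ')', '')
    PERSISTED;

CREATE INDEX IX_ExistingRecords_Phone
    ON dbo.ExistingRecords (PhoneNormalised);
```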

Create candidate tables for each test

So create a table of "candidate telephone matches" holding the PK of each pair where the area code plus the first four digits match. This needn't compute the full similarity score; it should be a test which is guaranteed to pass whenever the full similarity test passes, but which discards a high proportion of pairs that would fail.
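
A sketch of that broad-phase pass, assuming the normalised column above, hypothetical NewRecords/ExistingRecords tables, and a five-digit area code (nine leading digits in total):

```sql
-- Broad phase: keep only the pairs whose area code + first four digits
-- agree. For best performance the nine-digit prefix itself could be
-- persisted and indexed, like the normalised column above.
SELECT n.Id AS NewId, e.Id AS ExistingId
INTO   #CandidatePhone
FROM   dbo.NewRecords      n
JOIN   dbo.ExistingRecords e
  ON   LEFT(n.PhoneNormalised, 9) = LEFT(e.PhoneNormalised, 9);
```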

Do the same for the postcode.

For Address ("words in common") you need to preprocess to create an "Address words" table, which you can then query to get candidates.
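
A sketch using STRING_SPLIT (SQL Server 2016+; on older versions you would need a home-grown splitter). Dropping very short tokens is a guess at filtering out noise words:

```sql
-- One row per (record, word) for the existing data.
SELECT e.Id AS ExistingId, w.value AS Word
INTO   #AddressWords
FROM   dbo.ExistingRecords e
CROSS APPLY STRING_SPLIT(e.AddressLine1, ' ') AS w
WHERE  LEN(w.value) > 2;

CREATE INDEX IX_AddressWords_Word ON #AddressWords (Word);

-- Candidate pairs: new and existing records sharing at least one word.
SELECT DISTINCT n.Id AS NewId, a.ExistingId
INTO   #CandidateAddress
FROM   dbo.NewRecords n
CROSS APPLY STRING_SPLIT(n.AddressLine1, ' ') AS w
JOIN   #AddressWords a ON a.Word = w.value
WHERE  LEN(w.value) > 2;
```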

Once you have three candidate tables, you can disregard any input record which doesn't appear in at least one (since you require two matches).

The final candidate table can be populated with the union of the pairs from the previous three, and the expensive Levenshtein distance can be calculated only on those pairs.
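
A sketch of that union, assuming a #CandidatePostcode table built the same way as the other two; the flags record which broad-phase tests each pair passed:

```sql
SELECT NewId, ExistingId,
       MAX(PhoneHit)    AS PhoneHit,
       MAX(PostcodeHit) AS PostcodeHit,
       MAX(AddressHit)  AS AddressHit
INTO   #Candidates
FROM (
    SELECT NewId, ExistingId,
           1 AS PhoneHit, 0 AS PostcodeHit, 0 AS AddressHit
    FROM   #CandidatePhone
    UNION ALL
    SELECT NewId, ExistingId, 0, 1, 0 FROM #CandidatePostcode
    UNION ALL
    SELECT NewId, ExistingId, 0, 0, 1 FROM #CandidateAddress
) AS u
GROUP BY NewId, ExistingId;
```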

Put it all together

Populate your final candidate table with all the rows which appear in the other tables, together with a flag to say which tests were met. You can now perform the full narrow-phase test on each matching dimension.
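
As a final sketch, with dbo.Levenshtein standing in for whichever edit-distance implementation you already have (every pair in #Candidates has passed at least one broad-phase test, so the company name can supply the required second match):

```sql
-- Narrow phase: run the full similarity tests, including the expensive
-- edit distance, only on the surviving pairs. Final reporting then
-- applies the "at least two close matches" rule to the full scores.
SELECT c.NewId, c.ExistingId,
       c.PhoneHit, c.PostcodeHit, c.AddressHit,
       dbo.Levenshtein(n.CompanyName, e.CompanyName) AS NameDistance
FROM   #Candidates c
JOIN   dbo.NewRecords      n ON n.Id = c.NewId
JOIN   dbo.ExistingRecords e ON e.Id = c.ExistingId;
```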

In this way you should be able to get that one-month timescale down to a couple of hours, or even a few minutes.