Phonetic Algorithms for Postgresql

624 Views Asked by jfbaro At 10 November 2021 at 18:17

please, I am working on a PoC for Person Real-time Identification, and one of the critical aspects of it is to support both minor misspelling and phonetic variations of First, Middle, and Last name. Like HarinGton == HarrinBton or RaphEAl == RafAEl. It's working for longer names, but it's a bit more imprecise for names like Lee and John.

I am using Double Metaphone through dmetaphone() and dmetaphone_alt() in PostgreSQL 13.3 (Supabase.io). And although I appreciate Double Metaphone it has a (too?) short string as the outcome. metaphone() has parameters to make the resulting phonetic representation longer. I investigated dmetaphone() and couldn't find anything other than the default function.

Is there a way of making dmetaphone() and dmetaphone_alt() return a longer phonetic representation similar to metaphone()'s, but with a ALT variation?.

Any help would be much appreciated.

Thanks

Original Q&A

There are 1 best solutions below

lys On 18 December 2021 at 09:06 BEST ANSWER

Looking at the postgres docs for these features you don't have parametric control over the length of the encoded string for Double Metaphone. In the case of single Metaphone, you can only truncate the output string:

max_output_length sets the maximum length of the output metaphone code; if longer, the output is truncated to this length.

However you may get much better results by using Trigram Similarity or Levenshtein Distance on the encoded output from either of the metaphone methods - this can be a more powerful way to handle phonetic permutations using Metaphones.

Example

Consider all the spelling permutations possible for the artist Cyndi Lauper, using double metaphone with trigram similarity we can achieve 100% similarity between the incorrect string cindy lorper and the correct spelling:

SELECT similarity(dmetaphone('cindy lorper'), dmetaphone('cyndi lauper'));

yields: similarity real: 1 (ie: 100% similarity)

Which means the encodings are identical for both input strings using Double Metaphone. When using Metaphone, they're slightly different. All of the following yield SNTLRPR

SELECT metaphone('cyndy lorper',10);
SELECT metaphone('sinday lorper', 10);
SELECT metaphone('cinday laurper', 10);

SELECT metaphone('cyndi lauper',10);

yields: SNTLPR which is only one character different to SNTLRPR

You can also use Levenshtein Distance to calculate it, which gives you a filterable parameter to work with:

SELECT levenshtein(metaphone('sinday lorper', 10), metaphone('cyndi lauper', 10));

yields: levenshtein integer: 1

It's working for longer names, but it's a bit more imprecise for names like Lee and John.

It's a bit difficult to see exactly what you're having trouble with - without a more complete reprex.

SELECT similarity(dmetaphone('lee'), dmetaphone('leigh'));
SELECT similarity(dmetaphone('jon'), dmetaphone('john'));

both yield: similarity real: 1 (ie: 100% similarity)

Edit: here's a easy to follow guide for fuzzy matching with postgres

Phonetic Algorithms for Postgresql

There are 1 best solutions below

Example

Related Questions in POSTGRESQL

Related Questions in PLPGSQL

Related Questions in PHONETICS

Related Questions in SUPABASE

Trending Questions

Popular # Hahtags

Popular Questions