Converting Similar names to canonical form in SQL

260 Views Asked by At

I have a database of books. There can be situations where an author can use variations of their names as author.

For eg. Albert Einstein can be written as Albert E. or A. Einstein

Now I need to create a method such that SQL database and tell me the ID of the author based on such situations.

Is there a built in way to convert these multiple variations of a name into a single type within SQL language such as PostgreSQL?

Or Should I change the name using some another language such as C, C++ or Python to convert the name into a particular form and then send to database?

I would rather have a method within SQL instead of using some another language

3

There are 3 best solutions below

2
Laurenz Albe On

You could use PostgreSQL full text search with a thesaurus dictionary. The configuration file would then contain

Albert E. : Albert Einstein
A. Einstein : Albert Einstein
0
JohannesB On

Wikipedia says:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.[1] The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL,[2] MySQL,[3] SQLite,[4] Ingres, MS SQL Server,[5] Oracle.[6] and SAP ASE.[7]) Improvements to Soundex are the basis for many modern phonetic algorithms

https://en.wikipedia.org/wiki/Soundex

0
Luuk On

When doing something like this:

select 
   id, 
   (Author) as Author,
   sum(case when length(name)>2 
            then 1 
            else 0 end) score
from 
   books 
   left join unnest(pg_catalog.string_to_array('Albert Einstein', ' ')) name
         on Author like '%'||name||'%'
group by id
order by score DESC

The unnest() returns the parts Albert and Einstein from the name Albert Einstein, and a score s counted. Higher the scores are most likely a match.

It returns:

id author score
1 Albert Einstein 2
2 A. Einstein 1
3 Albert E. 1
4 Anne Ellit 0
6 Arthur Eddington 0
5 Allan W. Eckert 0

For complete SQL, see DBFIDDLE

There is, of course, room for improvement in this script because an author like 'Albert Camus' or 'Siegfried Einstein' also get a score of 1 in this code.