I have a db table persons in which only persons details are captured. Say Name, Father Name, Email, DOB, Proof of Address, Proof of Identity ,Pincode etc. and I have an address table in which address of the persons are stored say Address, Pincode, City, Province. Persons and Address have a foreign key relationship with column addressId(persons) and Id(address).
NOTE: Using Postgres
Now what I have to do say I am given with some proof of address id and I have to find how much such persons with same POA or POI are there and with different id if somewhere Name, Father Name and Address (all three matches).
Contacts Schema
Address Schema
| ID | Address | Pincode | City | Province |
|---|---|---|---|---|
| ad_1 | mini market 850 vijay street | 110056 | XYZ | ABC |
| ad_2 | 45 street south coast near Bank | 285006 | MNO | DEF |
| ad_3 | 45 street south coast near Bank | 285006 | MNO | DEF |
E.g Input POA Id = poi_1
considering input poa id, we can do direct string matching query on DB.
select * from persons where poi_id = 'poi_1' or poa_id = 'poi_1';
The result is person records with Id 1,2. Now if you observe person 2 has same poi/poa id but address is different. Now we have to consider this address and do dedupe based on this address based on which person with id 3 comes in response. (Name,Fathername and address all 3 must match)
I need an optimised way to resolve the same. If I do like query on name and father name only, It will do a full db scan which is not feasible at all. Same goes for address like query.
My thought:
select * from addressTable where address like = %input address% and pincode = '285006';
I haven't thought how should i proceed bcz everytime I stuck on what basis I should do a query so that it is optimised.
**Note:**
indexes mentioned below
"person_new_dob_idx" btree (dob)
"person_new_email_idx" btree (email)
"person_fst_name_idx" btree (name)
"person_poa_idx" btree (poa_id)
"person_new_poi_idx" btree (poi_id)
"person_addr_id_idx" btree (addr_id)
"person_dob_primary_pincode_idx" btree (dob, primary_pincode)
Postgres version:-- 13.4
Points to consider:
- In the mentioned example pincode is different, but there might be possibility of different address within same pincode.
- Do care for false positives and true negatives.
- Address words might be shuffled or written same address in different way.

I'll just start by leaving these here for reference in case, my answer didn't hit home.
As far as I understand, you want to query your database (in an efficient manner) to find persons with the same proof of address or proof of identity, and then deduplicate the results based on the following triplet (name, father's name, and address).
Based on that I propose the following query
Join the
personsandaddresstables on theaddressIdandIdcolumns will allow you to access both thepersonandaddressinformation in a single query (straightforward). For address ordering, you can string similarity function such aslevenshtein()(what I used here) orpg_trgm()to compare the name, father's name, and address fields. This will allow you to find records that are similar but not necessarily identical.Results:
You can (and probably should) adjust the similarity thresholds and test different string similarity functions to fine-tune the results. You may also want to consider using an index on the
name,father_name, andaddresscolumns to speed up the query.