Criss Cross Address Deduplication

Question

Criss Cross Address Deduplication

86 Views Asked by Akhilesh mahajan At 13 August 2023 at 04:15

I have a db table persons in which only persons details are captured. Say Name, Father Name, Email, DOB, Proof of Address, Proof of Identity ,Pincode etc. and I have an address table in which address of the persons are stored say Address, Pincode, City, Province. Persons and Address have a foreign key relationship with column addressId(persons) and Id(address).

NOTE: Using Postgres

Now what I have to do say I am given with some proof of address id and I have to find how much such persons with same POA or POI are there and with different id if somewhere Name, Father Name and Address (all three matches).

Contacts Schema

Address Schema

ID	Address	Pincode	City	Province
ad_1	mini market 850 vijay street	110056	XYZ	ABC
ad_2	45 street south coast near Bank	285006	MNO	DEF
ad_3	45 street south coast near Bank	285006	MNO	DEF

E.g Input POA Id = poi_1

considering input poa id, we can do direct string matching query on DB.

select * from persons where poi_id = 'poi_1' or poa_id = 'poi_1';

The result is person records with Id 1,2. Now if you observe person 2 has same poi/poa id but address is different. Now we have to consider this address and do dedupe based on this address based on which person with id 3 comes in response. (Name,Fathername and address all 3 must match)

I need an optimised way to resolve the same. If I do like query on name and father name only, It will do a full db scan which is not feasible at all. Same goes for address like query.

My thought:

select * from addressTable where address like = %input address% and pincode = '285006';

I haven't thought how should i proceed bcz everytime I stuck on what basis I should do a query so that it is optimised.

**Note:**

indexes mentioned below 
"person_new_dob_idx" btree (dob) 
"person_new_email_idx" btree (email) 
"person_fst_name_idx" btree (name) 
"person_poa_idx" btree (poa_id) 
"person_new_poi_idx" btree (poi_id) 
"person_addr_id_idx" btree (addr_id) 
"person_dob_primary_pincode_idx" btree (dob, primary_pincode) 

Postgres version:-- 13.4

Points to consider:

In the mentioned example pincode is different, but there might be possibility of different address within same pincode.
Do care for false positives and true negatives.
Address words might be shuffled or written same address in different way.

Original Q&A

There are 1 best solutions below

**PGzlan** · Answer 1 · 2023-08-13T09:51:42.000000

I'll just start by leaving these here for reference in case, my answer didn't hit home.

-- Create Address Table
CREATE TABLE address (
    Id SERIAL PRIMARY KEY,
    Address TEXT NOT NULL,
    Pincode TEXT NOT NULL,
    City TEXT NOT NULL,
    Province TEXT NOT NULL
);

-- Create Person Table
CREATE TABLE persons (
    Id SERIAL PRIMARY KEY,
    Name TEXT NOT NULL,
    Father_Name TEXT NOT NULL,
    Email TEXT NOT NULL,
    DOB DATE NOT NULL,
    POA_Id TEXT NOT NULL,
    POI_Id TEXT NOT NULL,
    Pincode TEXT NOT NULL,
    AddressId INTEGER REFERENCES address(Id)
);

-- Insert statements for data provided
INSERT INTO address (Address, Pincode, City, Province)
VALUES
    ('mini market 850 vijay street', '110056', 'XYZ', 'ABC'),
    ('45 street south coast near Bank', '285006', 'MNO', 'DEF'),
    ('45 street south coast near Bank', '285006', 'MNO', 'DEF');

INSERT INTO persons (Name, Father_Name, Email, DOB, POA_Id, POI_Id, Pincode, AddressId)
VALUES
    ('PN1', 'FN1', '[email protected]', '2000-01-01', 'poi_1', 'poa_2', '110056', 1),
    ('PN1', 'FN1', '[email protected]', '2000-01-01', 'poi_1', 'poi_1', '285006', 2),
    ('PN1', 'FN1', '[email protected]', '2000-01-01', 'poi_3', 'poa_3', '285006', 3);

As far as I understand, you want to query your database (in an efficient manner) to find persons with the same proof of address or proof of identity, and then deduplicate the results based on the following triplet (name, father's name, and address).

Based on that I propose the following query

-- Import fuzzy matching to add some flexibility to matching addresses
-- (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html)
CREATE EXTENSION fuzzystrmatch;

-- Proposed query
SELECT p1.*
FROM persons p1
JOIN address a1 ON p1.addressId = a1.Id
WHERE (p1.poi_id = 'poi_1' OR p1.poa_id = 'poi_1')
AND EXISTS (
    SELECT 1
    FROM persons p2
    JOIN address a2 ON p2.addressId = a2.Id
    AND p1.Id != p2.Id
    AND levenshtein(p1.name, p2.name) <= 3
    AND levenshtein(p1.father_name, p2.father_name) <= 3
    AND levenshtein(a1.address, a2.address) <= 10
);

Join the persons and address tables on the addressId and Id columns will allow you to access both the person and address information in a single query (straightforward). For address ordering, you can string similarity function such as levenshtein() (what I used here) or pg_trgm() to compare the name, father's name, and address fields. This will allow you to find records that are similar but not necessarily identical.

Results:

You can (and probably should) adjust the similarity thresholds and test different string similarity functions to fine-tune the results. You may also want to consider using an index on the name, father_name, and address columns to speed up the query.

Criss Cross Address Deduplication

There are 1 best solutions below

Related Questions in STRING

Related Questions in POSTGRESQL

Related Questions in GO

Related Questions in PYTHON-DEDUPE

Related Questions in GO-GIN

Trending Questions

Popular # Hahtags

Popular Questions