I am working on a record linkage problem and am applying an unsupervised algorithm, since I do not have external labels.
I have applied the ECM algorithm. The code I used is:
import recordlinkage
indexer = recordlinkage.BlockIndex(on=['FirstName_CD','LastName_CD'])
pairs = indexer.index(data1, data2)
compare_cl = recordlinkage.Compare()
compare_cl.string('FirstName_CD', 'FirstName_CD', method='jarowinkler', threshold=0.50, label='given_name')
compare_cl.string('LastName_CD', 'LastName_CD', method='jarowinkler', threshold=0.50, label='surname')
compare_cl.exact('Date.Of.Birth_CD', 'Date.Of.Birth_CD', label='date_of_birth')
compare_cl.exact('Gender_CD', 'Gender_CD', label='gender')
compare_cl.exact('Profession_CD', 'Profession_CD', label='profession')
compare_cl.string('Address_CD', 'Address_CD', threshold=0.85, label='address_1')
features = compare_cl.compute(pairs, data1, data2)
ecm = recordlinkage.ECMClassifier()
result_ecm = ecm.learn(features)
This returns a MultiIndex. What can I infer from it, and how do I get the match/non-match information?
To get the ECM classifier to work on the comparison vectors (the 'features'), fit the model only on the columns that have more than one unique value.
Here is the Python code:
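A minimal sketch of that filtering step, using only pandas so that it runs on its own. The toy `features` frame stands in for the output of `compare_cl.compute(...)`, and the names `informative`, `keep_cols`, and the placeholder `matches` are illustrative assumptions, not part of the original code. Note that blocking on `FirstName_CD` and `LastName_CD` makes the `given_name` and `surname` comparisons agree for every candidate pair, which is one way constant columns can arise.

```python
import pandas as pd

# Toy comparison vectors standing in for the output of compare_cl.compute(...).
# The index is a MultiIndex of candidate record pairs; values are 0/1 agreement flags.
pairs = pd.MultiIndex.from_tuples(
    [(0, 10), (1, 11), (2, 12), (3, 13)], names=["id_1", "id_2"]
)
features = pd.DataFrame(
    {
        "given_name": [1, 1, 1, 1],     # constant: every blocked pair agrees
        "surname": [1, 1, 1, 1],        # constant for the same reason
        "date_of_birth": [1, 0, 1, 0],
        "gender": [1, 1, 0, 1],
    },
    index=pairs,
)

# Keep only the columns that take more than one distinct value.
informative = features.loc[:, features.nunique() > 1]
keep_cols = informative.columns.tolist()
print(keep_cols)

# With recordlinkage installed, the classifier would then be fitted on the
# filtered frame, e.g. in recent versions (older versions used `learn`):
#   matches = recordlinkage.ECMClassifier().fit_predict(informative)
# The returned MultiIndex lists the record pairs classified as matches.
# Placeholder result, used here only to illustrate the next step:
matches = pd.MultiIndex.from_tuples([(0, 10)], names=["id_1", "id_2"])

# Every candidate pair that is not in `matches` is a non-match.
non_matches = informative.index.difference(matches)
print(non_matches.tolist())
```

The MultiIndex returned by the classifier is therefore the list of matched pairs; the non-matches are the remaining candidate pairs, and `features.loc[matches]` would show the comparison vectors behind each match.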