Solve Record Linkage as a Constraint Satisfaction Problem with Machine Learning

I have pairs of sets such as

A = { L, M, N, P } = { <"Lll", 47, 0.004>, <"Mm", 60, 0.95>,  <"Nnnn", 33, 0.2892>,  <"P", 47, 0.0125> }
B = { l, m, n, o } = { <"l", 46, 0.004>, <"m", 0, 0.95>,  <"nn", 33, 0.2892>,  <"oOo", 33, 0.5773> }

... and I want to train an algorithm on known-good data so that it learns to link the set members as

{ <L, l>, <M, m>, <N, n>, <?, o>, <P, ?> }

... with at most one match for each element of either set. The sets need not be the same size, and there are no guarantees about their overlap: maybe no matches, maybe all matches, maybe a mix of matches and non-matches. But in many cases there is expected to be a human-identifiable matching, and the computer should approximate it.

Tried so far

H(a, b, w1, w2, w3) scores a pair of tuples <a1, a2, a3> from A and <b1, b2, b3> from B as f1(a1, b1) * w1 + f2(a2, b2) * w2 + f3(a3, b3) * w3, where f1, f2, and f3 are hand-crafted similarity functions and w1, w2, and w3 are learned weights. I sort all pairs in A × B by their scores and keep the pairs for which neither member already appears in a higher-scored pair. I train the weights with a crude hill-climbing search so that the resulting pairs reproduce what the training data expects. A perfect weight configuration admits a threshold t that separates the scores S_ab of correct pairs from those of incorrect pairs.

This algorithm routinely finds perfect configurations after a few hundred or thousand iterations on my training data of about 800 (A, B) sets totaling 2500 pairs of 8-tuples (rather than the 3-tuples illustrated). I have yet to run it against a validation dataset to find out how badly this method is overfitting.
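To make that concrete, here is a rough Python sketch of the current pipeline. The similarity functions are simplified stand-ins for my hand-crafted f1, f2, f3, and the training-set and gold-link formats are just illustrative assumptions:

```python
import random

def f1(a, b):
    """Stand-in string similarity: shared lowercase prefix, normalized."""
    a, b = a.lower(), b.lower()
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def f2(a, b):
    """Stand-in numeric closeness, mapped into (0, 1]."""
    return 1.0 / (1.0 + abs(a - b))

f3 = f2  # the third field is also numeric in the example
FEATURES = (f1, f2, f3)

def score(a, b, w):
    """H(a, b, w) = f1(a1, b1)*w1 + f2(a2, b2)*w2 + f3(a3, b3)*w3."""
    return sum(f(x, y) * wi for f, x, y, wi in zip(FEATURES, a, b, w))

def greedy_match(A, B, w):
    """Sort all pairs of A x B by score, descending; keep a pair only if
    neither member is already claimed by a higher-scored pair."""
    pairs = sorted(
        ((score(a, b, w), i, j) for i, a in enumerate(A) for j, b in enumerate(B)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for _s, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j))
    return matches

def errors(w, training_sets):
    """Disagreement between produced links and known-good links.
    Each training item is (A, B, gold) with gold a set of (i, j) index pairs."""
    return sum(len(set(greedy_match(A, B, w)) ^ gold) for A, B, gold in training_sets)

def hill_climb(training_sets, n_weights=3, iters=2000, step=0.1):
    """Crude hill-climbing: nudge one weight at random, keep non-worsening moves."""
    w = [1.0] * n_weights
    best = errors(w, training_sets)
    for _ in range(iters):
        cand = list(w)
        cand[random.randrange(n_weights)] += random.uniform(-step, step)
        e = errors(cand, training_sets)
        if e <= best:
            w, best = cand, e
        if best == 0:
            break
    return w, best
```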

I'm not happy about the hard-coded treatment of the set-ness aspect of the problem. I can only imagine machine learning techniques for scoring pairs; the subsequent mapping is hand-crafted and perhaps isn't as smart as an ideal solution that considers the set-mapping as a whole. Because the machine learning part never considers the whole set, it seems to be missing out on information it could use to make better decisions.

I think my illustration above could be refactored to first score all pairs in A × B as S_ab = <f1(a1, b1), f2(a2, b2), ..., fn(an, bn)> (for n-tuples) and then train an [n, ?, 1] neural network on the matches and non-matches, one S_ab per candidate pair. This considers one pair at a time and outputs match/no-match, but it still does nothing to consider the set as a whole.
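A sketch of that refactoring, reusing FEATURES from the sketch above; scikit-learn's MLPClassifier is just one way to get the [n, ?, 1] shape, and the gold links here are the ones from my example:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def pair_features(a, b):
    """S_ab = <f1(a1, b1), ..., fn(an, bn)>: one similarity per field."""
    return [f(x, y) for f, x, y in zip(FEATURES, a, b)]

A = [("Lll", 47, 0.004), ("Mm", 60, 0.95), ("Nnnn", 33, 0.2892), ("P", 47, 0.0125)]
B = [("l", 46, 0.004), ("m", 0, 0.95), ("nn", 33, 0.2892), ("oOo", 33, 0.5773)]
gold = {(0, 0), (1, 1), (2, 2)}  # <L, l>, <M, m>, <N, n>; P and o unmatched

# One training example per candidate pair in A x B: label 1 iff it is a gold link.
X = np.array([pair_features(a, b) for a in A for b in B])
y = np.array([int((i, j) in gold) for i in range(len(A)) for j in range(len(B))])

# [n, 8, 1]: n similarity features in, one hidden layer, match probability out.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)

# P(match) for any candidate pair; the greedy one-to-one pass from before
# would then run on these probabilities instead of the weighted sums.
p_match = clf.predict_proba([pair_features(A[0], B[0])])[0, 1]
```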

It is my understanding that neural networks don't handle variable-sized input, though perhaps I could choose an upper bound on |A| and |B| and find some neutral encoding to pad the unused nodes. The output could then be a matrix of matches, with the elements of A indexing the rows and those of B the columns, say. But wouldn't the net still be sensitive to the order of the elements?
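For what it's worth, the padding idea might look like the sketch below (MAX_N and the PAD sentinel are arbitrary choices, and score() comes from the first sketch); the order problem is visible right away, since shuffling A just permutes the rows:

```python
import numpy as np

MAX_N = 10   # assumed upper bound on |A| and |B|
PAD = -1.0   # sentinel for cells with no element behind them

def padded_scores(A, B, w):
    """Encode one (A, B) problem as a fixed MAX_N x MAX_N input matrix;
    cell (i, j) holds score(A[i], B[j], w), unused cells hold PAD."""
    M = np.full((MAX_N, MAX_N), PAD)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            M[i, j] = score(a, b, w)
    return M

# The same underlying problem with A shuffled yields a row-permuted matrix,
# so a plain feed-forward net would have to learn every permutation separately.
```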

So ...

Is there a machine learning technique that could reliably map sets to sets in this way? The problem is related to record linkage in obvious ways, and it is a constraint satisfaction problem in that each element can be matched at most once. It would be ideal if human corrections of the results could be incorporated as feedback to improve future results. If you have a way, could you please spell it out for me? I'm not well versed in machine learning concepts.
