I would like to use the reclin2 package, or any other record-matching package available in R, to match transactions reported by each of two market participants, based on several numerical and categorical attributes (quantity, price, location, address…).
Is it possible to use an R package and plug in a user-defined function that calculates the similarity per attribute between two records? For example, when comparing the values 90 and 100, can the similarity be calculated as a ratio, 90/100 = 0.9, instead of FALSE because they do not match exactly?
For example, all Report_IDs within the same Block_ID should be compared and a similarity score should be calculated.
reported transactions:
Data
structure(list(Report_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Block_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), Number = c(5,
3, 2, 10, 11, 100, 2, NA, 8, 2), Character = c("A", "A",
"B", "A", "B", "C", "D", "A", "D", "A")), class = "data.frame",
row.names = c(NA, -10L))
| Report_ID | Block_ID | Number | Character |
|---|---|---|---|
| 1 | 1 | 5 | A |
| 2 | 1 | 3 | A |
| 3 | 1 | 2 | B |
| 4 | 2 | 10 | A |
| 5 | 2 | 11 | B |
| 6 | 2 | 100 | C |
| 7 | 3 | 2 | D |
| 8 | 3 | #NA | A |
| 9 | 3 | 8 | D |
| 10 | 3 | 2 | A |
A result data frame with the similarities could look like this:
| Report_ID_combination | Similarity_Number | Similarity_Character |
|---|---|---|
| 1_2 | 0.60 | 1 |
| 1_3 | 0.40 | 0 |
| 2_3 | 0.67 | 0 |
| 4_5 | 0.91 | 0 |
| 4_6 | 0.10 | 0 |
| 5_6 | 0.11 | 0 |
| 7_8 | #NA | 0 |
| 7_9 | 0.25 | 1 |
| 7_10 | 1.00 | 0 |
| 8_9 | #NA | 0 |
| 8_10 | #NA | 1 |
| 9_10 | 0.25 | 0 |
As far as I know, the packages typically used for record matching/deduplication (reclin2, RecordLinkage, StatMatch, fedmatch) provide several distance measures for categorical features (Jaccard, Levenshtein, cosine...). For a relative numerical comparison that also handles NA, however, I have not found a suitable package yet.
Any hint from someone working in the R/record matching universe on this kind of issue is appreciated.
Thanks,
Marius
Use the following:
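One possible approach, given as a minimal base-R sketch rather than a definitive implementation: split the data by `Block_ID`, enumerate all pairs within each block with `combn()`, and apply one similarity function per attribute. The helper names `sim_ratio`, `sim_exact` and `pairs_in_block` are hypothetical (they do not come from any package), and `sim_ratio` assumes strictly positive numbers, since a pair of zeros would give `NaN`.

```r
# Example data from the question
df <- data.frame(
  Report_ID = 1:10,
  Block_ID  = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Number    = c(5, 3, 2, 10, 11, 100, 2, NA, 8, 2),
  Character = c("A", "A", "B", "A", "B", "C", "D", "A", "D", "A")
)

# Hypothetical helper: relative similarity min/max.
# pmin()/pmax() propagate NA, so a missing value on either side yields NA.
# Assumes strictly positive inputs.
sim_ratio <- function(x, y) pmin(x, y) / pmax(x, y)

# Hypothetical helper: 1 for an exact match, 0 otherwise (NA if a value is missing)
sim_exact <- function(x, y) as.numeric(x == y)

# Hypothetical helper: compare every pair of rows inside one block
pairs_in_block <- function(block) {
  if (nrow(block) < 2) return(NULL)   # nothing to compare
  idx <- combn(nrow(block), 2)        # 2 x k matrix of row index pairs
  a <- block[idx[1, ], ]
  b <- block[idx[2, ], ]
  data.frame(
    Report_ID_combination = paste(a$Report_ID, b$Report_ID, sep = "_"),
    Similarity_Number     = round(sim_ratio(a$Number, b$Number), 2),
    Similarity_Character  = sim_exact(a$Character, b$Character)
  )
}

# Apply block by block and stack the results
result <- do.call(rbind, lapply(split(df, df$Block_ID), pairs_in_block))
rownames(result) <- NULL
result
```

On the example data this should give the 12 rows of the expected table in the question, with `NA` wherever one of the two numbers is missing. If you would rather stay inside reclin2, its `compare_pairs()` function documents a `comparators` argument that takes a named list of functions of two vectors, so a function like `sim_ratio` may be usable there as well; check the package documentation for the exact contract.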