I'm using pandas-dedupe to link a dataframe containing misspellings to another dataframe with record-level info. Here is a much simplified example:
import pandas as pd
df1 = pd.DataFrame({'a': ['cat', 'dog', 'frog', 'mouse', 'snake'],
                    'info': ['mammal', 'mammal', 'amphibian', 'mammal', 'reptile']})
df2 = pd.DataFrame({'a': ['caat', 'mous', 'dog', 'xfrogg', 'snak', 'xyzgiraff']})
I have separate training data in a CSV file, which looks like this once loaded:
df3 = pd.DataFrame({'orig': ['caat', 'mous', 'dog'], 'correct':['cat', 'mouse', 'dog']})
How can I pass the labels in df3 as the training data in my call to pandas_dedupe.link_dataframes? I've tried reading the dedupe documentation, but I'm not sure how to format df3 so that I can pass it as training data.
My suggestion is to create labels using pandas-dedupe rather than passing your own labels into link_dataframes.
pandas-dedupe saves the settings and the labels into a *_settings file and a *_training.json file, respectively. However, I would not recommend adding your own labels to the training file, since you might create a mismatch between the training file and the settings file.
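A minimal sketch of that suggestion, using the link_dataframes call from the question. The wrapper name link_with_interactive_labels and its parameter names are hypothetical, chosen only for illustration. As far as I know, the first run opens a console labeling session and later runs reuse the saved files, but treat that as an assumption to check against your installed version.

```python
def link_with_interactive_labels(df_left, df_right, fields):
    """Link two dataframes and let pandas-dedupe collect the labels itself.

    Hypothetical helper: only pandas_dedupe.link_dataframes is a library call.
    """
    # Imported here because the call below starts an interactive console
    # session, so nothing happens until the function is actually invoked.
    import pandas_dedupe

    # `fields` is the list of column names to compare, e.g. ['a'].
    # Assumption: the first run asks for match / no-match answers and writes
    # the settings and training files; later runs pick those files up.
    return pandas_dedupe.link_dataframes(df_left, df_right, fields)
```

With the frames from the question, this would be called as link_with_interactive_labels(df1, df2, ['a']). You then answer the prompts for pairs such as 'caat' / 'cat', which is where the knowledge in df3 gets used.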