Fuzzy matching to find duplicates in a Google Cloud Data Fusion pipeline (or, is it possible to run Python scripts within Data Fusion?)

21 Views Asked by Juno At 25 March 2024 at 17:14

I have an existing dataset of customers in BigQuery and will receive monthly uploads of new data. I want a step in the upload pipeline that compares the new data against the existing data to find duplicates (that is, returning customers) and outputs two tables: one containing only one-time customers and the other containing only returning customers. Because a customer might give their name as 'Rob' at one location and 'Robert' at another, the comparison of first name, last name, and date of birth should allow some degree of fuzzy matching (which should also help guard against data entry errors).

For those familiar with Data Fusion, can you think of a way to do this?

I am more familiar with Python, so I can think of how I would do this there, but I am coming up blank on how to do something like this in Data Fusion. I see plugins that allow basic Python transformations in Data Fusion (the Python Evaluator transformation plugin seems promising, but it doesn't appear to allow the more complicated fuzzy matching between two datasets). Is there any way to run Python code the way Tableau Prep does, where a script simply takes in and returns a data frame?
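To make the goal concrete, here is a rough sketch of the kind of matching I mean, in plain Python with only the standard library's `difflib`. The field names (`first_name`, `last_name`, `dob`), the 0.75 threshold, the prefix rule for nicknames, and the sample records are all made up for illustration; none of this is Data Fusion code.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def first_name_score(a: str, b: str) -> float:
    """Like similarity(), but treat a prefix such as 'Rob' vs 'Robert' as a full match."""
    a, b = a.strip().lower(), b.strip().lower()
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= 3 and longer.startswith(shorter):
        return 1.0
    return similarity(a, b)


def is_same_customer(new: dict, old: dict, threshold: float = 0.75) -> bool:
    """Hypothetical rule: every compared field must reach the threshold."""
    return (
        first_name_score(new["first_name"], old["first_name"]) >= threshold
        and similarity(new["last_name"], old["last_name"]) >= threshold
        and similarity(new["dob"], old["dob"]) >= threshold
    )


def split_customers(new_rows: list, existing_rows: list):
    """Split the new upload into (one_time, returning) lists."""
    one_time, returning = [], []
    for row in new_rows:
        matched = any(is_same_customer(row, old) for old in existing_rows)
        (returning if matched else one_time).append(row)
    return one_time, returning


# Made-up sample data standing in for the BigQuery table and a monthly upload.
existing = [{"first_name": "Robert", "last_name": "Smith", "dob": "1990-04-12"}]
upload = [
    {"first_name": "Rob", "last_name": "Smyth", "dob": "1990-04-12"},
    {"first_name": "Alice", "last_name": "Jones", "dob": "1985-11-02"},
]
one_time, returning = split_customers(upload, existing)
```

With this sample data, the 'Rob Smyth' row lands in `returning` and the 'Alice Jones' row in `one_time`. The sketch compares every new row against every existing row, so it would presumably need some pre-filtering (for example, only comparing rows that share a birth year) to be usable on a real customer table.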

Any, *any* solution would be greatly appreciated.
