I have 2 users table in different databases and I would like to get only unique rows from one those tables.
On the following example, I need the list of emails which have no duplicate name.
I am using Pentaho DI Kettle.
Table Users from database 1
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Snow | [email protected]
--- | ----------- | -------------
2 | Sansa Stark | [email protected]
--- | ----------- | -------------
3 | Ayra Stark | [email protected]
Table Users from database 2
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Stott | [email protected]
--- | ----------- | -------------
2 | Jonh Jonh | [email protected]
--- | ----------- | -------------
3 | Ayra Stark | [email protected]
Desired Result
ID | Name | Email
--- | ----------- | -------------
1 | Jonh Snow | [email protected]
--- | ----------- | -------------
2 | Sansa Stark | [email protected]
As far as I understand your question, you need to keep only the emails which are not duplicates in DB1 union DB2?
Well, follow your logic: get the data in (with one
Input tableby DB connection), count the number of records per emails (Memory Group by) andFilterout the emails with a count greater than 1.Use the
Memory Group by, which do not requires sorting. In theGroup fieldput the key:email. And in theAggregatesput theNumber of rows(in the Type drop down), and theFirst Value(orLast Value) ofNameotherwise this column will disappear from the stream.And
Add a sequenceif you need to create the ID on the output.