Remove duplicated rows from 2 tables in different databases

703 Views Asked by Thiago Augustus Oliveira At 13 September 2017 at 20:24

I have two users tables in different databases, and I would like to get only the unique rows from one of those tables.

In the following example, I need the list of emails that have no duplicate name.

I am using Pentaho DI Kettle.

Table Users from database 1

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Snow   | [email protected]
2   | Sansa Stark | [email protected]
3   | Ayra Stark  | [email protected]

Table Users from database 2

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Stott  | [email protected]
2   | Jonh Jonh   | [email protected]
3   | Ayra Stark  | [email protected]

Desired Result

ID  | Name        | Email
--- | ----------- | -------------
1   | Jonh Snow   | [email protected]
2   | Sansa Stark | [email protected]

There are 2 best solutions below

AlainD On 14 September 2017 at 08:19

As far as I understand your question, you need to keep only the emails that are not duplicated in DB1 union DB2?

Well, follow your logic: read the data in (with one Table input step per DB connection), count the number of records per email (Memory Group by), and filter out the emails with a count greater than 1.

Use the Memory Group by step, which does not require sorted input. In the Group field, put the key: email. In the Aggregates, add the Number of rows (in the Type drop-down) and the First value (or Last value) of Name; otherwise this column will disappear from the stream.

Then use Add sequence if you need to create the ID on the output.
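To make the data flow concrete outside of Kettle, here is a minimal Python sketch of the same idea: union the two inputs, count rows per key, keep only keys that occur once, then number the output. It is not a Kettle transformation; the function name `keep_unduplicated` and the `example.com` addresses are placeholders made up for illustration, since the question's real emails are not visible.

```python
from collections import Counter

# Hypothetical sample rows: names follow the question, emails are placeholders.
db1_users = [
    {"name": "Jonh Snow", "email": "jonh.snow@example.com"},
    {"name": "Sansa Stark", "email": "sansa.stark@example.com"},
    {"name": "Ayra Stark", "email": "ayra.stark@example.com"},
]
db2_users = [
    {"name": "Jonh Stott", "email": "jonh.stott@example.com"},
    {"name": "Jonh Jonh", "email": "jonh.jonh@example.com"},
    {"name": "Ayra Stark", "email": "ayra.stark@example.com"},
]


def keep_unduplicated(rows, key):
    """Keep rows whose key value occurs exactly once, then add a 1-based id."""
    # Roughly what "Memory Group by" with "Number of rows" computes per key.
    counts = Counter(row[key] for row in rows)
    # Roughly the "Filter rows" step: drop every key with a count above 1.
    kept = [row for row in rows if counts[row[key]] == 1]
    # Roughly the "Add sequence" step: generate the output ID.
    return [{"id": i, **row} for i, row in enumerate(kept, start=1)]


# The union of both inputs, grouped on email as suggested above.
result = keep_unduplicated(db1_users + db2_users, key="email")
for row in result:
    print(row["id"], row["name"], row["email"])
```

Under these assumed emails, the Ayra Stark rows are dropped from both sides and the rows from both databases that have no duplicate remain. Passing `key="name"` instead groups on the name column.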

Smrat Srivastava On 19 September 2017 at 06:51

Perform a UNION ALL (simply join the two inputs into a Dummy step).
Sort on email.
Use Unique rows on Name.
Use a Stream lookup on Name, with table 1 as one input and the unique rows as the second.
Filter rows on id < 3 and id IS NULL.
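
The lookup-and-filter part of these steps can be read as an anti-join: look up each row of table 1 by Name in the other table and keep only the rows that find no match. The Python sketch below shows that reading only, not the exact Kettle steps; the helper name `rows_without_match` is hypothetical, and the email column is left out because the real addresses are not visible in the question.

```python
# Names follow the question's two Users tables; emails are omitted.
table1 = [
    {"id": 1, "name": "Jonh Snow"},
    {"id": 2, "name": "Sansa Stark"},
    {"id": 3, "name": "Ayra Stark"},
]
table2 = [
    {"id": 1, "name": "Jonh Stott"},
    {"id": 2, "name": "Jonh Jonh"},
    {"id": 3, "name": "Ayra Stark"},
]


def rows_without_match(left_rows, right_rows, key):
    """Return the left rows whose key value does not appear in right_rows."""
    # Build the lookup set once, similar to the lookup side of a stream lookup.
    right_keys = {row[key] for row in right_rows}
    # Keep a left row only when the lookup finds nothing (the "is null" case).
    return [row for row in left_rows if row[key] not in right_keys]


unmatched = rows_without_match(table1, table2, key="name")
for row in unmatched:
    print(row["id"], row["name"])
```

With the sample names from the question, this keeps Jonh Snow and Sansa Stark, which matches the desired result table.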
