extracting non-matching records between files in Pig Latin

4.4k Views Asked by At

I am beginner, learning Pig latin. Need to extract the records from the file. Have created two files T1 and T2, Some tuples are common to both the files, So need to extract the tuples present only in T1 and need to omit the common tuples between T1 & T2. Can someone please help me...

Thanks

2

There are 2 best solutions below

4
On

Firstly you'll want to take a look at this Venn Diagram. What you want is everything but the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not common, you will want to filter the result of the JOIN to only contain lines that have one null (the non-intersecting part of the Venn Diagram).

This is how it would look in a pig script:

-- T1 and T2 are the two sets of tuples you are using, their schemas are:
-- T1: {t: (num1: int, num2: int)}
-- T2: {t: (num1: int, num2: int)}
-- Yours will be different, but the principle is the same

B = JOIN T1 BY t FULL, T2 BY t ;
C = FILTER B BY T1::t is null OR T2::t is null ;
D = FOREACH C GENERATE (T1::t is not null? T1::t : A2::t) ;

Walking through the steps using this sample input:

T1:      T2:
(1,2)    (4,5)
(3,4)    (1,2)

B does the full outer JOIN resulting in:

B: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
((1,2),(1,2))
(,(4,5))
((3.4),)

T1 is the left tuple, and T2 is the right tuple. We have to use :: to identify which t, since they have the same name.

Now, C filters B so that only lines with a null are kept. Resulting in:

C: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
(,(4,5))
((3.4),)

This is the output you want, but it is a little messy to use. D uses a bincond (the ?:) to remove the null. So the final output will be:

D: {T1::t: (num1: int,num2: int)}
((4,5))
((3.4))

Update:
If you want to keep only the left (T1) (or right (T2) if you switch things around) side of the join. You can do this:

-- B is the same

-- We only want to keep tuples where the T2 tuple is null
C = FILTER B BY T2::t is null ;
-- Generate T1::t to get rid of the null T2::t
D = FOREACH C GENERATE T1::t ;

However, looking back at the original Venn Diagram, using a full JOIN is unnecessary. If you look at this different Venn Diagram, you can see that this covers the set you want without any extra operations. Therefore, you should change B to:

B = JOIN T1 BY t LEFT, T2 BY t ;
0
On

I believe there is a more efficient way to get it done, particularly if T1 and T2 are very large. I'm working on a dataset with several billions of rows per file, and I'm only interested in the rows of T2 which are not present in T1. Both files have the same schema, and similar size.

T1 = load '/path/to/file1' using PigStorage() as (
  f1,
  f2,
  f3);

T1 = foreach T1 generate
  $0.., --all fields
  1 as day1,
  0 as day2);

T2 = load '/path/to/file2' using PigStorage() as (
  f1,
  f2,
  f3);

T2 = foreach T2 generate
  $0.., --all fields
  0 as day1,
  1 as day2);

T3 = union T1, T2;
-- assuming f1 is your join field
T3grouped = foreach (group T3 by f1) generate 
   flatten(T3),
   sum(T3.day1) as day1,
   sum(T3.day2) as day2;

T3result = filter T3grouped by day1==0;

This will return the rows having f1's which did not appear on day1. It is equivalent to

T3 = T2 by f1 LEFT OUTER, T1 by f1;
T3result = filter T3 by T1::f1 is null

but much faster. The UNION version runs in ~10 minutes, the JOIN version has been running for >2hrs (and still has not finished). Looking at the counters, the UNION version generates more I/O (particularly around the mappers), but uses only 50% the CPU.