I am a beginner learning Pig Latin, and I need to extract records from a file. I have created two files, T1 and T2. Some tuples are common to both files, so I need to extract the tuples present only in T1 and omit the tuples common to T1 and T2. Can someone please help me?
Thanks
Firstly, you'll want to take a look at this Venn Diagram. What you want is everything but the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not common, you will want to filter the result of the JOIN to only contain lines that have one null (the non-intersecting part of the Venn Diagram).

This is how it would look in a pig script:
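A minimal sketch of those steps, not necessarily the exact script from the answer. It assumes each file holds one tuple per line in a single field named `t`; the relation names `B`, `C`, `D` and the field name `t` come from the text, while the file paths, the `num1`/`num2` schema, and the `LOAD` statements are assumptions made for illustration.

```pig
-- Assumed (hypothetical) schema: one tuple per line, e.g. (1,2)
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

-- Full outer join: a row without a match gets null on the missing side
B = JOIN T1 BY t FULL OUTER, T2 BY t;

-- Keep only the rows where one side is null (the non-intersecting part)
C = FILTER B BY T1::t IS NULL OR T2::t IS NULL;

-- Collapse the two columns into one, taking whichever side is not null
D = FOREACH C GENERATE (T1::t IS NOT NULL ? T1::t : T2::t) AS t;

DUMP D;
```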
Walking through the steps using this sample input:
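The sample data itself is not shown here, so the following uses stand-in values purely for illustration: `T1` is assumed to contain `(1,2)` and `(3,4)`, and `T2` to contain `(4,5)` and `(1,2)`, one tuple per line. The file paths and schema are likewise assumptions. The comments show what each relation should contain for that input; the order of rows in Pig output is not guaranteed.

```pig
-- Hypothetical input files, one tuple per line:
--   T1: (1,2)      T2: (4,5)
--       (3,4)          (1,2)
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

B = JOIN T1 BY t FULL OUTER, T2 BY t;
DESCRIBE B;  -- shows two columns, named T1::t and T2::t
DUMP B;
-- ((1,2),(1,2))   present in both files
-- ((3,4),)        only in T1, so the right side is null
-- (,(4,5))        only in T2, so the left side is null

C = FILTER B BY T1::t IS NULL OR T2::t IS NULL;
DUMP C;
-- ((3,4),)
-- (,(4,5))

D = FOREACH C GENERATE (T1::t IS NOT NULL ? T1::t : T2::t) AS t;
DUMP D;
-- ((3,4))
-- ((4,5))
```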
`B` does the full outer JOIN, resulting in:

`T1` is the left tuple, and `T2` is the right tuple. We have to use `::` to identify which `t`, since they have the same name.

Now, `C` filters `B` so that only lines with a null are kept, resulting in:

This is the output you want, but it is a little messy to use.
`D` uses a bincond (the `?:`) to remove the null. So the final output will be:

Update:
If you want to keep only the left (T1) side of the join (or the right (T2) side, if you switch things around), you can do this:
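One way this could look, as a sketch under the same assumptions as before (a single tuple field `t` per relation; the file paths and `num1`/`num2` schema are hypothetical). Filtering on the right side being null leaves only the tuples that appear in T1 alone, so the left column can be projected directly and no bincond is needed.

```pig
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

B = JOIN T1 BY t FULL OUTER, T2 BY t;

-- A null right side means the tuple had no match in T2
C = FILTER B BY T2::t IS NULL;

-- Only the left (T1) column is kept
D = FOREACH C GENERATE T1::t AS t;

DUMP D;
```

To keep the T2-only tuples instead, filter on `T1::t IS NULL` and project `T2::t`.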
However, looking back at the original Venn Diagram, using a full JOIN is unnecessary. If you look at this different Venn Diagram, you can see that this covers the set you want without any extra operations. Therefore, you should change `B` to:
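The replacement statement is not shown here; presumably it is a left outer JOIN, since only the T1 side is wanted. A sketch under that assumption, with the same hypothetical file paths and schema as above:

```pig
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

-- Left outer join: every T1 row is kept, and T2-only rows are never produced
B = JOIN T1 BY t LEFT OUTER, T2 BY t;

-- T1 rows without a match in T2 have a null right side
C = FILTER B BY T2::t IS NULL;

D = FOREACH C GENERATE T1::t AS t;

DUMP D;
```

Compared with the full outer JOIN, this avoids generating the T2-only rows that the filter would discard anyway.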