I am a beginner learning Pig Latin. I need to extract records from a file. I have created two files, T1 and T2. Some tuples are common to both files, so I need to extract the tuples present only in T1 and omit the tuples common to T1 and T2. Can someone please help me?
Thanks
Firstly, you'll want to take a look at this Venn Diagram. What you want is everything but the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not common, you will want to filter the result of the JOIN to only contain lines that have one null (the non-intersecting part of the Venn Diagram).

This is how it would look in a Pig script:
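A minimal sketch of those steps, assuming T1 and T2 each hold a single tuple field named t. The file paths, the LOAD statements, and the field names and types inside t are placeholders; yours will differ, but the principle is the same.

```pig
-- Assumed schema (hypothetical): each relation has one tuple field t
-- with two int fields. Each input line is expected to look like (1,2).
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

-- Full outer JOIN: rows without a match on the other side get a null there.
B = JOIN T1 BY t FULL OUTER, T2 BY t;

-- Keep only the rows where one side is null (the non-intersecting part).
C = FILTER B BY T1::t IS NULL OR T2::t IS NULL;

-- Collapse each row to whichever side is not null, using a bincond.
D = FOREACH C GENERATE (T1::t IS NOT NULL ? T1::t : T2::t) AS t;
```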
Walking through the steps using this sample input:
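The sample data below is hypothetical, chosen only to illustrate the walk-through. It assumes B is the full outer JOIN of T1 and T2 on t, C is the null filter on B, and D is the bincond projection of C, as described in the steps that follow. Row order in the DUMP output is not guaranteed and may differ.

```pig
-- Hypothetical contents of the two inputs (one tuple per line):
--   T1: (1,2)      T2: (4,5)
--       (3,4)          (1,2)

DUMP B;
-- Expected rows: the matched pair, plus one row per unmatched tuple
-- with a null (printed as empty) on the missing side.
-- ((1,2),(1,2))
-- ((3,4),)
-- (,(4,5))

DUMP C;
-- Expected rows: only those containing a null.
-- ((3,4),)
-- (,(4,5))

DUMP D;
-- Expected rows: the null side dropped.
-- ((3,4))
-- ((4,5))
```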
B does the full outer JOIN. In its output, T1 is the left tuple and T2 is the right tuple. We have to use :: to identify which t we mean, since they have the same name.
C filters B so that only lines with a null are kept. This is the output you want, but it is a little messy to use.
D uses a bincond (the ?:) to remove the null, so each line of the final output holds a single tuple.

Update:
If you want to keep only the left (T1) side of the join (or the right (T2) side, if you switch things around), you can do this:
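A sketch of that variant, assuming B is still the full outer JOIN of T1 and T2 on the tuple field t:

```pig
-- B is unchanged (full outer JOIN of T1 and T2 on t).
-- Keep only the rows where the T2 side is null, i.e. tuples found only in T1.
C = FILTER B BY T2::t IS NULL;

-- Project the T1 side to drop the null T2 column.
D = FOREACH C GENERATE T1::t AS t;
```

To keep the T2-only tuples instead, swap T1 and T2 in both statements.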
However, looking back at the original Venn Diagram, using a full JOIN is unnecessary. If you look at this different Venn Diagram, you can see that this covers the set you want without any extra operations. Therefore, you should change B to:
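The replacement is presumably a left outer JOIN, which keeps every T1 row and fills the T2 side with null where nothing matches. A sketch of the whole script under that assumption, with the same placeholder paths and schema as before (the output path is also hypothetical):

```pig
-- Assumed schema (hypothetical): one tuple field t with two int fields.
T1 = LOAD 'T1' AS (t:tuple(num1:int, num2:int));
T2 = LOAD 'T2' AS (t:tuple(num1:int, num2:int));

-- Left outer JOIN: every T1 row is kept; unmatched rows have a null T2 side.
B = JOIN T1 BY t LEFT OUTER, T2 BY t;

-- Rows with a null T2 side are the tuples present only in T1.
C = FILTER B BY T2::t IS NULL;
D = FOREACH C GENERATE T1::t AS t;

STORE D INTO 'only_in_T1';
```

Since no T2-only rows are produced, the join result is smaller than with a full outer JOIN, and the filter on T2::t still removes the common tuples.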