I am new to Pig and am trying to solve a word-count problem: counting how often each website occurs on every line of input, where each line holds one email followed by its websites. For example, my input dataset looks like this:
Input data
Email websites
e1 web1 web2 web3 web1 ....
e2 web2 web3 web2 web2 web4 ...
e3 web1 web2 web1 web4 .....
and my desired output is
Email websites
e1 web1(2) web2(1) web3(1) ....
e2 web2(3) web3(1) web4(1) ...
e3 web1(2) web2(1) web4(1) .....
My dataset has almost 50,000 email IDs (users).
I am assuming the email and the websites are tab-separated and the websites themselves are space-separated. The following is step-by-step code to get the desired output. The main idea is to tokenize the websites, flatten them so that each (email, website) pair becomes its own row, group by (email, tokenized website), generate the count, and then group by email.
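A minimal sketch of the load and tokenize steps, not necessarily the exact code originally posted. The file name 'input.txt' and the field names email, websites and website are assumptions made for illustration. TOKENIZE with no delimiter argument splits on spaces (and on a few other characters such as commas and parentheses), which should be fine for space-separated site names.

```pig
-- 'input.txt' is a hypothetical path; each line is: email<TAB>site site site ...
A = LOAD 'input.txt' USING PigStorage('\t') AS (email:chararray, websites:chararray);

-- TOKENIZE turns the websites string into a bag of tokens;
-- FLATTEN then emits one (email, website) row per token.
B = FOREACH A GENERATE email, FLATTEN(TOKENIZE(websites)) AS website;

DUMP B;
```

If the real file starts with a header line such as "Email websites", it would be loaded as an ordinary data row and would need to be filtered out first.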
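Dumping B gives one (email, website) row per token, so an email appears once for every website in its list, duplicates included.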
Now grouping by (email, tokenized website) and generating the count
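This step might look roughly like the following, assuming B holds one (email, website) row per token as described above. The field names website and visits are hypothetical.

```pig
-- Group identical (email, website) pairs together.
C = GROUP B BY (email, website);

-- FLATTEN(group) unpacks the grouping key back into two columns;
-- COUNT(B) counts the rows that fell into each group.
D = FOREACH C GENERATE FLATTEN(group) AS (email, website), COUNT(B) AS visits;

DUMP D;
```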
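Dumping D gives one row per (email, website) pair together with its count, for example web1 with a count of 2 for e1 in the sample data.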
Now grouping by email
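A sketch of the final grouping, assuming D has the fields (email, website, visits) from the previous step. Relation F is an optional extra that is not part of the steps described here; it only drops the repeated email from each tuple in the bag.

```pig
-- Collect all (email, website, visits) rows of one email into a single bag.
E = GROUP D BY email;

-- Optional: keep only the website and its count inside the bag.
F = FOREACH E GENERATE group AS email, D.(website, visits);

DUMP E;
```

The result is a bag of tuples per email rather than the literal web1(2) text from the question; producing that exact string would need an additional formatting step.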
Dumping E gives one row per email, with a bag holding that email's websites and their counts.
PS: I am a Pig newbie myself, so this solution may not be optimal.