Count repeated values per row using Pig for a multi-line dataset

I am new to Pig and am trying to solve a word-count problem (counting websites) over multiple lines of input. For example, my input dataset has the following values:

Input data

Email     websites
e1        web1 web2 web3 web1 ....
e2        web2 web3 web2 web2 web4 ...
e3        web1 web2 web1 web4 .....

and my desired output is

Email     websites
e1        web1(2) web2(1) web3(1) ....
e2        web2(3) web3(1) web4(1) ...
e3        web1(2) web2(1) web4(1) .....

My dataset has almost 50000 email IDs (users).

There is 1 solution below

This assumes that the email and websites columns are tab-separated and that the websites themselves are space-separated. The step-by-step code below produces the desired output. The main idea is to tokenize the websites, flatten them, group by (email, tokenized website), generate a count, and then group by email.

A = LOAD 'sample.txt' AS (email:chararray, urls:chararray);
B = FOREACH A GENERATE email AS email, FLATTEN(TOKENIZE(urls)) AS tokenize_urls;

Dumping B

e1  web1
e1  web2
e1  web3
e1  web1
e2  web2
e2  web3
......

Now group by (email, tokenized URL) and generate the count

C = GROUP B BY (email, tokenize_urls);
D = FOREACH C GENERATE group.email AS email, group.tokenize_urls AS url,
                COUNT(B) AS url_count;

Dumping D

e1  web1    2
e1  web2    1
e1  web3    1
e2  web2    3
....

Now group by email

E = GROUP D BY email;

Dumping E

e1  {(e1,web1,2),(e1,web2,1),(e1,web3,1)}
e2  {(e2,web2,3),(e2,web3,1)}
e3  {(e3,web1,2),(e3,web2,1),(e3,web4,1)}
......
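
E holds the counts as a bag of tuples, not in the exact "web1(2) web2(1)" layout the question asks for. One possible way to get closer to that layout is sketched below. It is not part of the steps above: the relation names F, G, H and the field url_with_count are made up for illustration, and it assumes a Pig version that ships the built-in BagToString (0.11 or later, as far as I know). It starts again from D and groups the formatted strings instead of using E. Note that the order of elements inside a bag is not guaranteed, so the websites may not come out sorted.

```pig
-- Continues from relation D (email, url, url_count) defined earlier.
-- F, G, H and url_with_count are hypothetical names used only in this sketch.

-- Build a string such as "web1(2)" for every (email, url) pair.
-- COUNT returns a long, so it is cast to chararray before concatenation.
F = FOREACH D GENERATE email,
        CONCAT(url, CONCAT('(', CONCAT((chararray)url_count, ')'))) AS url_with_count;

-- Collect all formatted strings that belong to one email.
G = GROUP F BY email;

-- Join the strings in each bag with a single space.
H = FOREACH G GENERATE group AS email,
        BagToString(F.url_with_count, ' ') AS websites;
```

Dumping H should then give one row per email with a single space-separated websites field. If a fixed order is needed, a nested ORDER inside the FOREACH before calling BagToString would be one option to try.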

PS: I am a newbie to Pig myself, so this solution may not be optimal.