I'm pretty new to Hadoop and Pig.
I have single-line JSON files, all with the same schema:
{"name":"someName","pkg":[{"F1":"abc","F2":"44","F3":"xyz","F4":1024,"info":
[{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},
{"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]} ,
{"F1":"abc","f2":"44","F3":"xyz","F4":1024,"new":[{"type":"event1", "time":1372537000000,"more":"
{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}
I load all of the JSON files using Elephant Bird:
data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
So far the only thing that works for me is querying the "name" field, which returns a bytearray:
b = FOREACH data GENERATE json#'name' AS name;
I then tried to convert it to a map instead:
c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;
and get:
c: {tuple_0: (m:map[])}
and the data looks like:
({([F1#"abc",F2#44...])})
Now I need to filter all the records where pkg.F1 = "abc", or all the ones where pkg.info.value = 122, and so on.
How do I do it?
A code example would be very helpful, as I have already googled this a lot.
Thanks
Try this:
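With '-nestedLoad', a JSON array inside the document comes back as a bag, so the idea is to FLATTEN it and then use the # operator on the resulting map. Here is a rough, untested sketch (the relation names are mine, and depending on your Pig / Elephant Bird versions you may need to adjust or add casts after the FLATTEN):

-- one row per record; json is a map of the whole document
data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- 'pkg' is a JSON array, so it is loaded as a bag;
-- FLATTEN gives one row per element, declared as a map so # lookups work
pkgs = FOREACH data GENERATE json#'name' AS name, FLATTEN(json#'pkg') AS (pkg:map[]);

-- filter on a field of each pkg element; the values are strings in your JSON,
-- so compare against chararray literals
abc_pkgs = FILTER pkgs BY pkg#'F1' == 'abc';

-- for the nested 'info' array, flatten one level deeper and filter again
-- (pkg elements without an 'info' key simply fall out of the filter)
infos = FOREACH pkgs GENERATE name, FLATTEN(pkg#'info') AS (info:map[]);
v122 = FILTER infos BY info#'value' == '122';

DUMP abc_pkgs;
DUMP v122;

Note that almost everything in your JSON except F4 and the timestamps is a string, so the comparisons are against quoted values like 'abc' and '122'. If Pig complains about the type of the flattened field, add an intermediate FOREACH that casts it explicitly (e.g. (map[])) before filtering.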