I'm pretty new to Hadoop and Pig.
I have single-line JSON files that all share the same schema:
{"name":"someName","pkg":[{"F1":"abc","F2":"44","F3":"xyz","F4":1024,"info":
[{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},
{"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]} ,
{"F1":"abc","f2":"44","F3":"xyz","F4":1024,"new":[{"type":"event1", "time":1372537000000,"more":"
{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}
I load all of the JSON files using elephant-bird:
data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
So far the only thing that works for me is querying the "name" field, which returns a bytearray:
b = FOREACH data GENERATE json#'name' AS name;
I then tried to convert it to a map instead:
c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;
and get:
c: {tuple_0: (m:map[])}
and the data looks like:
({([F1#"abc",F2#44...])})
Now I need to filter for all the records that have pkg.F1 = "abc", or all the ones that have pkg.info.value = 122, and so on.
How do I do it?
A code example would be very helpful, as I have already searched for this a lot.
Thanks
The problem is that you don't know how your data is organized in Pig. Use
DESCRIBE
to find out what the structure returned by
JsonLoader is, and this should give you enough information about how to extract your data.
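As a rough sketch of where that inspection might lead: the script below assumes that, with '-nestedLoad', the "pkg" and "info" arrays come back as bags whose elements can be treated as maps, and that the elephant-bird jars are already registered. The alias names (pkgs, infos, f1_abc, value_122) are made up for illustration, and the script has not been run against the real data, so check each step with DESCRIBE before relying on it.

```pig
-- Assumes the elephant-bird jars are already REGISTERed, as in the question.
data = LOAD 'browsers/gzip'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);
DESCRIBE data;   -- inspect the loaded schema before drilling in

-- Assumption: json#'pkg' is a bag; FLATTEN yields one row per array element.
pkgs = FOREACH data GENERATE
           (chararray) json#'name' AS name,
           FLATTEN(json#'pkg') AS (pkg:map[]);
DESCRIBE pkgs;

-- Rows whose pkg.F1 equals "abc".
f1_abc = FILTER pkgs BY (chararray) pkg#'F1' == 'abc';

-- Same idea one level deeper: one row per element of pkg.info.
infos = FOREACH pkgs GENERATE name, FLATTEN(pkg#'info') AS (info:map[]);

-- "value" is a string in the sample JSON, so compare it as a chararray.
value_122 = FILTER infos BY (chararray) info#'value' == '122';

DUMP f1_abc;
DUMP value_122;
```

Map lookups without a declared schema return bytearray, which is why each key is cast before comparing. In the sample, the second pkg element has no "info" key, so it should not survive the last filter.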