How do I query nested JSON after loading it with Elephant Bird?

I'm pretty new to Hadoop and Pig.

I have single-line JSON files, all with the same schema:

{"name":"someName","pkg":[{"F1":"abc","F2":"44","F3":"xyz","F4":1024,"info":[{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},{"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]},{"F1":"abc","f2":"44","F3":"xyz","F4":1024,"new":[{"type":"event1","time":1372537000000,"more":"{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}

I load all of the JSON files using Elephant Bird:

data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);

So far the only thing that works for me is querying the "name" field, which returns a bytearray:

b = FOREACH data GENERATE json#'name' AS name;

I then tried to convert it to a map instead:

c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;

and got

c: {tuple_0: (m:map[])}

and the data looks like :

({([F1#"abc",F2#44...])})

So now I need to filter for all the records that have pkg.F1 = "abc", or all the ones that have pkg.info.value = 122, and so on.

How do I do it?

A code example would be very helpful, as I have already googled this a lot.

Thanks


There are 2 best solutions below

Try this:

c = FOREACH data GENERATE flatten(json#'name') as (m:map[]);

The problem is that you don't know how your data is organized in Pig. Use

DESCRIBE data;

to find out what the structure returned by JsonLoader is, and this should give you enough information about how to extract your data.
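
Building on the FLATTEN idea from the other answer, the two filters asked about might be sketched as below. This is an untested sketch, not a confirmed solution. It assumes that with '-nestedLoad' each JSON array arrives as a bag of single-field tuples holding a map, which matches the dump shown in the question, and that the nested data sits under the 'pkg' key rather than 'name'. The relation names pkgs, f1_abc, infos and value_122 are made up for illustration.

```pig
-- Assumes `data` was loaded as in the question: (json:map[]) via JsonLoader('-nestedLoad').
-- Produce one row per element of the "pkg" array, exposing each element as a map.
pkgs = FOREACH data GENERATE json#'name' AS name, FLATTEN(json#'pkg') AS (p:map[]);

-- Keep the packages whose F1 is "abc". Map values are untyped, so cast before comparing.
f1_abc = FILTER pkgs BY (chararray)p#'F1' == 'abc';

-- Go one level deeper: one row per element of "info" within each package.
infos = FOREACH pkgs GENERATE name, p#'F1' AS f1, FLATTEN(p#'info') AS (i:map[]);

-- "value" is a JSON string in the sample ("122"), so compare it as a chararray.
value_122 = FILTER infos BY (chararray)i#'value' == '122';

DUMP f1_abc;
DUMP value_122;
```

Running DESCRIBE on pkgs and infos before filtering is a cheap way to check that each FLATTEN produced the schema assumed here.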