How do I query nested JSON after loading it with Elephant Bird?

2.6k Views Asked by Rotem Slootzky At 23 October 2013 at 15:06

I'm pretty new to Hadoop and Pig.

I have single-line JSON files, all with the same schema (wrapped here for readability):

{"name":"someName","pkg":[
{"F1":"abc","F2":"44","F3":"xyz","F4":1024,"info":[
{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},
{"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]},
{"F1":"abc","f2":"44","F3":"xyz","F4":1024,"new":[
{"type":"event1","time":1372537000000,"more":"{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}

I load all of the JSON files using Elephant Bird:

data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);

So far the only thing that works for me is querying the "name" field, which returns a bytearray:

b = FOREACH data GENERATE json#'name' AS name;

I then tried to convert it to a map instead:

c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;

and get

c: {tuple_0: (m:map[])}

and the data looks like:

({([F1#"abc",F2#44...])})

So now I need to filter for all the records that have pkg.F1 = "abc", or all the ones that have pkg.info.value = 122, etc.

How do I do it?

A code example would be very helpful, as I have already googled this a lot.

Thanks


There are 2 best solutions below

reo katoa On 23 October 2013 at 15:31

The problem is that you don't know how your data is organized in Pig. Use

DESCRIBE data;

to find out what the structure returned by JsonLoader is, and this should give you enough information about how to extract your data.
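One caveat worth keeping in mind: because the LOAD statement in the question declares the schema as (json:map[]), DESCRIBE can only report that declared schema, not what is nested inside the map. A minimal sketch of inspecting both the schema and one actual record follows. The alias sample_rows is a hypothetical name introduced here, and the Elephant Bird jars are assumed to be registered already.

```pig
-- Assumes the Elephant Bird jars (and their JSON dependency) are already REGISTERed.
data = LOAD 'browsers/gzip'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);

-- Shows the declared schema only; nested keys are not listed for an untyped map.
DESCRIBE data;

-- Print a single record to see how arrays and objects are actually nested at runtime.
-- sample_rows is a hypothetical alias used for illustration.
sample_rows = LIMIT data 1;
DUMP sample_rows;
```

Looking at the dumped record is usually the quickest way to tell which keys hold bags (JSON arrays) and which hold maps (JSON objects).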

Kenneth Xian On 22 January 2014 at 16:29

Try this:

c = FOREACH data GENERATE flatten(json#'name') as (m:map[]);
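The same FLATTEN idea can be carried through to the two filters the question asks about. The following is a rough sketch rather than a tested solution. It assumes that the array of interest sits under the 'pkg' key (as the sample output in the question suggests), that the '-nestedLoad' option turns JSON arrays into bags and JSON objects into maps, and that the Elephant Bird jars are already registered. The aliases pkgs, f1_abc, infos and value_122 are hypothetical names introduced here.

```pig
-- Assumes the Elephant Bird jars (and their JSON dependency) are already REGISTERed.
data = LOAD 'browsers/gzip'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);

-- One row per element of the "pkg" array; each element is treated as a map.
pkgs = FOREACH data GENERATE
           (chararray) json#'name' AS name,
           FLATTEN(json#'pkg') AS (pkg:map[]);

-- Rows whose pkg.F1 equals "abc".
f1_abc = FILTER pkgs BY (chararray) pkg#'F1' == 'abc';

-- One row per element of the nested "info" array.
-- Elements of "pkg" without an "info" key have nothing to match and are filtered out below.
infos = FOREACH pkgs GENERATE
            name,
            pkg,
            FLATTEN(pkg#'info') AS (info:map[]);

-- "value" is a JSON string in the sample ("122"), so it is compared as a chararray.
value_122 = FILTER infos BY (chararray) info#'value' == '122';

DUMP f1_abc;
DUMP value_122;
```

Map lookups on an untyped map return bytearray, which is why each value is cast before the comparison. If a given Pig version rejects FLATTEN on the uncast value, an explicit cast to a bag of maps before flattening may be needed; that variation is not shown here.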
