Finding top 10 trending tweets in Hive

393 Views Asked by At

I am finding top 10 trending tweets in hive on basis of retweet_count ie The tweet which has highest retweet_count will be 1st and so-on....

Here is election Table details

id                      bigint                  from deserializer   
created_at              string                  from deserializer   
source                  string                  from deserializer   
favorited               boolean                 from deserializer   
retweeted_status        struct<text:string,user:struct<screen_name:string,name:string>,retweet_count:int>   from deserializer   
entities                struct<urls:array<struct<expanded_url:string>>,user_mentions:array<struct<screen_name:string,name:string>>,hashtags:array<struct<text:string>>> from deserializer   
text                    string                  from deserializer   
user                    struct<screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string,location:string>    from deserializer   
in_reply_to_screen_name string                  from deserializer  

my query

select text 
from election 
where retweeted_status.retweet_count IN  
     (select  retweeted_status.retweet_count as zz 
      from election  
      order by zz desc  
      limit 10);

It is returning me the same tweet 10 times. (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)

So what I did is break the nested query , when I run inner query

select  retweeted_status.retweet_count as zz 
from election  
order by zz desc  
limit 10

It return 10 different values (1210,1209,1208,1207,1206,....1201)

Afterwards when I run my outer query

select text 
from election  
where retweeted_status.retweet_count 
      IN  (1210,1209,1208,1207,1206,....1201 );

The results are same 10 tweets (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)

Whats wrong in my query logic ?

1

There are 1 best solutions below

6
On

Instead of using count you should use id. that is because if you have 100 tweet with same count not matter the LIMIT 10 you will get 100 records.

select text 
from election 
where id  IN  
     (select  id as zz 
      from election  
      order by retweeted_status.retweet_count desc  
      limit 10);

but still not sure why are you getting the wrong result.

EDIT (after my comment):

if my comment is correct then you will have the same id ten times. In that case change to

     (select distinct id as zz 
      from election  
      order by retweeted_status.retweet_count desc  
      limit 10);