I am finding top 10 trending tweets in hive on basis of retweet_count ie The tweet which has highest retweet_count will be 1st and so-on....
Here is election Table details
id bigint from deserializer
created_at string from deserializer
source string from deserializer
favorited boolean from deserializer
retweeted_status struct<text:string,user:struct<screen_name:string,name:string>,retweet_count:int> from deserializer
entities struct<urls:array<struct<expanded_url:string>>,user_mentions:array<struct<screen_name:string,name:string>>,hashtags:array<struct<text:string>>> from deserializer
text string from deserializer
user struct<screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string,location:string> from deserializer
in_reply_to_screen_name string from deserializer
my query
select text
from election
where retweeted_status.retweet_count IN
(select retweeted_status.retweet_count as zz
from election
order by zz desc
limit 10);
It is returning me the same tweet 10 times. (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)
So what I did is break the nested query , when I run inner query
select retweeted_status.retweet_count as zz
from election
order by zz desc
limit 10
It return 10 different values (1210,1209,1208,1207,1206,....1201)
Afterwards when I run my outer query
select text
from election
where retweeted_status.retweet_count
IN (1210,1209,1208,1207,1206,....1201 );
The results are same 10 tweets (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)
Whats wrong in my query logic ?
Instead of using count you should use id. that is because if you have 100 tweet with same count not matter the LIMIT 10 you will get 100 records.
but still not sure why are you getting the wrong result.
EDIT (after my comment):
if my comment is correct then you will have the same id ten times. In that case change to