I am currently writing a Hadoop program that outputs the top 100 most tweeted hashtags given a data set of tweets. I was able to output all the hashtags with the WordCount
program, so the output looks like this (ignore the quotation marks):
"#USA 2"
"#Holy 5"
"#SOS 3"
"#Love 66"
However, I ran into trouble when I attempted to sort them by their word frequencies (the values) with the code from here.
I noticed that the keys are integers instead of strings in the program input provided in the link above. I tried changing a few parameters in the code to fit my use case, but it didn't work out, as I don't understand them very well. Please help me!
You need a second MapReduce job whose input is the output of your first job. I have tweaked the code to make it work the way you want.
For Input
The output should be
I have assumed that the hashtag and the count are separated by a tab. If the delimiter is something else, please change it. The code is not tested, so please let me know if it works.
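For reference, here is a minimal plain-Java sketch of the logic the second job has to perform. It is not Hadoop code and not the tweaked code itself. It assumes tab-delimited `hashtag<TAB>count` lines, and the names `TopHashtags`, `Entry`, `parse`, and `topN` are made up for illustration. In a real job the swap would typically happen in the mapper (emitting the count as the key), the descending order would come from a custom sort comparator, and a single reducer would be needed to get one globally ordered list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopHashtags {
    // One parsed record. The count plays the role of the key in the second job.
    static final class Entry {
        final int count;
        final String hashtag;

        Entry(int count, String hashtag) {
            this.count = count;
            this.hashtag = hashtag;
        }
    }

    // "Map" step: split a "hashtag<TAB>count" line and swap the two fields.
    // Assumption: the first job wrote its output tab-delimited.
    static Entry parse(String line) {
        String[] parts = line.split("\t");
        return new Entry(Integer.parseInt(parts[1].trim()), parts[0]);
    }

    // "Sort and reduce" step: order by count, highest first, and keep the first n.
    static List<String> topN(List<String> lines, int n) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parse(line));
            }
        }
        entries.sort(Comparator.comparingInt((Entry e) -> e.count).reversed());

        List<String> result = new ArrayList<>();
        for (Entry e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.hashtag + "\t" + e.count);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample lines taken from the question's WordCount output.
        List<String> firstJobOutput =
                Arrays.asList("#USA\t2", "#Holy\t5", "#SOS\t3", "#Love\t66");
        for (String line : topN(firstJobOutput, 100)) {
            System.out.println(line);
        }
    }
}
```

With the four sample lines from the question, this prints #Love, #Holy, #SOS, #USA in that order, each followed by a tab and its count.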