Identify trending topics in Twitter


I am using Spark Streaming to stream real-time tweets (filtered to English only) and store them in Cassandra. I then plan to run K-means or LSI (using Spark MLlib) to identify trending topics.

I need hints on how to represent these tweets as a matrix (vector) representation. Also, is it right to train the model on the stored data and then run it on the streamed data?

There is 1 best solution below

It all depends on the features and the language you are using.

You could represent each tweet as a vector with one column per word in the vocabulary, where each value is a weight from a metric such as TF-IDF (normalized if you want the values to lie between 0 and 1). Then run k-means on an RDD of these vectors, either dense or sparse.

https://spark.apache.org/docs/1.1.0/mllib-clustering.html

https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf
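
To make the vector representation above concrete, here is a minimal sketch in plain Python (standard library only, not Spark). The function names `tokenize` and `tfidf_vectors`, the whitespace tokenizer, the smoothed IDF formula and the L2 normalization are all assumptions chosen for illustration. Each tweet becomes a sparse vector stored as a `{column_index: weight}` dict, which is the same idea as a sparse vector in MLlib.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace,
    # keep purely alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(tweets):
    """Return (vocab, vectors); each vector is a sparse {column: weight} dict."""
    docs = [Counter(tokenize(t)) for t in tweets]
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    # Document frequency: number of tweets containing each word.
    df = Counter(w for d in docs for w in d)
    # Smoothed IDF (an assumption; other variants exist).
    idf = {w: math.log((n + 1) / (df[w] + 1)) for w in vocab}
    vectors = []
    for d in docs:
        vec = {index[w]: tf * idf[w] for w, tf in d.items()}
        # L2-normalize so every weight lies between 0 and 1.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {i: v / norm for i, v in vec.items()}
        vectors.append(vec)
    return vocab, vectors


tweets = ["spark streaming rocks", "spark mllib kmeans", "cats and dogs"]
vocab, vectors = tfidf_vectors(tweets)
for tweet, vec in zip(tweets, vectors):
    print(tweet, {vocab[i]: round(w, 3) for i, w in vec.items()})
```

Words that appear in many tweets (here "spark") get a lower weight than words that appear in only one, and the sparse dict form avoids storing the zeros that dominate when the vocabulary is large.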