I am using Spark Streaming to stream real-time tweets (filtered to English only) and store them in Cassandra. I then plan to run K-means / LSI algorithms (using Spark MLlib) to identify trending topics.
I need hints on how to represent these tweets in a matrix (vector) representation. Further, I want to know whether it is right to train the model on the stored data and then run the model on the streamed data.
It all depends on the features you choose and the language you are working in.
You could represent each tweet as a vector with one column per word in the vocabulary, each value weighted by some metric such as TF-IDF. Then run k-means on the resulting RDD of (dense or sparse) vectors.
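As a concrete, Spark-free sketch of that representation: the `tfidf_vectors` helper below is a hypothetical toy function (not part of MLlib) that builds dense TF-IDF vectors over a shared vocabulary, one row per tweet. In a real pipeline you would use MLlib's `HashingTF`/`IDF` instead and feed the resulting vectors to `KMeans.train`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into dense TF-IDF vectors.

    Columns are the sorted vocabulary of all words; each row is one
    document's vector. Uses smoothed IDF so unseen-word weights stay finite.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts for this document
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

tweets = [
    "spark streaming is great".split(),
    "kmeans clustering with spark".split(),
]
vocab, vecs = tfidf_vectors(tweets)
```

For real tweet volumes the vocabulary gets large, so sparse vectors (or MLlib's hashing trick) are the practical choice; this dense version is only meant to show the word-column/weight idea.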
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf