Incorporating new articles in tfidf vector for online clustering

188 Views Asked by aman2357 At 19 June 2015 at 11:28

I am building an online news clustering system in Java using the Lucene and Mahout libraries. I intend to use the vector space model with TF-IDF weights and K-means (or fuzzy/streaming K-means).

My plan is:

1. Cluster the initial articles.
2. Assign each new article to the cluster whose centroid is closest, provided the distance falls within a small threshold.
3. Treat the leftover documents that are not associated with any existing cluster as new data (new topics). Cluster them separately among themselves and add these temporary cluster centroids to the previous centroids.
4. Less frequently, run the full batch clustering to recluster the entire set of documents.

The problem arises when comparing a new article to a centroid in order to assign it to an existing cluster. The dimension of the centroid is the number of distinct words in the initial data, but the dimension of a new article's vector is different. I am following the book Mahout in Action.

Is there an approach, or some sort of feature extraction, to handle this?

The following similar questions are still unanswered:

https://stats.stackexchange.com/questions/41409/bag-of-words-in-an-online-configuration-for-classification-clustering
https://stats.stackexchange.com/questions/123830/vector-space-model-for-online-news-clustering

Thanks in advance.


There is 1 best solution below

Has QUIT--Anony-Mousse On 20 June 2015 at 16:27

Increase the dimensionality as needed, using 0 as the value in each new dimension.

From a theoretical point of view, consider the vector space to be infinite-dimensional.
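One way to read this in code is to store each vector sparsely, keyed by term, so that any word a vector has never seen is implicitly 0. The following is a minimal sketch in plain Java rather than Mahout's own vector classes. The class name `SparseTfidf`, its method names, the use of cosine distance, and the example weights and threshold are all assumptions made for illustration, and the TF-IDF weights are taken as already computed.

```java
import java.util.List;
import java.util.Map;

public class SparseTfidf {

    // Dot product of two sparse vectors; a term missing from either map counts as 0.
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            sum += e.getValue() * large.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    // Cosine distance (1 - cosine similarity); an all-zero vector is treated as maximally distant.
    public static double cosineDistance(Map<String, Double> a, Map<String, Double> b) {
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 1.0;
        }
        return 1.0 - dot(a, b) / (na * nb);
    }

    // Index of the closest centroid, or -1 if none is within the threshold.
    public static int assign(Map<String, Double> doc,
                             List<Map<String, Double>> centroids,
                             double threshold) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = cosineDistance(doc, centroids.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return bestDist <= threshold ? best : -1;
    }

    public static void main(String[] args) {
        // Hypothetical centroids built from the initial vocabulary.
        List<Map<String, Double>> centroids = List.of(
                Map.of("election", 0.8, "vote", 0.6),
                Map.of("match", 0.7, "goal", 0.7));

        // A new article containing a word ("blockchain") that no centroid has seen.
        Map<String, Double> article = Map.of("election", 0.5, "vote", 0.5, "blockchain", 0.9);
        System.out.println(assign(article, centroids, 0.5));

        // An article sharing no terms with any centroid is left unassigned.
        Map<String, Double> newTopic = Map.of("blockchain", 1.0);
        System.out.println(assign(newTopic, centroids, 0.5));
    }
}
```

Because absent terms contribute nothing to the dot product, the centroid never has to be physically resized; an unseen word in a new article only increases that article's norm, which pushes it slightly further from every existing centroid.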
