I have text documents which I am clustering with HDBSCAN. With a small amount of data (around 35 documents), where the correct number of clusters is around 14, I get the correct result using the following parameters:
import numpy as np
import hdbscan
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import linear_kernel
# FeatureTransform and features2mat are assumed to come from the nlputils package
from nlputils.features import FeatureTransform, features2mat

def cluster_texts(textdict, eps=0.40, min_samples=1):
    """
    Cluster the given texts.
    Input:
        textdict: dictionary with {docid: text}
    Returns:
        doccats: dictionary with {docid: cluster_id}
    """
    doc_ids = list(textdict.keys())
    # transform texts into length-normalized kpca features
    ft = FeatureTransform(norm='max', weight=True, renorm='length', norm_num=False)
    docfeats = ft.texts2features(textdict)
    X, featurenames = features2mat(docfeats, doc_ids)
    e_lkpca = KernelPCA(n_components=12, kernel='linear')
    X = e_lkpca.fit_transform(X)
    xnorm = np.linalg.norm(X, axis=1)
    X = X / xnorm.reshape(X.shape[0], 1)
    # compute cosine distances (rows are unit length, so 1 - dot product)
    D = 1 - linear_kernel(X)
    # and cluster with hdbscan; HDBSCAN has no `eps` parameter, the closest
    # equivalent is `cluster_selection_epsilon`
    clst = hdbscan.HDBSCAN(cluster_selection_epsilon=eps, metric='precomputed',
                           min_samples=min_samples, gen_min_span_tree=True,
                           min_cluster_size=2)
    y_pred = clst.fit_predict(D)
    return {did: y_pred[i] for i, did in enumerate(doc_ids)}
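A call on the small data set looks like this (the document ids and texts here are made up for illustration):

texts = {"doc%i" % i: "text of document number %i" % i for i in range(35)}
doccats = cluster_texts(texts, eps=0.40, min_samples=1)
# doccats maps each docid to a cluster id, e.g. {"doc0": 3, "doc1": 0, ...}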
Now I have simply replicated the data, repeating each document 100 times, and tried to fine-tune the clustering. But now I am getting 36 clusters, with every document in a different cluster. I tried changing different parameters, but the clustering result does not change.
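The replication step itself is straightforward; a minimal sketch, assuming textdict is the original 35-document dictionary and each copy gets a fresh id:

textdict_big = {"%s_copy%i" % (docid, i): text
                for docid, text in textdict.items()
                for i in range(100)}
doccats = cluster_texts(textdict_big)  # now 3500 points instead of 35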
Any suggestions or references would be much appreciated.
Obviously, if you replicate each point 100 times, you need to increase the minPts parameter (min_samples in hdbscan) 100x, and the minimum cluster size too.
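A minimal sketch of that scaling, starting from the min_samples=1 and min_cluster_size=2 used in the question (the exact values may still need tuning):

import hdbscan

clst = hdbscan.HDBSCAN(metric='precomputed',
                       min_samples=100,       # was 1 before each point was copied 100x
                       min_cluster_size=200,  # was 2 before each point was copied 100x
                       gen_min_span_tree=True)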
But your main problem is likely KernelPCA, which is sensitive to the number of samples you have, and not HDBSCAN.
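One way to test this is to drop the KernelPCA projection entirely and feed cosine distances computed on the raw feature matrix into HDBSCAN; a sketch, assuming X is the matrix returned by features2mat:

from sklearn.metrics.pairwise import cosine_distances
import hdbscan

D = cosine_distances(X)  # 1 - cosine similarity, no KernelPCA involved
clst = hdbscan.HDBSCAN(metric='precomputed', min_samples=100, min_cluster_size=200)
labels = clst.fit_predict(D)

If the clusters come back once the projection is removed, it is the KernelPCA step, not HDBSCAN, that breaks when the sample count grows.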