HDBSCAN on MovieLens latent embeddings does not cluster well


I am working on a recommendation algorithm, which has now boiled down to finding the right clustering algorithm for the job.

Data

The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres and tags, and concatenated them into a single document per movie. This gives me about 10000 documents. These have been vectorized with TF-IDF, and the TF-IDF vectors have then been compressed by an autoencoder into 64-dimensional feature vectors (loss=0.0014, down from 22.14, in 30 epochs). The autoencoder is able to reconstruct the data well.
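A minimal sketch of the document-building and TF-IDF step, assuming scikit-learn's `TfidfVectorizer`. The four records and the `build_document` helper are hypothetical stand-ins, not the actual data or code, and the autoencoder stage is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in records: (title, genres, tags) for each movie.
movies = [
    ("Toy Story (1995)", "Adventure|Animation|Children|Comedy", "pixar fun"),
    ("Jumanji (1995)", "Adventure|Children|Fantasy", "board game fantasy"),
    ("Heat (1995)", "Action|Crime|Thriller", "heist crime"),
    ("Casino (1995)", "Crime|Drama", "mafia crime"),
]


def build_document(title, genres, tags):
    """Concatenate title, genres and tags into one text document."""
    return " ".join([title, genres.replace("|", " "), tags])


documents = [build_document(*movie) for movie in movies]

# One sparse TF-IDF row per movie; rows are L2-normalized by default.
vectorizer = TfidfVectorizer(lowercase=True)
tfidf = vectorizer.fit_transform(documents)

print(tfidf.shape)
```

Because the rows are L2-normalized, Euclidean distance between TF-IDF rows is monotonically related to cosine distance; whether that property survives the autoencoder depends on how it was trained.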

Clustering

Currently, I am working with HDBSCAN, since it should handle varying density, non-globular clusters, and arbitrary cluster shapes, so it seems like the right algorithm to use here. The 2D projection of the original 64-dimensional data (computed with t-SNE) shows what looks like a decently clusterable space, but I cannot get HDBSCAN to work properly. Setting min_cluster_size to 15-30 gives me this; any higher and it labels all points as noise, and lowering it gives me this. In other runs it puts the large majority of points into one cluster, with a few very small clusters and the rest as noise, like this. It seems like HDBSCAN can't handle the data, even though the space looks clusterable to me.

My Questions:

  1. How can fiddling with parameters help HDBSCAN to cluster this space?
  2. Is there a better algorithm for clustering such a space?
  3. Or is the data simply non-clusterable, from what you can see in the plots?

Thanks so much in advance, I've been struggling with this for hours now.
