Retrieving Clustered Datapoints and Finding Optimal Cluster Count in ML.NET


I'm delving into machine learning and am currently experimenting with multiclass classification in ML.NET.

Background:

  • I have a dataset containing financial data and I aim to infer a category from it.
  • I've already tried a supervised approach which was successful.
  • For comparison, I'd like to try other approaches such as unsupervised and semi-supervised techniques. I'm yet to experiment with the semi-supervised approach.
  • With the unsupervised approach, my initial plan was to cluster the data using the k-means algorithm and then observe which datapoints reside within each cluster. Training the model and using it to predict clusters for datapoints works as expected.

Issues I'm Facing:

  1. Using the official documentation, I learned how to extract the coordinates of the discovered centroids. Is there a similar way to extract the datapoints that were assigned to these clusters during training? Or do I need to feed all my training data into the model a second time to retrieve the corresponding clusters? I want to know not only how new data is clustered, but also how my training data was clustered.
  2. Choosing the optimal number of clusters in k-means can be challenging. I'd like to use the silhouette method for this purpose. As I understand it, this should let me compare multiple models without being biased by their number of clusters. The number of clusters directly influences ClusteringMetrics.AverageDistance, so I assume that metric cannot be used to compare two models with different cluster counts. To compute the silhouette score, I need the coordinates of each datapoint as well as its assigned cluster. Is it possible to retrieve the coordinates for a particular datapoint, or even for all datapoints? If that isn't possible, is there an alternative way to compare multiple models?
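
To make clear what I mean by the silhouette score in item 2, here is a minimal sketch of the computation in plain Python. It is not ML.NET code and does not use any ML.NET API. It assumes the datapoint coordinates and their assigned cluster IDs have already been exported as two parallel lists, and it assumes Euclidean distance. The function name `silhouette_score` and its parameters are my own, chosen only for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance.

    points: list of equal-length coordinate tuples (assumed already exported).
    labels: list of cluster IDs, one per point (any hashable values).
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")

    # Group point indices by their cluster ID.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # by convention, a singleton cluster contributes 0

        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)

        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )

        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom

    return total / len(points)
```

The idea would be to train one model per candidate cluster count, compute this score for each, and prefer the count with the highest value. Note that this naive version is quadratic in the number of datapoints, so for a large dataset it would probably need to run on a sample.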

In general, I find the workflow of unsupervised learning in ML.NET a bit unclear. Any clarification or assistance would be appreciated.
