Testing Cluster Assignment/Pattern Matching BIRCH Clusters


I have a dataset with more than 35K data points and more than 50 dimensions, and I clustered it with the BIRCH algorithm. While testing, I found that the cluster assignments do not match: some data points are closer to the center of a different cluster than to the center of the cluster they were assigned to, which should not happen. On analysis, the cause appears to be the merging of two clusters, one with few data points and one with very many. The center of the resulting cluster shifts toward the larger cluster, leaving points on the edge of the smaller cluster nearer to the center of some other cluster.
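The shift described above can be reproduced on a toy example. This is only a minimal sketch, assuming the usual clustering feature (N, linear sum, squared sum) with additive merging; the `CF` class and the data values are made up for illustration and are not taken from any particular BIRCH implementation.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class CF:
    """Hypothetical clustering feature: point count, linear sum, squared sum."""
    n: int
    ls: tuple
    ss: float

    @classmethod
    def from_points(cls, points):
        dims = len(points[0])
        ls = tuple(sum(p[d] for p in points) for d in range(dims))
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def merge(self, other):
        # CF entries are additive, so merging is a component-wise sum.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    @property
    def centroid(self):
        return tuple(v / self.n for v in self.ls)


# Made-up data: a small cluster around x=0, a large one at x=6, a third at x=-5.
small = CF.from_points([(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)])
large = CF.from_points([(6.0, 0.0)] * 97)
other = CF.from_points([(-5.0, 0.0)] * 10)

merged = small.merge(large)   # centroid is pulled toward the large cluster
edge_point = (-1.0, 0.0)      # a point on the edge of the small cluster

dist_before = dist(edge_point, small.centroid)
dist_merged = dist(edge_point, merged.centroid)
dist_other = dist(edge_point, other.centroid)

print("centroid after merge:", merged.centroid)
print("distance to own (merged) centroid:", dist_merged)
print("distance to other centroid:", dist_other)
```

In this toy setup the edge point ends up nearer to the third cluster's centroid than to the centroid of the merged cluster that contains it, which is the same kind of mismatch.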

I would like to confirm my understanding and learn whether there are any proven methods to mitigate this issue.

Best answer

When implementing BIRCH, it is easier to start with data that has far less overlap and then confirm that everything is in order using all 4 distance measures on good sample data. Otherwise this gets complicated and ugly really fast with BIRCH and becomes a debugging nightmare.
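One way to sanity-check a distance measure on small sample data is to compute it from the clustering features and compare it with a brute-force computation over the raw points. The sketch below does this for two measures only, the centroid Euclidean distance and the average inter-cluster distance; the helper names and the (N, LS, SS) tuple layout are assumptions for illustration, not the API of any library.

```python
from math import sqrt


def cf(points):
    """Build a clustering feature (N, LS, SS) from a list of points."""
    dims = len(points[0])
    ls = [sum(p[d] for p in points) for d in range(dims)]
    ss = sum(x * x for p in points for x in p)
    return len(points), ls, ss


def centroid_distance(a, b):
    """Euclidean distance between the two centroids LS/N."""
    (n1, ls1, _), (n2, ls2, _) = a, b
    return sqrt(sum((x / n1 - y / n2) ** 2 for x, y in zip(ls1, ls2)))


def average_inter_cluster_distance(a, b):
    """Root mean squared distance over all cross-cluster pairs, from CFs only."""
    (n1, ls1, ss1), (n2, ls2, ss2) = a, b
    dot = sum(x * y for x, y in zip(ls1, ls2))
    return sqrt((n2 * ss1 + n1 * ss2 - 2 * dot) / (n1 * n2))


def brute_force_inter(points_a, points_b):
    """Same quantity computed directly from the raw points, as a reference."""
    total = sum(
        sum((x - y) ** 2 for x, y in zip(p, q))
        for p in points_a
        for q in points_b
    )
    return sqrt(total / (len(points_a) * len(points_b)))


# Made-up, well-separated sample clusters.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

from_cf = average_inter_cluster_distance(cf(cluster_a), cf(cluster_b))
reference = brute_force_inter(cluster_a, cluster_b)
print(centroid_distance(cf(cluster_a), cf(cluster_b)), from_cf, reference)
```

If the CF-based value and the brute-force value disagree on such simple data, the CF bookkeeping is the first place to look.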

If you are seeing a shift, it may really be a problem with how you are using the intra-cluster distance measure. The other possible explanation is a bug in your CF tree generation itself. Check with an independent, well-tested implementation (for example in R or MATLAB) to see whether the points that caused the merge are detected as lying within an overlapping subspace. Then remove the data points causing the overlap and try again with your implementation. If the error goes away, that is a good indication that you have a bug in CF generation (i.e., you are splitting or merging when you shouldn't be).
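To find the points worth inspecting or removing, it may help to list every point whose nearest centroid differs from its assigned label. This is a rough diagnostic sketch, not part of BIRCH itself; `find_mismatches`, the labels, and the centroid values are hypothetical and would come from your own implementation's output.

```python
from math import dist


def nearest_centroid(point, centroids):
    """Return the label of the centroid closest to the point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))


def find_mismatches(points, labels, centroids):
    """List (index, assigned label, nearest label) where the two disagree."""
    mismatches = []
    for idx, (point, label) in enumerate(zip(points, labels)):
        nearest = nearest_centroid(point, centroids)
        if nearest != label:
            mismatches.append((idx, label, nearest))
    return mismatches


# Made-up output of a clustering run: final centroids and per-point labels.
centroids = {"A": (5.82, 0.0), "B": (-5.0, 0.0)}
points = [(-1.0, 0.0), (6.0, 0.0), (-5.0, 0.0)]
labels = ["A", "A", "B"]

mismatches = find_mismatches(points, labels, centroids)
print(mismatches)
```

Running the same check on the output of the independent implementation and on your own makes it easier to see whether the mismatches come from the data or from the CF tree code.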