Algorithm for identifying non-ambiguous clusters?


Clustering algorithms usually take into account that what a human being perceives as a reasonable cluster is ambiguous, and the computed solution is supposed to generalize and predict well.

This is why I am hesitant to just use the tried-and-tested algorithms for my specific situation - which is not to say that I am sure they won't work, or that they might not even be optimal. I would just like to verify that.

So let's look at the following example.

[Figure: scatter plot of the simulated example data - four compact, roughly linearly separable clusters, reproducible with the R code below.]

Essentially the clusters are obvious and, apart from a few exceptions, unambiguous, because they are actually linearly separable. The data I refer to are two-dimensional. Each cluster follows an unknown distribution with a single mode, and the clusters are independent of one another.

What algorithm performs well (in terms of speed, robustness, and simplicity) for this specific kind of cluster pattern?


# Rotate a point xy by 'deg' radians (the angle is in radians despite the name)
# around the centre 'cen'
rotate <- function(xy, deg, cen) {
  xy <- xy - cen
  c(
    xy[1] * cos(deg) - xy[2] * sin(deg),
    xy[2] * cos(deg) + xy[1] * sin(deg)
  ) + cen
}

# Four cluster centres on a 2 x 2 grid
G <- expand.grid(1:2, 1:2)

S <- list()
N <- 100
for (i in 1:nrow(G)) {
  # Skewed (gamma-distributed) cloud around the i-th grid point,
  # rotated by a random angle around its own centroid
  set <- data.frame(x = rgamma(N, 3, 2) * 0.2 + G[i, 1],
                    y = rgamma(N, 3, 2) * 0.1 + G[i, 2])
  S[[i]] <- t(apply(set, 1, rotate, runif(1, 0, pi), c(mean(set[, 1]), mean(set[, 2]))))
}

S <- do.call(rbind, S)  # one (4*N) x 2 matrix of points

plot(S)
There are 4 answers below.

Answer 1 (score 1):

Standard k-means clustering would work well and fast for the picture you gave. In general, k-means will work well for data like yours, except when some of the clusters are long, skinny ellipsoids and the center of one ellipsoid lies near the far end of another. In that case you are probably better off with a hierarchical (agglomerative) approach: greedily link the points that are closest to one another, then keep merging nearby groups until a threshold on the distance between groups is reached (or until you reach the desired number of clusters, if you know that number ahead of time).
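
A minimal sketch of both suggestions in R, assuming the matrix S from the question's code and that the number of clusters (here 4) is known; the "greedily merge nearest points" idea is read here as single-linkage agglomerative clustering:

km <- kmeans(S, centers = 4, nstart = 20)   # Lloyd's k-means with 20 random restarts
plot(S, col = km$cluster, pch = 19)

hc  <- hclust(dist(S), method = "single")   # greedily link nearest points, then merge groups
grp <- cutree(hc, k = 4)                    # or cutree(hc, h = <distance threshold>)
plot(S, col = grp, pch = 19)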

The one caveat with off-the-shelf k-means is that you need to know how many clusters you want. If you don't know that in advance, there are ways to choose the number of clusters from the data; take a look online if you're interested.
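
One common data-driven heuristic (my addition, not something the answer spells out) is an "elbow" plot of the total within-cluster sum of squares; the range 1:8 below is arbitrary:

wss <- sapply(1:8, function(k) kmeans(S, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
# pick the k after which the curve stops dropping sharply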

Answer 2 (score 1):

You can try alpha shapes. An alpha shape is a Delaunay triangulation with the edges that exceed alpha deleted.
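
A rough sketch of that idea in R, assuming the deldir and igraph packages and an edge-length threshold alpha that you would tune for your data (0.3 below is only a guess):

library(deldir)
library(igraph)

delaunay_clusters <- function(xy, alpha) {
  d    <- deldir(xy[, 1], xy[, 2])                    # Delaunay triangulation
  e    <- d$delsgs                                    # edges: x1, y1, x2, y2, ind1, ind2
  len  <- sqrt((e$x1 - e$x2)^2 + (e$y1 - e$y2)^2)
  keep <- e[len <= alpha, c("ind1", "ind2")]          # drop edges longer than alpha
  g <- graph_from_edgelist(as.matrix(keep), directed = FALSE)
  g <- add_vertices(g, max(0, nrow(xy) - vcount(g)))  # keep isolated points as own clusters
  components(g)$membership                            # connected components = clusters
}

cl <- delaunay_clusters(S, alpha = 0.3)
plot(S, col = cl, pch = 19)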

Answer 3 (score 0):

Most algorithms should do a decent job on data as easy as this.

Also have a look at density-based clustering methods, such as mean-shift and DBSCAN.

But in essence, any of them should do. Get a toolkit with a wide choice of algorithms, such as ELKI, and try out a few.
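
For DBSCAN specifically, a sketch using the dbscan package on CRAN; eps and minPts below are guesses that would need tuning (kNNdistplot() from the same package helps pick eps):

library(dbscan)
db <- dbscan(S, eps = 0.15, minPts = 5)
plot(S, col = db$cluster + 1, pch = 19)   # cluster 0 means "noise" points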

Answer 4 (score 0):

I agree with the density-based clustering suggestions (e.g., mean-shift). I assume you mean every pair of clusters is linearly separable? If you want to automatically check whether two clusters are linearly separable, it looks like you can compute their convex hulls and check whether they intersect: Determine whether the two classes are linearly separable (algorithmically in 2D)
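
As a practical shortcut (not the convex-hull test the link describes), one way to check pairwise linear separability in R is to fit a near-hard-margin linear SVM with the e1071 package and see whether it classifies the two clusters perfectly:

library(e1071)
linearly_separable <- function(A, B) {
  X <- rbind(A, B)
  y <- factor(rep(c("a", "b"), c(nrow(A), nrow(B))))
  fit <- svm(X, y, kernel = "linear", cost = 1e6, scale = FALSE)  # huge cost ~ hard margin
  all(predict(fit, X) == y)                                       # zero training error?
}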

So in theory you could run several trials of mean-shift with different kernel bandwidths, checking each resulting cluster for linear separability against every other cluster and also tracking some type of cluster score (see the Evaluation and assessment section of http://en.wikipedia.org/wiki/Cluster_analysis).
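
A very rough, dependency-light sketch of that loop: a hand-rolled Gaussian-kernel mean shift (so no particular mean-shift package API is assumed), scored with the average silhouette width from the cluster package; the bandwidths tried and the mode-merging tolerance are arbitrary choices:

library(cluster)

mean_shift_labels <- function(X, h, iters = 30, tol = h / 2) {
  modes <- X
  for (it in seq_len(iters)) {
    for (i in seq_len(nrow(modes))) {
      w <- exp(-colSums((t(X) - modes[i, ])^2) / (2 * h^2))  # Gaussian weights around current mode
      modes[i, ] <- colSums(X * w) / sum(w)                   # shift towards the weighted mean
    }
  }
  cutree(hclust(dist(modes), method = "single"), h = tol)     # merge near-identical modes
}

for (h in c(0.05, 0.1, 0.2, 0.4)) {
  lab <- mean_shift_labels(S, h)
  sil <- if (length(unique(lab)) > 1) mean(silhouette(lab, dist(S))[, 3]) else NA_real_
  cat(sprintf("h = %.2f: %d clusters, mean silhouette = %.3f\n",
              h, length(unique(lab)), sil))
}

The pairwise linearly_separable() check sketched above could be applied to each pair of resulting clusters inside the same loop.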

Although your data may look "obvious", coming up with a general solution that always produces the "obvious" output is not trivial unless you have some domain knowledge to take advantage of. For example, if you could make your mean-shift kernel bandwidth (the neighborhood around each point that you weight strongly) a function of some domain property, it might be easier.