Ulrich Bodenhofer
2015-Dec-15 11:57 UTC
[R] define number of clusters in kmeans/apcluster analysis
Dear Luigi,

As the others have replied already, you cannot expect a clustering algorithm to produce exactly the result that you expect intuitively. The results of clustering algorithms depend largely on the parameters and, even more importantly, on the distance/similarity measure that is used. k-means, for instance, uses the Euclidean distance. As a result, it works nicely for spherical clusters that have approximately the same radius. APCluster, unless you choose a different similarity, uses negative squared distances, which leads to very similar properties. Your data set consists of two clusters, one of which is much more spread out. That some parts of the larger cluster are being assigned to the other cluster looks weird, but it is perfectly explained by the properties of the algorithms. There is a lot of literature around about the properties of clustering algorithms. That's my 2 cents about this.

In your case, however, as already pointed out in Bill Dunlap's reply, the scaling is the more important issue. k-means and apcluster do not perform any scaling of the data, and your two axes differ strongly in terms of scaling. Enter the following to see how the two clustering algorithms "see" your data (i.e. with two equally scaled axes):

    plot(z, xlim=c(0, 50), ylim=c(0, 50))

Given this, it is no longer surprising that both algorithms split the data in the way they do. Actually, if you re-scale the data, apcluster produces the result you expect:

    z2 <- scale(z)
    m <- apclusterK(negDistMat(r=2), z2, K=2, verbose=TRUE)
    plot(m, z2)
    plot(m, z)  ## it even works to superimpose the clustering result on the original data

I hope that helps.

Best regards,
Ulrich
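
P.S. For readers without the original data, the scaling effect described above can be sketched on made-up data. This is only an illustration under assumptions: the data are synthetic (not Luigi's), the names z, truth and agree are invented here, and base R's kmeans stands in for apclusterK so that no extra package is needed. The two groups differ only along y, whose range is tiny compared with that of x.

```r
set.seed(1)
n <- 100

# Synthetic data (an assumption, not the data from the thread):
# x is on a large scale and carries no group information,
# y is on a small scale and separates the two groups.
x <- rnorm(2 * n, mean = 25, sd = 10)
y <- c(rnorm(n, mean = 0, sd = 0.05), rnorm(n, mean = 1, sd = 0.05))
z <- cbind(x = x, y = y)
truth <- rep(1:2, each = n)

# Share of points assigned consistently with the true groups,
# ignoring which of the two labels k-means happened to pick.
agree <- function(cl, truth) {
  a <- mean(cl == truth)
  max(a, 1 - a)
}

# Unscaled: Euclidean distances are dominated by x,
# so the split largely ignores the real groups.
km_raw <- kmeans(z, centers = 2, nstart = 10)

# Scaled: both columns get mean 0 and standard deviation 1
# before clustering, so y can contribute to the distances.
z2 <- scale(z)
km_scaled <- kmeans(z2, centers = 2, nstart = 10)

acc_raw <- agree(km_raw$cluster, truth)
acc_scaled <- agree(km_scaled$cluster, truth)
print(c(raw = acc_raw, scaled = acc_scaled))
```

On data like this, the unscaled run should agree with the true groups little better than chance, while the scaled run should recover them almost completely. The same comparison can be made with apclusterK(negDistMat(r=2), z2, K=2) if the apcluster package is installed.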