amvds at xs4all.nl
2009-Jun-11 15:14 UTC
[R] Cluster analysis, defining center seeds or number of clusters
I use kmeans to classify spectral events in high and low 1/3 octave bands:

#Do cluster analysis
CyclA<-data.frame(LlowA,LhghA)
CntrA<-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE)
ClstA<-kmeans(CyclA,centers=CntrA,nstart=50,algorithm="MacQueen")

This works well when the actual data show 1, 2 or 3 groups that are not "too close" in a cross plot. The MacQueen algorithm will then give one or more empty groups, which is what I want.

However, there are cases where the groups are closer together, less compact or diffuse. Visually only 2 groups are apparent, but the algorithm returns 3, splitting one group in two.

I looked at the package 'cluster', specifically at clara (I cannot use pam as I have 10000 observations). But clara always returns as many groups as you ask for.

Is there a way to help find a seed for the initial cluster centers? Equivalently, is there a way to find a priori the number of groups?

I know this is not an easy problem. I have looked at principal components (princomp, prcomp) because there is a connection with cluster analysis. It is not obvious to me how to program that connection though.

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek
Christian Hennig
2009-Jun-11 16:41 UTC
[R] Cluster analysis, defining center seeds or number of clusters
Dear Alex,

actually, fixing the number of clusters in kmeans and then ending up with a smaller number because of empty clusters is not a standard method of estimating the number of clusters. It may happen (as apparently in some of your examples), but it is generally rather unusual. In most cases kmeans, as well as clara, pam and other clustering methods, gives you exactly the number of clusters you ask for. Even with some reasonable separation between clusters, kmeans cannot generally be expected to come up with empty clusters if the number is initially chosen too high or too many initial centers are specified.

The help page for pam.object in library cluster shows you a method to estimate the optimal number of clusters based on pam. However, this problem strongly depends on what cluster concept you have in mind and what you want to use your clusters for. There are alternative indexes that could be optimised to find the best number of clusters. Some of them are implemented in the function cluster.stats in package fpc. I strongly advise reading some literature about this to understand the problem better; the help page of cluster.stats gives a few references.

Fitting Gaussian mixtures and comparing them by the BIC gives you an estimate of the number of clusters, see package mclust.

If you can specify things like maximum within-cluster distances, you may get something from using cutree together with a hierarchical clustering method in hclust, for example complete linkage.

dbscan and fixmahal in package fpc are further alternatives, requiring one or two tuning constants to come up with the number of clusters automatically.
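A minimal sketch of the average-silhouette-width idea from the pam.object help page, adapted to clara because pam is impractical for 10000 observations. The simulated data standing in for LlowA and LhghA, the candidate range of k, and the samples/sampsize settings are all assumptions made for illustration, not values from the original post.

```r
library(cluster)  # recommended package, shipped with standard R installations

set.seed(1)
# Simulated stand-in for the real data: two diffuse groups (assumption).
n <- 500
CyclA <- data.frame(
  LlowA = c(rnorm(n, 0.90, 0.03), rnorm(n, 0.65, 0.03)),
  LhghA = c(rnorm(n, 0.80, 0.03), rnorm(n, 0.65, 0.03))
)

# Average silhouette width for each candidate number of clusters.
# clara works on subsamples, so it stays feasible for large n.
ks  <- 2:6
asw <- sapply(ks, function(k)
  clara(CyclA, k, samples = 20, sampsize = 200)$silinfo$avg.width)
print(data.frame(k = ks, asw = round(asw, 3)))

# Pick the k with the largest average silhouette width.
k.best <- ks[which.max(asw)]

# Use the medoids of that solution as starting centers for kmeans.
fit   <- clara(CyclA, k.best, samples = 20, sampsize = 200)
ClstA <- kmeans(CyclA, centers = fit$medoids, algorithm = "MacQueen")
```

Note that the average silhouette width is only defined for k >= 2, so this cannot detect the case of a single group; that case would need a separate check. The clara medoids are used here simply as data-driven seeds in place of a hand-specified centers matrix.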
Best regards,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche