Hello!

I applied kmeans to my data:

    kcluster = kmeans(mydata, 4, iter.max=10)
    table(code, kcluster$cluster)

If I run this code again, I get a different result from the first trial. (I understand that this is expected, since kmeans starts by assigning the clusters randomly, so the outcomes can differ.) But is there a way to stabilize the clustering, meaning finding the one solution that appears most often in 10 trials?

Thank you for any ideas,
Julia
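The 10-trial idea in the question could be sketched roughly as follows. This is only an illustration under assumptions: `mydata` is replaced by synthetic data (the original data are not shown), and `canonical_labels` is a hypothetical helper that renumbers clusters by order of first appearance, so that two runs finding the same partition under different label numbers count as the same result. The last line shows the `nstart` argument of `kmeans()`, which keeps the best of several random starts rather than the most frequent one.

```r
# Synthetic stand-in for `mydata` (an assumption; the original data are not shown).
set.seed(1)
mydata <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 3)),
  cbind(rnorm(50, mean = 3), rnorm(50, mean = 0))
)

# Hypothetical helper: relabel clusters by order of first appearance, so that
# identical partitions with permuted label numbers compare as equal.
canonical_labels <- function(cl) match(cl, unique(cl))

# Run kmeans 10 times; each column of `runs` holds one cluster assignment.
runs <- replicate(10, kmeans(mydata, centers = 4, iter.max = 10)$cluster)

# Turn each relabelled partition into a string key and count how often it occurs.
keys <- apply(runs, 2, function(cl) paste(canonical_labels(cl), collapse = ""))
counts <- table(keys)

# Pick the partition that appeared most often across the 10 trials.
best_run <- match(names(counts)[which.max(counts)], keys)
stable_cluster <- canonical_labels(runs[, best_run])

# Alternative built into kmeans(): try 10 random starts and keep the solution
# with the smallest total within-cluster sum of squares.
kcluster <- kmeans(mydata, centers = 4, iter.max = 10, nstart = 10)
```

If no partition occurs more than once in `counts`, that in itself suggests the solution is not stable for the chosen number of clusters.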
Hi there,

If the final predicted clusters vary with the random starting clusters, then I suspect that your data are not clustering very well! A few possible reasons:

1) There are genuinely no clusters in the data.
2) You have chosen a poor distance measure.
3) You have picked an inappropriate number of clusters.

The basic criterion for a good clustering is that the variance within clusters is small and the variance between clusters is large.

When I start to look for clusters, I often use multidimensional scaling to look at the data in 2D; see help(cmdscale). If after this you wish to proceed, I suggest you look at library(cluster). The function silhouette is a nice tool for assessing the appropriate number of clusters.

Regards,
Wayne

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Julia Kröpfl
Sent: 25 September 2007 10:01
To: R-help at r-project.org
Subject: [R] finding a stable cluster for kmeans

______________________________________________
R-help at r-project.org mailing list
stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
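A minimal sketch of the two checks suggested above, cmdscale() for a 2D view and silhouette() for choosing the number of clusters, might look like this. The data are synthetic (an assumption, since the original `mydata` is not shown), Euclidean distance is assumed, and the range of k values tried is arbitrary.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in for `mydata` (an assumption; the original data are not shown).
set.seed(1)
mydata <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 5))
)
d <- dist(mydata)  # Euclidean distances (an assumed choice of distance measure)

# Classical multidimensional scaling down to two dimensions for a visual check.
coords <- cmdscale(d, k = 2)
if (interactive()) plot(coords, xlab = "Coordinate 1", ylab = "Coordinate 2")

# Average silhouette width for several candidate numbers of clusters.
ks <- 2:6
avg_width <- sapply(ks, function(k) {
  cl <- kmeans(mydata, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# The candidate k with the largest average silhouette width.
best_k <- ks[which.max(avg_width)]
```

The within/between variance criterion can also be read directly from a kmeans() fit: the returned object contains `tot.withinss` and `betweenss`.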
You might want to check whether there is a neural gas algorithm in R. kmeans generally has high variance since it depends strongly on the initialization. Neural gas overcomes this problem by using a ranked list of neighbouring data points instead of using data points directly. It is more stable (at the cost of additional computation time).
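For readers unfamiliar with neural gas, here is a rough base-R sketch of the online algorithm as it is commonly formulated: for each data point the prototypes (cluster centres) are ranked by distance, and every prototype is moved toward the point with a weight that decays with its rank. The function name `neural_gas`, its arguments, and all default values are illustrative assumptions, not taken from any package.

```r
# Minimal online neural gas sketch. Names and defaults are illustrative
# assumptions, not a package API.
neural_gas <- function(x, k, n_epochs = 20,
                       eps = c(0.5, 0.01), lambda = c(k / 2, 0.01)) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Initialise the prototypes at k randomly chosen data points.
  w <- x[sample(n, k), , drop = FALSE]
  t_max <- n_epochs * n
  t <- 0
  for (epoch in seq_len(n_epochs)) {
    for (i in sample(n)) {
      frac <- t / t_max
      # Anneal step size and neighbourhood range from initial to final value.
      eps_t <- eps[1] * (eps[2] / eps[1])^frac
      lambda_t <- lambda[1] * (lambda[2] / lambda[1])^frac
      # Difference between the current point and every prototype (k x p matrix).
      diff <- matrix(x[i, ], nrow = k, ncol = ncol(x), byrow = TRUE) - w
      # Rank prototypes by squared distance to the point (0 = closest).
      r <- rank(rowSums(diff^2), ties.method = "first") - 1
      # Move every prototype toward the point, weighted by exp(-rank / lambda).
      w <- w + eps_t * exp(-r / lambda_t) * diff
      t <- t + 1
    }
  }
  # Assign each data point to its nearest prototype.
  d_all <- as.matrix(dist(rbind(w, x)))[-(1:k), 1:k, drop = FALSE]
  list(centers = w, cluster = unname(apply(d_all, 1, which.min)))
}

# Usage on synthetic data (an assumption; the original data are not shown).
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 5), ncol = 2))
fit <- neural_gas(x, k = 2)
table(fit$cluster)
```

Because every prototype is updated for every data point, a poorly placed initial prototype still gets pulled toward the data, which is one plausible reason for the lower sensitivity to initialization; the cost is more work per update than in kmeans.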
>>>>> On Tue, 25 Sep 2007 20:16:05 -0400,
>>>>> Wiebke Timm (WT) wrote:

  > You might want to check if there is a neural gas algorithm in R.
  > kmeans generally has a high variance since it is very dependent on
  > the initialization. Neural gas overcomes this problem by using a
  > ranked list of neighbouring data points instead of using data points
  > directly. It is more stable (at the cost of additional computational
  > time).

Neural gas is in package flexclust on CRAN (it is one of the clustering methods that the function cclust() provides). I also find it more stable than kmeans for some data, although in general I agree with what has been said earlier in this thread: instability is in most cases caused by the data having no clear cluster structure, a wrong number of clusters, etc., rather than by the wrong clustering algorithm.

Best,

-- 
-----------------------------------------------------------------------
Prof. Dr. Friedrich Leisch

Institut für Statistik                      Tel: (+49 89) 2180 3165
Ludwig-Maximilians-Universität              Fax: (+49 89) 2180 5308
Ludwigstraße 33
D-80539 München                             stat.uni-muenchen.de/~leisch
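A short sketch of calling neural gas through cclust() might look like the following. It assumes flexclust is installed from CRAN and uses synthetic data in place of the original `mydata`, which is not shown; the number of clusters is an arbitrary choice for the example.

```r
# Synthetic stand-in for the data (an assumption; the original data are not shown).
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 5), ncol = 2))

if (!requireNamespace("flexclust", quietly = TRUE)) {
  message("flexclust is not installed; try install.packages('flexclust')")
} else {
  library(flexclust)
  # Neural gas is one of the methods offered by cclust().
  ng_fit <- cclust(x, k = 2, method = "neuralgas")
  # Cluster membership of each observation.
  ng_cluster <- clusters(ng_fit)
  print(table(ng_cluster))
}
```

Rerunning this with different seeds and comparing the resulting tables is a quick way to see whether neural gas is actually more stable than kmeans on a given data set.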