Andrew McSweeny
2006-Apr-19 22:37 UTC
[R] determining optimal # of clusters for a given dataset (e.g. between 2 and K)
Hi:

I'm clustering a microarray dataset with a large # of samples. I would like your opinion on the best way to automatically determine the optimal # of clusters. Currently I am using the "cluster" package, clustering with "clara" and examining the average silhouette width at various numbers of clusters. I'd like opinions on whether any newer packages offer better determination of the optimal # of clusters, considering that the algorithms in "cluster" were developed decades ago. By the way, I have a lot of missing values in my dataset, coded as "NA", so some software packages don't work.

Here is the code I've been using:

  library(cluster)

  # kseq holds the candidate numbers of clusters (defined elsewhere)
  avgsil <- c()

  for (k in kseq) {
    clarares <- clara(data, k, rngR = TRUE)
    savg <- clarares$silinfo$avg.width   # average silhouette width for this k
    print(c(k, savg))
    avgsil[k] <- savg
  }

  # plot average silhouette width against the number of clusters
  k <- kseq
  plot(k, avgsil[k])
  lines(k, avgsil[k])

Sincerely,

Andrew McSweeny
grad student
Medical University of Ohio
Andrej Kastrin
2006-Apr-20 05:30 UTC
[R] determining optimal # of clusters for a given dataset (e.g. between 2 and K)
andrew mcsweeny wrote:
> I would like your opinion on the best way to automatically determine the optimal # of clusters.

Following Fraley et al., I suggest using the Bayesian information criterion (BIC). You can find it in the mclust package.

HTH, Andrej
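For readers who want to try the BIC suggestion, here is a minimal sketch of how it might look with mclust. The synthetic matrix `x` and the candidate range `G = 2:9` are assumptions made for illustration and do not come from the original post. Rows containing NA are simply dropped here, which is a simplification that may not suit a dataset with many missing values.

```r
library(mclust)

set.seed(1)

# Synthetic stand-in for the real data (assumption): three groups in 2-D
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Keep only complete rows before fitting (a simplification, see lead-in)
x <- na.omit(x)

# Fit Gaussian mixture models with 2 to 9 components;
# Mclust keeps the model and component count with the best BIC
fit <- Mclust(x, G = 2:9)

fit$G            # selected number of clusters
fit$modelName    # selected covariance structure

# BIC curves for every model type across the candidate cluster counts
plot(fit, what = "BIC")
```

Note that mclust reports BIC on a scale where larger values are preferred, so the selected model is the one at the top of the BIC plot. Unlike the silhouette approach with clara, this assumes the clusters are reasonably described by Gaussian components.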