thr3ads.net - R help - [R] determining optimal # of clusters for a given dataset (e.g. between 2 and K) [Apr 2006]


andrew mcsweeny

2006-Apr-19 22:37 UTC

[R] determining optimal # of clusters for a given dataset (e.g. between 2 and K)

Hi:

     I'm clustering a microarray dataset with a large number of samples, and
I would like your opinion on the best way to automatically determine the
optimal number of clusters.  Currently I am using the "cluster" package:
I cluster with "clara" and examine the average silhouette width over a range
of cluster counts.  Do any newer packages offer a better way to determine the
optimal number of clusters, considering that the algorithms in "cluster" were
developed decades ago?  By the way, I have a lot of missing values in my
dataset, coded as "NA", so some software packages don't work.

     Here is the code I've been using:
   
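  library(cluster)

  # `data` is the numeric matrix or data frame to cluster (rows = observations)
  kseq <- 2:10   # candidate numbers of clusters; example range, adjust as needed
  avgsil <- rep(NA_real_, max(kseq))

  for (k in kseq) {
    clarares <- clara(data, k, rngR = TRUE)   # rngR = TRUE: use R's random number generator
    savg <- clarares$silinfo$avg.width        # average silhouette width for this k
    print(c(k, savg))
    avgsil[k] <- savg
  }

  # Average silhouette width against the number of clusters (points and lines)
  plot(kseq, avgsil[kseq], type = "b", xlab = "k", ylab = "average silhouette width")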
   
  Sincerely,
   
  Andrew McSweeny
  grad student
  Medical University of Ohio


Andrej Kastrin

2006-Apr-20 05:30 UTC


Following Fraley et al., I suggest using the Bayesian information criterion
(BIC) to choose the number of clusters. You can find it in the mclust package.

HTH, Andrej
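
For readers who want to try the BIC suggestion, here is a minimal sketch, not a definitive recipe. It assumes the mclust package is installed and uses simulated, complete data as a stand-in for the real microarray matrix; the range 2:6 for the number of components is also just an example. As far as I know, Mclust expects data without missing values, so NA entries would need to be removed or imputed first.

```r
library(mclust)

set.seed(1)
# Simulated stand-in data (an assumption, not the poster's dataset):
# two well-separated groups of 50 observations in 2 dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# Fit Gaussian mixture models with 2 to 6 components;
# Mclust keeps the model and component count with the best BIC
fit <- Mclust(x, G = 2:6)

fit$G          # number of clusters selected by BIC
fit$modelName  # covariance structure of the selected model

# BIC curves for every candidate model and number of components
plot(fit, what = "BIC")
```

Unlike the silhouette approach, this compares model-based (Gaussian mixture) fits, so the chosen number of clusters can differ from the one with the highest average silhouette width.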
