Guera
2008-Mar-08 23:01 UTC
[R] Elbow criterion plots for determining k in hierarchical clustering
Hi there,

I'm running cluster analyses on a large data set using hclust with Ward's method and Manhattan (city-block) distances. I've created dendrograms to illustrate the clustering, but I would also like a plot for applying the classic elbow criterion to choose the best number of clusters. Ideally I'd like to plot percent variance explained (y axis) against number of clusters (x axis). Is there a way to do this in base R or the cluster package that I'm missing?

As an alternative I've attempted to write a function for the purpose, but I can't find a way to determine the within-group variance for each cluster and the total variance (both needed to compute variance explained).

I'm new to R in the last month or so and greatly appreciate any advice you can give me. My code for a subset of the data is below (with k = 4 as an example).

Thanks in advance,
Becky

> HClf_dn <- hclust(dist(model.matrix(~-1 + f_dn1 + f_dn2 + f_dn3 + f_dn4,
+     CwdDbh), method = "manhattan"), method = "ward")
> plot(HClf_dn, main = "Cluster Dendrogram for Solution HClf_dn",
+     xlab = "Observation Number in Data Set CwdDbh",
+     sub = "Method=ward; Distance=city-block")
> summary(as.factor(cutree(HClf_dn, k = 4)))  # Cluster Sizes
> by(model.matrix(~-1 + f_dn1 + f_dn2 + f_dn3 + f_dn4, CwdDbh),
+     as.factor(cutree(HClf_dn, k = 4)), colMeans)  # Cluster Centroids
> biplot(princomp(model.matrix(~-1 + f_dn1 + f_dn2 + f_dn3 + f_dn4,
+     CwdDbh)), xlabs = as.character(cutree(HClf_dn, k = 4)))

-----
Rebecca Jeppesen, MSc Candidate
Acadia University
Wolfville, N.S. Canada

--
View this message in context: http://www.nabble.com/Elbow-criterion-plots-for-determining-k-in-hierarchical-clustering-tp15921695p15921695.html
Sent from the R help mailing list archive at Nabble.com.
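The quantity asked about above (percent variance explained against k) can be sketched in base R by cutting the tree at each k and comparing the within-cluster sum of squares to the total sum of squares. This is only a rough illustration under stated assumptions: the iris measurements stand in for the poster's model matrix, and the helper names ss and pct_explained are hypothetical, not from any package. Note also that sum-of-squares variance is a Euclidean notion, so with Manhattan distances it is a descriptive summary rather than the criterion the clustering itself optimizes.

```r
# Hypothetical stand-in data: the numeric iris columns replace the poster's
# model.matrix(~ -1 + f_dn1 + f_dn2 + f_dn3 + f_dn4, CwdDbh).
x <- as.matrix(iris[, 1:4])

# Same distance as in the post; "ward.D" is the name current R uses for
# what hclust() called "ward" in 2008.
hc <- hclust(dist(x, method = "manhattan"), method = "ward.D")

# Sum of squared deviations of every row from the column means.
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# Percent of the total sum of squares explained by a k-cluster cut.
pct_explained <- function(x, hc, k) {
  groups <- cutree(hc, k = k)
  within <- sum(sapply(split(as.data.frame(x), groups),
                       function(g) ss(as.matrix(g))))
  100 * (1 - within / ss(x))
}

ks  <- 1:10
pve <- sapply(ks, function(k) pct_explained(x, hc, k))

plot(ks, pve, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Percent variance explained")
```

The elbow is then read off as the k beyond which the curve flattens; as with any elbow plot, that reading remains a judgment call.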
Guera
2008-Mar-14 17:08 UTC
[R] Elbow criterion plots for determining k in hierarchical clustering
Re: "... (I) would like to create a plot to examine for the classic elbow criterion to use in determining the best number of clusters. Ideally I'd like to plot percent variance explained (y axis) against number of clusters (x axis).... Is there a way to do this in R...?"

I found a way to produce an elbow criterion plot, using height as a measure of dissimilarity. From the dendrogram I determined the difference in height between the two most similar clusters at k = x and plotted this (y) against k (x). It does produce an elbow in the plot, which narrows the choice down considerably, but it is still subject to interpretation. I chose k based on:

1. the location of the elbow on the plot
2. cluster size (e.g. if I had it narrowed down to 4 or 5, and going to k = 5 produced clusters of only 1 or 2 observations that weren't there at k = 4, I'd use 4)
3. the height of the "tallest" cluster at k = x
4. the eigenvalues from a PCA at k = x

I thought I should reply to my own post since I noticed that the other postings on similar topics also weren't replied to, and thought this could possibly help others down the road.

-----
Rebecca Jeppesen, MSc Candidate
Acadia University
Wolfville, N.S. Canada
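The height-based plot described in this reply might look roughly like the sketch below. It rests on one reading of the description: for each k, it plots the dendrogram height at which the two most similar clusters of the k-cluster solution are merged. The iris measurements are a hypothetical stand-in for the original data, and the variable names are illustrative only.

```r
# Hypothetical stand-in data in place of the poster's model matrix.
x  <- as.matrix(iris[, 1:4])
hc <- hclust(dist(x, method = "manhattan"), method = "ward.D")

ks <- 2:10

# hc$height holds one height per merge. Sorted from largest to smallest,
# entry k - 1 is the height at which the k-cluster solution collapses
# to k - 1 clusters, i.e. where its two most similar clusters are joined.
merge_height <- sort(hc$height, decreasing = TRUE)[ks - 1]

plot(ks, merge_height, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Merge height of the two most similar clusters")
```

If "difference in height" was meant as the drop between successive merges instead, plotting -diff(sort(hc$height, decreasing = TRUE)) against k would be the corresponding variant.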