james.foadi at diamond.ac.uk
2011-Dec-02 15:03 UTC
[R] what is used as height in hclust for ward linkage?
Dear R community,

I am trying to understand how the Ward linkage works from a quantitative point of view. To test it I have devised a simple three-member set:

G = c(0,2,10)

The distances between all pairs are:

d(0,2) = 2
d(0,10) = 10
d(2,10) = 8

The smallest distance corresponds to merging 0 and 2. The corresponding ESS are:

ESS(0,2) = 2*var(c(0,2)) = 4
ESS(0,10) = 2*var(c(0,10)) = 100
ESS(2,10) = 2*var(c(2,10)) = 64

and, indeed, the smallest ESS corresponds to merging 0 and 2. The next element that should be added to 0 and 2 is obviously 10. This is where I don't understand how the hclust algorithm in R works. We have

> G <- c(0,2,10)
> G.dist <- dist(G)
> G.hc <- hclust(G.dist,method="ward")
> G.hc$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3    1
> G.hc$height
[1]  2.00000 11.33333

Now, according to standard definitions, the distance between two clusters Rr and Rs, with Nr and Ns elements respectively, is:

d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns))*||<Rs> - <Rr>||

where < > indicates averages (centroids). If I carry out this operation to merge cluster c(0,2), whose centroid is 1, with the single point 10, I get:

d(c(0,2),10) = sqrt(2*2*1/(2+1))*|1-10| = 10.39230

This is different from 11.33333 in the R output. Does anyone know what exact value is used for the Ward linkage, as displayed in the hclust height output?

Thanks in advance for any help!

J
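
For anyone wanting to check the numbers, here is a minimal R sketch of the comparison. It rests on a few assumptions that are not in the post: a recent R (3.1.0 or later), where the old "ward" method is called "ward.D" and a second variant "ward.D2" exists; the centroid of the single point 10 taken as 10; and the Lance-Williams update for Ward applied directly to the unsquared distances, which is offered as a plausible explanation of the 11.33333 rather than a statement about hclust internals. The variable names d.formula and d.lw are illustrative only.

```r
# Data from the post
G <- c(0, 2, 10)
G.dist <- dist(G)

# Centroid formula from the post, merging cluster {0, 2} with the point 10
Nr <- 2                        # size of cluster {0, 2}
Ns <- 1                        # size of cluster {10}
centroid.r <- mean(c(0, 2))    # centroid of {0, 2}
centroid.s <- 10               # centroid of the singleton {10}
d.formula <- sqrt(2 * Nr * Ns / (Nr + Ns)) * abs(centroid.r - centroid.s)

# Assumption: Lance-Williams update for Ward, applied to the plain
# (unsquared) distances d(10,0) = 10, d(10,2) = 8, d(0,2) = 2,
# with all three original clusters of size 1
d.lw <- ((1 + 1) * 10 + (1 + 1) * 8 - 1 * 2) / (1 + 1 + 1)

# Heights reported by the two Ward variants in recent R
h.D  <- hclust(G.dist, method = "ward.D")$height
h.D2 <- hclust(G.dist, method = "ward.D2")$height

print(c(formula = d.formula, lance.williams = d.lw))
print(rbind(ward.D = h.D, ward.D2 = h.D2))
```

If these assumptions hold, the second height under "ward.D" should agree with d.lw (34/3, i.e. 11.33333), while the second height under "ward.D2" should agree with the centroid formula (sqrt(108), about 10.3923). That would suggest the discrepancy comes from which quantity is updated, the distances or the squared distances, and not from the formula itself.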