james.foadi at diamond.ac.uk
2011-Dec-02  15:03 UTC
[R] what is used as height in hclust for ward linkage?
Dear R community,
I am trying to understand how Ward linkage works from a quantitative point
of view.
To test it I have devised a simple three-element set:
                           G = c(0,2,10)
The distances between all pairs are:
d(0,2)  =  2
d(0,10) = 10
d(2,10) =  8
The smallest distance corresponds to merging 0 and 2. The corresponding ESS are:
ESS(0,2) = 2*var(c(0,2)) = 4
ESS(0,10) = 2*var(c(0,10)) = 100
ESS(2,10) = 2*var(c(2,10)) = 64
and, indeed, the smallest ESS corresponds to merging 0 and 2. The next element
that should be added to 0 and 2 is obviously 10. This is where I don't
understand how the hclust algorithm in R works. We have
> G <- c(0,2,10)
> G.dist <- dist(G)
> G.hc <- hclust(G.dist,method="ward")
> G.hc$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3    1
> G.hc$height
[1]  2.00000 11.33333
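As a sanity check, the pairwise quantities listed earlier can be reproduced directly in R. This is only a small sketch: the names pair_dist and pair_ess are my own, and "ESS" here means the quantity 2*var() used above, which for a pair of points is just the squared distance.

```r
G <- c(0, 2, 10)

# Pairwise distances, in the order d(0,2), d(0,10), d(2,10)
pair_dist <- as.vector(dist(G))                   # 2 10 8

# The "ESS" of each pair as defined above: 2 * var(pair)
pair_ess <- combn(G, 2, function(p) 2 * var(p))   # 4 100 64

# For two points, 2 * var() equals the squared distance
isTRUE(all.equal(pair_ess, pair_dist^2))          # TRUE
```

Note that the first height reported by hclust (2.00000) equals the plain distance d(0,2), not the value 4 obtained from 2*var(c(0,2)).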
Now, according to the standard definition, the distance between two clusters Rr
and Rs, containing Nr and Ns elements respectively, is:
                          d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns))*||<Rs> - <Rr>||
where < > indicates an average (the cluster centroid). If I apply this formula
to merging cluster c(0,2) with 10, I get:
                          d(c(0,2),10) = sqrt(2*2*1/(2+1))*|1-10| = 10.3923
This is different from 11.3333 in the R output.
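To make the comparison concrete, here is a sketch of the two calculations side by side. The function names ward_centroid and lw_ward are hypothetical helpers of mine, not part of hclust. The second one is the Lance-Williams recurrence for Ward's method; my assumption (which I have not confirmed from the hclust source) is that hclust applies it to the dissimilarities exactly as supplied, rather than to their squares.

```r
# Centroid form of the Ward distance, as written above
ward_centroid <- function(r, s) {
  nr <- length(r)
  ns <- length(s)
  sqrt(2 * nr * ns / (nr + ns)) * abs(mean(r) - mean(s))
}

# Lance-Williams update for Ward: dissimilarity between cluster k and the
# merged cluster (i, j), given the three pairwise dissimilarities and sizes
lw_ward <- function(d_ik, d_jk, d_ij, n_i, n_j, n_k) {
  ((n_i + n_k) * d_ik + (n_j + n_k) * d_jk - n_k * d_ij) / (n_i + n_j + n_k)
}

d_centroid <- ward_centroid(c(0, 2), 10)

# Assumption: update applied to the raw distances d(0,10)=10, d(2,10)=8, d(0,2)=2
d_lw_raw <- lw_ward(10, 8, 2, 1, 1, 1)            # 34/3 = 11.33333

# Same update applied to squared distances, then square-rooted
d_lw_sq <- sqrt(lw_ward(10^2, 8^2, 2^2, 1, 1, 1))

isTRUE(all.equal(d_lw_sq, d_centroid))            # TRUE
```

Under that assumption the raw-distance update gives 34/3, which matches the second height in the output above, while the squared-distance version agrees with the centroid formula.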
Does anyone know exactly which quantity hclust reports as the height for
Ward linkage?
Thanks in advance for any help!
J
