thr3ads.net - R help - [R] Help needed to clarify hclust and cutree algorithms [Sep 2009]

Dana Sevak
2009-Sep-21 06:04 UTC
[R] Help needed to clarify hclust and cutree algorithms

Dear R Helpers,

I have carefully read the documentation and all postings on the hclust and
cutree functions. However, some aspects of the tree ordering and cluster
assignment performed by these functions remain unclear to me, so I would very
much appreciate your help in making sure I get them right.

Here is an example, with values chosen to illustrate the problems.

I have a set of five profiles (variables V1 to V5), each consisting of
measurements taken in four different conditions (c1 to c4).

df = data.frame(rbind(c(-32, -52, -46, -35, -35),
                      c(-86, -111, -101, -96, -105),
                      c(17, 42, 36, 34, 37),
                      c(24, 37, 28, 29, 30)))

colnames(df) = c("V1", "V2", "V3", "V4", "V5")
rownames(df) = c("c1", "c2", "c3", "c4")
> df
    V1   V2   V3  V4   V5
c1 -32  -52  -46 -35  -35
c2 -86 -111 -101 -96 -105
c3  17   42   36  34   37
c4  24   37   28  29   30

plot(df[,1], type="l", ylim=range(df))
points(df[1,1], type="p", pch=49)
for (i in 2:5) {
   points(df[,i], type="l", col=colors()[15*i])
   points(df[1,i], type="p", pch=48+i)
}

The task is to determine how correlated these profiles are and to partition
them into two groups using hierarchical clustering.  Importantly, I need to
output the order in which the variables occur in these clusters, from left to
right in decreasing order of their correlation.  Because of this, the number
assigned to each cluster (1 or 2) and the order in which the variables are
listed within the clusters become very important.

For this I used the hclust and cutree functions:

cor.df =  cor(df, method="pearson")
dist.df = as.dist(1-cor.df)

hc.df = hclust(dist.df, method="complete")
hc.df.cl = cutree(hc.df, k=2)
> str(hc.df)
List of 7
 $ merge      : int [1:4, 1:2] -4 -2 -1 2 -5 -3 1 3
 $ height     : num [1:4] 0.00043 0.00048 0.004916 0.010176
 $ order      : int [1:5] 2 3 1 4 5
 $ labels     : chr [1:5] "V1" "V2" "V3" "V4" ...
 $ method     : chr "complete"
 $ call       : language hclust(d = dist.df, method = "complete")
 $ dist.method: NULL
 - attr(*, "class")= chr "hclust"
> hc.df.cl
V1 V2 V3 V4 V5 
 1  2  2  1  1 
> round(dist.df*1000, 2)
      V1    V2    V3    V4
V2 10.18                  
V3 10.11  0.48            
V4  4.42  3.74  2.27      
V5  4.92  6.61  4.33  0.43

plot(hc.df)

My questions are:

1.  Can I assume that plot(hc.df) and hc.df$order indicate that the order of
merging was:

V2 V3 V1 V4 V5 ?

This does not seem to be supported by the distance matrix which shows that the
closest pair to begin with is V4-V5.

Also the element closest to V2 or V3 is V4, and not V1.

The hclust help states that    
     In hierarchical cluster displays, a decision is needed at each
     merge to specify which subtree should go on the left and which on
     the right. Since, for n observations there are n-1 merges, there
     are 2^{(n-1)} possible orderings for the leaves in a cluster tree,
     or dendrogram. The algorithm used in 'hclust' is to order the
     subtree so that the tighter cluster is on the left (the last,
     i.e., most recent, merge of the left subtree is at a lower value
     than the last merge of the right subtree). Single observations are
     the tightest clusters possible, and merges involving two
     observations place them in order by their observation sequence
     number.
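One way to see the merge sequence itself, rather than the plotting order, is to walk through the $merge and $height components of the hclust object. The sketch below rebuilds the example and prints one line per merge. It assumes the convention described in the hclust help: a negative entry in $merge is a single observation and a positive entry refers to the cluster formed at that earlier row. describe_merges is a hypothetical helper written for this illustration, not part of R.

```r
# Rebuild the example from the post so this block runs on its own.
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
dist.df <- as.dist(1 - cor(df, method = "pearson"))
hc.df <- hclust(dist.df, method = "complete")

# Hypothetical helper: turn each row of hc$merge into a readable line.
describe_merges <- function(hc) {
  n.steps <- nrow(hc$merge)
  members <- vector("list", n.steps)   # labels contained in each new cluster
  lines <- character(n.steps)
  for (i in seq_len(n.steps)) {
    sides <- lapply(hc$merge[i, ], function(j) {
      # negative: a single observation; positive: cluster built at step j
      if (j < 0) hc$labels[-j] else members[[j]]
    })
    members[[i]] <- unlist(sides)
    lines[i] <- sprintf("step %d: {%s} + {%s} at height %.6f", i,
                        paste(sides[[1]], collapse = ","),
                        paste(sides[[2]], collapse = ","),
                        hc$height[i])
  }
  lines
}

steps <- describe_merges(hc.df)
cat(steps, sep = "\n")
```

Read against the $merge values printed in the post, the sequence is V4 with V5, then V2 with V3, then V1 joining {V4, V5}, and finally the two remaining clusters. The $order component (2 3 1 4 5) is documented as a permutation of the observations suitable for plotting, so it describes the left-to-right layout of the leaves and not the order in which merges happened.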

In this light shall I look at the plot and $order as a flipped version of 

V1 V4 V5 V3 V2  ?

Could somebody kindly show, step by step, how the merges are done?

2. When cutree cuts the tree into two clusters, which number does it assign to
the cluster in which the profiles are most correlated?  Is the numbering simply
from right to left of the tree as it appears in hc.df$order?

3. If I take into account only the hc.df$order slot and the cluster number
assigned by cutree
> hc.df$order
[1] 2 3 1 4 5
> hc.df.cl
V1 V2 V3 V4 V5 
 1  2  2  1  1 

can I infer that the order of variables from left to right in decreasing order
of correlation between profiles is:

variable V2 V3 V1 V4 V5
cluster  1  1  1  2  2

Is this correct?  It does not seem to be supported by the actual distance
matrix. Even in reverse and with the cluster numbers flipped, the immediate
neighbor of V4 should be V3 and not V1.
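The cluster ids and the plotting order can be lined up directly by indexing the cutree result with $order, which avoids guessing how the two relate. The sketch below does only that; it rebuilds the example so that it runs on its own.

```r
# Rebuild the example from the post so this block runs on its own.
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "complete")
hc.df.cl <- cutree(hc.df, k = 2)

# Cluster ids in data order (V1..V5), exactly as cutree returns them.
print(hc.df.cl)

# The same ids rearranged into the left-to-right leaf order of the plot.
in.plot.order <- hc.df.cl[hc.df$order]
print(in.plot.order)
```

With the values printed in the post, the leaves V2 V3 V1 V4 V5 carry the ids 2 2 1 1 1, not 1 1 1 2 2. This is consistent with cutree numbering clusters in the order in which observations appear in the data, so that the first observation (V1 here) always falls in cluster 1. The ids therefore appear to say nothing about tree position or about which cluster is tighter.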

4. Most importantly, how could I use the results of these functions to output
the following:

   A. The two clusters, labeled such that cluster 1 contains the pair of
profiles with smallest distance from each other.

   B. The order of variables in decreasing order of correlation (increasing
value of distance).  In this way the variable listed after the last entry in
cluster 1 will be the closest in distance to the members of cluster 1.

   Can I use only the results of these functions (and how), or do I need to do
other data manipulation (and if so, what exactly) to make sure the output
complies with the requirements above?

Thank you very much for your help in clarifying these issues.
> sessionInfo()
R version 2.8.1 (2008-12-22) 
i486-pc-linux-gnu 

With best regards,
Dana Sevak