Dear R Helpers,
I have carefully read the documentation and all the postings on the hclust and
cutree functions; however, some aspects of the tree ordering and cluster
assignment performed by these functions remain unclear to me, so I would very
much appreciate your help in making sure I understand them correctly.
Here is an example, with values chosen to illustrate the problems.
I have a set of five profiles, one for each of five variables (V1 to V5), each
measured in four different conditions (c1 to c4).
df = data.frame(rbind(c(-32, -52, -46, -35, -35),
                      c(-86, -111, -101, -96, -105),
                      c(17, 42, 36, 34, 37),
                      c(24, 37, 28, 29, 30)))
colnames(df) = c("V1", "V2", "V3", "V4", "V5")
rownames(df) = c("c1", "c2", "c3", "c4")
> df
    V1   V2   V3  V4   V5
c1 -32  -52  -46 -35  -35
c2 -86 -111 -101 -96 -105
c3  17   42   36  34   37
c4  24   37   28  29   30
plot(df[,1], type="l", ylim=range(df))
points(df[1,1], type="p", pch=49)
for (i in 2:5) {
  points(df[,i], type="l", col=colors()[15*i])
  points(df[1,i], type="p", pch=48+i)
}
The task is to determine how correlated these profiles are and to partition
them into two groups using hierarchical clustering. Importantly, I need to
output the order in which the variables occur in these clusters, from left to
right in decreasing order of their correlation. Because of this, the number
assigned to each cluster (1 or 2) and the order in which the variables are
listed within the clusters become very important.
For this I used the hclust and cutree functions:
cor.df = cor(df, method="pearson")
dist.df = as.dist(1-cor.df)
hc.df = hclust(dist.df, method="complete")
hc.df.cl = cutree(hc.df, k=2)
> str(hc.df)
List of 7
$ merge : int [1:4, 1:2] -4 -2 -1 2 -5 -3 1 3
$ height : num [1:4] 0.00043 0.00048 0.004916 0.010176
$ order : int [1:5] 2 3 1 4 5
$ labels : chr [1:5] "V1" "V2" "V3" "V4" ...
$ method : chr "complete"
$ call : language hclust(d = dist.df, method = "complete")
$ dist.method: NULL
- attr(*, "class")= chr "hclust"
> hc.df.cl
V1 V2 V3 V4 V5
 1  2  2  1  1
> round(dist.df*1000, 2)
      V1    V2    V3    V4
V2 10.18
V3 10.11  0.48
V4  4.42  3.74  2.27
V5  4.92  6.61  4.33  0.43
plot(hc.df)
My questions are:
1. Can I assume that plot(hc.df) and hc.df$order indicate that the order of
merging was:
V2 V3 V1 V4 V5 ?
This does not seem to be supported by the distance matrix which shows that the
closest pair to begin with is V4-V5.
Also the element closest to V2 or V3 is V4, and not V1.
The hclust help states that
In hierarchical cluster displays, a decision is needed at each
merge to specify which subtree should go on the left and which on
the right. Since, for n observations there are n-1 merges, there
are 2^{(n-1)} possible orderings for the leaves in a cluster tree,
or dendrogram. The algorithm used in 'hclust' is to order the
subtree so that the tighter cluster is on the left (the last,
i.e., most recent, merge of the left subtree is at a lower value
than the last merge of the right subtree). Single observations are
the tightest clusters possible, and merges involving two
observations place them in order by their observation sequence
number.
In this light, should I read the plot and $order as a flipped version of
V1 V4 V5 V3 V2 ?
Could somebody kindly show, step by step, how the merges are done?
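For reference, the merge sequence asked about here is stored in the merge and height components of the hclust object, while order is only a permutation of the observations suitable for plotting. The following is a minimal sketch that prints the merges one step per row for the example data; the names describe_side and merge_steps are hypothetical helpers introduced for illustration, not part of hclust.

```r
# Rebuild the example data and the clustering from this post
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "complete")

# Hypothetical helper: in hc$merge a negative entry is a single observation,
# a positive entry refers to the cluster formed at that earlier step (row)
describe_side <- function(x, labels) {
  if (x < 0) labels[-x] else paste0("cluster from step ", x)
}

# One row per merge, in the order in which the merges were performed
merge_steps <- data.frame(
  step   = seq_len(nrow(hc.df$merge)),
  left   = sapply(hc.df$merge[, 1], describe_side, labels = hc.df$labels),
  right  = sapply(hc.df$merge[, 2], describe_side, labels = hc.df$labels),
  height = hc.df$height,
  stringsAsFactors = FALSE
)
print(merge_steps)
```

Read this way, the first row corresponds to the closest pair in the distance matrix, and each height under method="complete" is the largest pairwise distance between the two groups being joined.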
2. When cutree cuts the tree in two clusters, which number does it assign to the
cluster in which the profiles are most correlated? Is the numbering simply from
the right to left of the tree as it appears in hc.df$order?
3. If I take into account only the hc.df$order slot and the cluster number
assigned by cutree
> hc.df$order
[1] 2 3 1 4 5
> hc.df.cl
V1 V2 V3 V4 V5
 1  2  2  1  1
can I infer that the order of variables from left to right in decreasing order
of correlation between profiles is:
variable  V2 V3 V1 V4 V5
cluster    1  1  1  2  2
Is this correct? It does not seem to be supported by the actual distance
matrix. Even in reverse and with the cluster numbers flipped, the immediate
neighbor of V4 should be V3 and not V1.
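One way to check how the cutree ids relate to the plotting order is to index the cutree result by hc.df$order, which lists the ids in the left-to-right leaf order of the plot. This is only a sketch on the example data; the name cl_in_plot_order is introduced here for illustration. Judging from the output shown in this post, V1 receives id 1 although it is not the leftmost leaf, which suggests that the numbering follows the order of the observations in the data rather than their position in the tree.

```r
# Rebuild the example data and the clustering from this post
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "complete")
hc.df.cl <- cutree(hc.df, k = 2)

# Cluster ids listed in the left-to-right leaf order used by plot(hc.df)
cl_in_plot_order <- hc.df.cl[hc.df$order]
print(cl_in_plot_order)

# Cluster id of the first observation in the data (V1)
print(hc.df.cl[1])
```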
4. Most importantly, how could I use the results of these functions to output
the following:
A. The two clusters, labeled such that cluster 1 contains the pair of
profiles with smallest distance from each other.
B. The order of variables in decreasing order of correlation (increasing
value of distance). In this way the variable listed after the last entry in
cluster 1 will be the closest in distance to the members of cluster 1.
Can I use only the results of these functions (and if so, how), or do I need to
do other data manipulation (and if so, what exactly) to make sure the output
complies with the requirements above?
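A possible sketch for requirements A and B follows, under two assumptions that are mine and not part of hclust or cutree: (1) the closest pair is taken from the first row of hc.df$merge, since the first merge joins the two observations with the smallest distance; (2) "increasing distance" is interpreted greedily, appending at each step the remaining variable whose largest distance to the variables already listed is smallest, first within cluster 1 and then within cluster 2. The names tightest, relabelled, listed and result are hypothetical.

```r
# Rebuild the example data and the clustering from this post
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")
dist.df <- as.dist(1 - cor(df, method = "pearson"))
hc.df <- hclust(dist.df, method = "complete")
hc.df.cl <- cutree(hc.df, k = 2)

# Assumption (1): the first merge holds the closest pair, stored as negative
# observation indices
tightest <- -hc.df$merge[1, ]

# A. Relabel so that the cluster containing the closest pair becomes cluster 1
tight_id <- hc.df.cl[tightest[1]]
relabelled <- ifelse(hc.df.cl == tight_id, 1L, 2L)

# B. Assumption (2): greedy ordering by the largest distance to the variables
# already listed, exhausting cluster 1 before moving on to cluster 2
d <- as.matrix(dist.df)
listed <- tightest
for (id in 1:2) {
  repeat {
    remaining <- setdiff(which(relabelled == id), listed)
    if (length(remaining) == 0) break
    worst <- sapply(remaining, function(j) max(d[j, listed]))
    listed <- c(listed, remaining[which.min(worst)])
  }
}

result <- data.frame(variable = hc.df$labels[listed],
                     cluster  = unname(relabelled[listed]),
                     stringsAsFactors = FALSE)
print(result)
```

Using the largest distance mirrors the complete-linkage criterion used for the clustering itself; replacing max with min or mean in the inner function would give a single-linkage or average-linkage flavour of the same ordering.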
Thank you very much for your help in clarifying these issues.
> sessionInfo()
R version 2.8.1 (2008-12-22)
i486-pc-linux-gnu
With best regards,
Dana Sevak