I have a data set with 6 variables and 251 cases. The people who supplied me with this data set believe that it falls naturally into three groups, and have given me a rule for determining group number from these 6 variables. If I do

    scaled.stuff <- scale(stuff, TRUE, c(...the design ranges...))
    stuff.dist <- dist(scaled.stuff)
    stuff.hc <- hclust(stuff.dist)
    plot(stuff.hc)

I get a dendrogram which looks sort of plausible, but

(a) with this many leaves, the leaf labels really aren't legible at any plausible scaling, and would be best omitted. I could figure out which point was which if there were some way to use identify(), but I'm just not seeing it.

(b) what I'd really like to do is to colour the leaves according to the predicted group, or some other variable. The obvious thing to try is

    plot(stuff.hc, col=c("red","green","blue")[stuff.predicted.group])

but that doesn't work. I read everything that seemed plausible, and came across nodePar, but

    col <- c("red","green","blue")[stuff.predicted.group]
    plot(stuff.hc, nodePar=list(col=list("black",col)))

tells me repeatedly that

    parameter "nodePar" couldn't be set in high-level plot() function

while

    plot(as.dendrogram(stuff.hc), nodePar=list(col=list("black",col)))

draws the dendrogram (_much_ slower than plot() does) and still gives me no colouring at all. Clearly I have misunderstood how to use nodePar.

(c) The obvious fall-back is to use points() to draw the nodes again in the colours I want, but if I could do that, I could use identify(). The frustrating thing is that

    d <- dim(stuff)[1]
    plot(1:d, 1:d, col=col[stuff.hc$order])

shows me that there _is_ a strong connection between the groups found by hclust() and the predicted groups, albeit not a simple one.

I have looked at plot.dendrogram() and plotNode() -- using getAnywhere() -- and it looks to me as though what I want *should* be doable, but I've clearly misunderstood the details of how to do it.
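For what it's worth, the reason nodePar as used above does nothing per-leaf is that plot.dendrogram applies a single nodePar to every node; per-leaf colours have to be attached as a "nodePar" attribute on each leaf via dendrapply(). A minimal sketch, using iris in place of "stuff" and iris$Species in place of the hypothetical stuff.predicted.group:

```r
## Sketch only: colour each leaf of a dendrogram by a grouping vector.
## iris and its Species column stand in for the poster's data and groups.
hc  <- hclust(dist(scale(iris[, 1:4])))
grp <- as.numeric(iris$Species)          # hypothetical predicted groups
pal <- c("red", "green", "blue")

colour_leaf <- function(n) {
  if (is.leaf(n)) {
    i <- as.numeric(attr(n, "label"))    # iris rownames are "1".."150"
    attr(n, "nodePar") <- list(pch = 19, col = pal[grp[i]], lab.col = pal[grp[i]])
  }
  n
}
dnd <- dendrapply(as.dendrogram(hc), colour_leaf)
plot(dnd, leaflab = "none")              # coloured points, labels suppressed
```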
Richard A. O'Keefe wrote:

> I have a data set with 6 variables and 251 cases. The people who
> supplied me with this data set believe that it falls naturally into
> three groups, and have given me a rule for determining group number
> from these 6 variables.

One possibility is to extract the coordinates used by the dendrogram using par("usr") and then to do annotations using ?text. But as a global alternative in cases like this (many cases and a known number of classes), I would suggest a different cluster algorithm, e.g. ?kmeans. If you want to get a visual idea, you may try to apply an ordination method (e.g. princomp, or isoMDS from package MASS) and color the objects according to their class found by kmeans.

Hope it helps

Thomas P.
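Thomas's suggestion can be sketched in a few lines; iris again stands in for the original 6-variable data set, and the three centres are an assumption:

```r
## Sketch: kmeans clustering viewed in principal-component space,
## points coloured by the kmeans cluster assignment.
set.seed(1)                              # kmeans has a random start
dat <- scale(iris[, 1:4])
km  <- kmeans(dat, centers = 3, nstart = 10)
pc  <- princomp(dat)
plot(pc$scores[, 1:2], col = km$cluster, pch = 19,
     main = "kmeans clusters in PC space")
```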
I asked about putting some kind of coloured rug under a dendrogram.

Thomas Petzoldt <petzoldt at rcs.urz.tu-dresden.de> replied:

> One possibility is to extract the coordinates used by the dendrogram
> using par("usr") ...

Er, the documentation for par("usr") says

    'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes
    of the user coordinates of the plotting region. When a logarithmic
    scale is in use (i.e., 'par("xlog")' is true, see below), then the
    x-limits will be '10 ^ par("usr")[1:2]'. Similarly for the y-axis.

But I _know_ the (logical) coordinates of the plotting region; what I need is the coordinates of the leaves of the dendrogram.

> but as a global alternative in cases like this (many cases and known
> number of classes), I would suggest a different cluster algorithm,
> e.g. ?kmeans.

That doesn't really help, amongst other things because kmeans is not a hierarchical algorithm. I *DON'T* know the true number of classes. I know how many classes the person who collected the data thinks there are, and I don't need to do any clustering to find them; he gave me a simple rule. What I want to know is how many clusters there OUGHT to be, and how similar these clusters are to the ones he thought there were. From poking around, the "right" number of clusters is somewhere between 2 and 6. (For the record, I _have_ tried kmeans, and I've tabulated the kmeans groups against the prespecified groups.)

> If you want to get a visual idea you may try to apply an ordination
> method (e.g. princomp or isoMDS, the latter from package MASS) and
> color the objects according to their class found by kmeans.

I had already done that (using the prespecified classes, not classes found by kmeans). But it didn't solve my present problem, which was overlaying the *prespecified* classes onto a dendrogram.

Two other people gave me answers that are spot on. Unfortunately, I've now lost their messages, so I can't name them.

Suggestion 1: use the RowSideColors (or ColSideColors) argument of heatmap().
This gives me two dendrograms (and I can suppress one if I want) and a heat image of the data, and all things considered, it's *better* than what I wanted. (I was aware of heatmap, but I'd failed to notice the relevance, or even the existence, of the ???SideColors arguments.) In this particular case, the graph _beautifully_ displays what I want it to display.

Suggestion 2: use the draw.clust function from the maptree package. I have now installed this package (which R makes *so* easy) and it does exactly what I asked for.

Both of these approaches work with any dendrogram. I'm beginning to suspect that if something isn't already available in R, I'll never be able to imagine a need for it. But then I'm a bear of very little brain...
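Suggestion 1 can be sketched as follows; iris$Species stands in for the prespecified groups, which is an assumption:

```r
## Sketch: heatmap() with RowSideColors draws a coloured strip beside
## the row dendrogram, one colour per prespecified group.
dat  <- as.matrix(scale(iris[, 1:4]))
side <- c("red", "green", "blue")[as.numeric(iris$Species)]  # group colours
heatmap(dat, RowSideColors = side)
```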
How about this: "hc" is your hclust object, "colv" the color vector ordered like the original data, and "height" the height of the color bar as a fraction of the dendrogram height.

    colorplot.hclust <- function(hc, colv, height=0.05) {
        stopifnot(length(hc$order) == length(colv))
        plot(hc, lab=FALSE, hang=0)
        xy.mat <- list(x=1:length(colv), y=c(-max(hc$height)*height, 0))
        image(xy.mat, z=matrix(colv[hc$order], ncol=1), add=TRUE)
    }

    ## Example:
    data(iris)
    hc1 <- hclust(dist(scale(iris[,1:4])), method="ward")
    colorplot.hclust(hc1, as.numeric(iris[,5]))

(I've moved the stopifnot() check before the plot() call so nothing is drawn if the lengths don't match.)

hth,
Martin Keller-Ressel
--