I am doing cluster analysis [hclust(Dist, method="average")] on
data that potentially contains redundant objects. As expected,
the inclusion of redundant objects affects the clustering result,
i.e., the data a1, = a2, = a3, b, c, d, e1, = e2 is likely to
cluster differently from the same data without the redundancy,
i.e., a1, b, c, d, e1. This is apparent when the outcome is
visualized as a dendrogram.

Now, it seems that the clustering result for which the redundancy
has been eliminated is more robust for the present assignment
than that of the redundant data. Naturally, there is no problem
in the elimination: just exclude the redundant objects from Dist.

However, it would be very convenient to be able to include the
redundant objects in the *dendrogram* by attaching them as
0-level branches to the subtrees, i.e.:

1.0........-------........
0.5....___|__...._|_......
0.0.._|_..|..|..|.._|_....
....|.|.|.|..|..|.|...|...
...a1a2a3.b..c..d.e1.e2...

instead of

1.0........-------........
0.5....___|__...._|_......
0.0...|...|..|..|...|.....
......a1..b..c..d..e1.....

The question: Can this be accomplished in the *dendrogram plot*
by manipulating the resulting hclust data structure or by some
other means, and if yes, how?

Jopi Harri
Charles C. Berry
2009-Nov-16 17:13 UTC
[R] Cluster analysis: hclust manipulation possible?
On Mon, 16 Nov 2009, Jopi Harri wrote:

> I am doing cluster analysis [hclust(Dist, method="average")] on
> data that potentially contains redundant objects. As expected,
> the inclusion of redundant objects affects the clustering result,
> i.e., the data a1, = a2, = a3, b, c, d, e1, = e2 is likely to
> cluster differently from the same data without the redundancy,
> i.e., a1, b, c, d, e1. This is apparent when the outcome is
> visualized as a dendrogram.
>
> Now, it seems that the clustering result for which the redundancy
> has been eliminated is more robust for the present assignment
> than that of the redundant data. Naturally, there is no problem
> in the elimination: just exclude the redundant objects from Dist.
>
> However, it would be very convenient to be able to include the
> redundant objects in the *dendrogram* by attaching them as
> 0-level branches to the subtrees, i.e.:
>
> 1.0........-------........
> 0.5....___|__...._|_......
> 0.0.._|_..|..|..|.._|_....
> ....|.|.|.|..|..|.|...|...
> ...a1a2a3.b..c..d.e1.e2...
>
> instead of
>
> 1.0........-------........
> 0.5....___|__...._|_......
> 0.0...|...|..|..|...|.....
> ......a1..b..c..d..e1.....
>
> The question: Can this be accomplished in the *dendrogram plot*
> by manipulating the resulting hclust data structure or by some
> other means, and if yes, how?

Yes, you need to study

	?hclust

particularly the part about 'Value' from which you will see what needs
modification.

Here is a very simple example:

> res <- hclust(dist(1-diag(3)*rnorm(3)))
> plot(res)
> res2 <- res
> res2$merge <- rbind(-cbind(1:3,4:6), matrix(ifelse( res2$merge<0, -res2$merge, res2$merge+sum(res2$merge<0)),2))
> res2$height <- c(rep(0,3), res2$height)
> res2$order <- as.vector( rbind(res2$order,(4:6)[res2$order]) )
> plot(res2)
> str( res )
> str( res2 )

Alternatively, you could use as.dendrogram( res ) as the point of
departure and manipulate the value.
HTH,

Chuck

> Jopi Harri

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu             UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
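
For readers following along, the example above can be made reproducible and annotated step by step. This is only a sketch: the fixed coordinates, the labels, the ".dup" suffix and the variable names are assumptions chosen for illustration and are not part of the original message. It attaches one height-0 duplicate to every leaf and then uses cutree() to check that each original ends up grouped with its duplicate.

```r
# Cluster three labelled points (assumed toy data).
x <- c(a = 1, b = 2, c = 4)
res <- hclust(dist(x))
n <- length(res$order)            # number of original leaves

res2 <- res
# New leaves n+1 .. 2n are the duplicates of leaves 1 .. n.
# Steps 1 .. n: merge each original leaf i with its duplicate n+i.
pairs <- cbind(-seq_len(n), -(n + seq_len(n)))
# Old rows: a single leaf -i now refers to pair step i,
# and an old step k becomes step k+n.
old <- ifelse(res$merge < 0, -res$merge, res$merge + n)
res2$merge  <- rbind(pairs, old)
res2$height <- c(rep(0, n), res$height)
# Each original leaf is followed by its duplicate in the plotting order.
res2$order  <- as.vector(rbind(res$order, n + res$order))
res2$labels <- c(res$labels, paste0(res$labels, ".dup"))

# Cutting at height 0 keeps only the zero-height merges.
groups <- cutree(res2, h = 0)
print(groups)
```

Calling plot(res2) afterwards should show every leaf with a twin attached at height 0.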
-------- Original Message --------
Subject: Re: [R] Cluster analysis: hclust manipulation possible?
Date: Mon, 16 Nov 2009 19:22:54 -0800
From: Charles C. Berry <cberry at tajo.ucsd.edu>
To: Jopi Harri <jopi.harri at utu.fi>
References: <4B016237.7050706 at utu.fi> <Pine.LNX.4.64.0911160906420.27075 at tajo.ucsd.edu> <4B01BC5D.3020504 at utu.fi>

On Mon, 16 Nov 2009, Jopi Harri wrote:

> On 16.11.2009 19:13, Charles C. Berry wrote:
>>> The question: Can this be accomplished in the *dendrogram plot*
>>> by manipulating the resulting hclust data structure or by some
>>> other means, and if yes, how?
>>
>> Yes, you need to study
>>
>> 	?hclust
>>
>> particularly the part about 'Value' from which you will see what needs
>> modification.
>>
>> Here is a very simple example:
>>
>>> res <- hclust(dist(1-diag(3)*rnorm(3)))
>>> plot(res)
>>> res2 <- res
>>> res2$merge <- rbind(-cbind(1:3,4:6), matrix(ifelse( res2$merge<0, -res2$merge, res2$merge+sum(res2$merge<0)),2))
>>> res2$height <- c(rep(0,3), res2$height)
>>> res2$order <- as.vector( rbind(res2$order,(4:6)[res2$order]) )
>>> plot(res2)
>>> str( res )
>>> str( res2 )
>
> Dear Chuck,
>
> Many thanks for spending your valuable time on the suggestions
> and the example. However, the drawback is that as a humanist I
> have been having considerable difficulties in figuring out what
> exactly to do. After hours of experimenting I could modify
> another dendrogram (without crashing R), but I still fail to get
> the result I want: the added leaf is not attached where I
> intend; instead, other adjacent leaves have their
> height turned to 0.
>
> The question, to put it more clearly perhaps: Is there any
> straightforward procedure to just add a single leaf to any
> dendrogram, next to an existing leaf at the height 0, and if
> there is, what might that be?
>
> As of now, it seems that the $merge has to be modified correctly,
> but what is the exact strategy, if there is one (other than
> redoing the whole clustering by hand)?

First, read the ?hclust page and see what it says about merge.

Then look at a really simple example like

	cl <- hclust( dist( c(1,2,4) ) )
	plot(cl)
	unclass( cl )

The unclass() strips the class attribute and allows print() to give you a
bit more detail.

Now make the figure a bit more complicated:

	cl2 <- hclust(dist(as.matrix(c(1,2,4,4.5))))
	plot(cl2)
	unclass(cl2)

and see what has changed in $merge, $height, and $order.

Once you get the hang of it, you'll be in a position to modify an existing
hclust object.

Chuck

p.s. it is best to post replies like yours to the whole list; others may
want to know the same thing that you want to know or others may give a
better reply than I have.

>> Alternatively, you could use as.dendrogram( res ) as the point of
>> departure and manipulate the value.
>
> Possibly, yes, but I am even less well-equipped for editing that
> sort of data type.
>
> Sincerely,
>
> Jopi Harri
> Musicologist
> University of Turku
> Finland
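
To make that comparison concrete, the sketch below prints the three components side by side for both toy trees. The helper name show_parts is hypothetical, and the comments summarize how ?hclust describes each component; it is meant as a reading aid, not as part of the original reply.

```r
# Hypothetical helper: print only the parts of an hclust object that
# have to change when a leaf is added.
show_parts <- function(cl) {
  print(cbind(step = seq_len(nrow(cl$merge)), cl$merge, height = cl$height))
  cat("order:", cl$order, "\n")
  invisible(cl)
}

cl  <- hclust(dist(c(1, 2, 4)))
cl2 <- hclust(dist(as.matrix(c(1, 2, 4, 4.5))))

# Reading a row of $merge:
#   a negative entry -i is the single observation i;
#   a positive entry  k is the cluster formed at the earlier step k.
# $height[k] is the height at which step k is drawn, and $order lists
# the observations from left to right along the bottom of the plot.
show_parts(cl)

# With a fourth point, one more row appears in $merge, one more value
# in $height, and one more index in $order.
show_parts(cl2)
```

Adding a leaf therefore always means adding one row to $merge, one value to $height and one index to $order, and renumbering any positive entries that refer to steps that have moved.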
On 17.11.2009 5:22, Charles C. Berry wrote:

> Once you get the hang of it, you'll be in a position to modify an existing
> hclust object.

I believe that I managed to solve the problem. (The code may not
be too refined, and my R is perhaps a bit dialectal. The function
may fail especially if the addition of multiple identical labels
is attempted.)

So, for the addition of a single duplicate label, one needs to
increment the positive values in $merge by one, and keep the
negative values except for the original of the duplicate, which
is set to +1 (a reference to the new first merge step). Then, the
duplicate pair [the value for the new label being
-(abs(min($merge))+1)] is added on top of $merge.

The other manipulations involved are the addition of height 0,
the label for the duplicate, and placing it properly in $order.

Once more thanks for the assistance.

Jopi Harri

dup.hclust=function(Hc,Label,DupLabel)
# We add to hclust Hc the duplicate DupLabel of Label.
# May fail in certain conditions, but shouldn't in normal use.
{
  if (is.null(Hc$labels)) stop("Labels are required!");
  Mer=Hc$merge;
  Hght=Hc$height;
  Ord=Hc$order;
  Labs=Hc$labels;
  DupLNo=abs(min(Mer))+1;          # index of the new leaf (n+1)
  LNo=which(Labs==Label);          # index of the original leaf
  LPlace=which(Labs[Ord]==Label);  # its position in the plotting order
  Hght=c(0,Hght);                  # the new pair merges at height 0
  Labs=c(Labs,DupLabel);
  # Insert the new leaf right after the original in the plotting order.
  Ord=append(Ord,DupLNo,after=LPlace[1]);
  # Existing merge steps move down by one row, so shift their references.
  NewMer=matrix(ifelse(Mer<0,Mer,Mer+1),nrow(Mer));
  # The original leaf is now reached through the new first step.
  NewMer[NewMer==-LNo]=1;
  # The duplicate pair becomes step 1.
  NewMer=rbind(c(-LNo,-DupLNo),NewMer);
  Hc$merge=NewMer;
  Hc$height=Hght;
  Hc$order=Ord;
  Hc$labels=Labs;
  return(Hc);
}
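
The function above adds one duplicate per call. As a hedged sketch of one possible way to attach several duplicates in a single pass, including more than one per original (the a1 = a2 = a3 case from the first message), the same bookkeeping can be generalized as below. The name add_zero_leaves, its dups argument and the toy coordinates are hypothetical and were not posted in the thread.

```r
# Hypothetical helper: attach duplicates to an hclust tree as height-0 leaves.
# 'dups' is a named character vector: names are the new (duplicate) labels,
# values are the existing labels they duplicate.
add_zero_leaves <- function(hc, dups) {
  stopifnot(inherits(hc, "hclust"), !is.null(hc$labels),
            !is.null(names(dups)), all(dups %in% hc$labels),
            !anyDuplicated(c(hc$labels, names(dups))))
  n <- length(hc$labels)            # leaves in the original tree
  m <- length(dups)                 # leaves to add (numbered n+1 .. n+m)
  orig <- match(dups, hc$labels)    # leaf index each duplicate attaches to

  # ref[i] is how leaf i is currently referred to in $merge: -i while it
  # is a single leaf, or the step number of its latest height-0 merge.
  ref <- -seq_len(n)
  new_rows <- matrix(0L, nrow = m, ncol = 2)
  for (j in seq_len(m)) {
    new_rows[j, ] <- c(ref[orig[j]], -(n + j))
    ref[orig[j]] <- j
  }

  # The old merge steps move down by m rows, so step k becomes step k + m,
  # and each duplicated leaf is replaced by its topmost height-0 cluster.
  old <- hc$merge
  pos <- old > 0
  old[pos] <- old[pos] + m
  old[!pos] <- ref[-old[!pos]]

  hc$merge  <- rbind(new_rows, old)
  hc$height <- c(rep(0, m), hc$height)
  hc$labels <- c(hc$labels, names(dups))

  # Plotting order: each original leaf is followed by its duplicates.
  extra <- split(n + seq_len(m), factor(orig, levels = seq_len(n)))
  hc$order <- unlist(lapply(hc$order, function(i) c(i, extra[[i]])),
                     use.names = FALSE)
  hc
}

# Toy data loosely following the thread's example (coordinates are made up).
x   <- c(a1 = 0, b = 1, c = 3, d = 10, e1 = 12)
hc  <- hclust(dist(x), method = "average")
hc2 <- add_zero_leaves(hc, c(a2 = "a1", a3 = "a1", e2 = "e1"))

hc2$labels[hc2$order]   # leaf labels from left to right
cutree(hc2, h = 0)      # duplicates share a group with their original
```

The clustering itself is computed on the deduplicated data only, so the heights above 0 are unchanged; plot(hc2) should then draw a2 and a3 next to a1, and e2 next to e1, joined at height 0.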