kosmo7
2012-Feb-23 15:54 UTC
[R] Advice on exploration of sub-clusters in hierarchical dendrogram
Dear R users,

I am a biochemist/bioinformatician, currently working on clustering proteins by conformational similarity. I started working seriously with R only a couple of months ago. So far I have been able to read my way through tutorials and set up my hierarchical clusterings.

My problem is that I cannot find a way to obtain information on what lies below specific nodes, i.e. below specific clusters of interest. In other words, I am trying to read the sub-clusters of a specific cluster in the dendrogram, by isolating a specific node and exploring its lower hierarchy locally.

Please allow me to show some of the code I have been using, for your reference:

library(cluster) #needed for clusplot()
df=read.table('mydata.txt', head=T, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map
var1<-var(clusters==1) #variance of cluster 1

#extract cluster memberships:
clids = as.data.frame(clusters)
names(clids) = c("id")
clids$cdr = row.names(clids)
row.names(clids) = c(1:dim(clids)[1])
clstructure = lapply(unique(clids$id), function(x){clids[clids$id == x,'cdr']})

clstructure[[1]] #get memberships of cluster 1

From this point, eventually, I could recreate a distance matrix with only the members of a specific cluster, re-apply hierarchical clustering, and start all over again. But this would take me ages to perform individually for hundreds of clusters. So I was hoping someone could point me in a direction as to how to take advantage of the initial dendrogram and focus on specific clusters, from which to derive the sub-clusters at a new given cutoff height.
I recently found the following code on this page:
http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual

clid <- c(1,2)
ysub <- y[names(mycl[mycl%in%clid]),]
hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete") # Select sub-cluster number (here: clid=c(1,2)) and generate corresponding dendrogram.

Even with this example I am afraid I can't work my way around. So I guess in my case I could grab all the members of a specific cluster using my existing code:

cluster1members<-clstructure[[1]]

Then I need to reformat the distance matrix into a new one, say d1, that only contains the distances between those members, which I can feed to a new, local hierarchical clustering:

hrsub<-hclust(d1, method="complete")

Any ideas on how I can obtain a new distance matrix with just the distances between the members of that cluster, whose names are contained in the vector "cluster1members"?

Apologies if this seems trivial, but I really can't find the correct functions to use for this task. Thank you very much in advance. As I am really a novice with R, small chunks of code as examples would be of great help.

Take care all

--
View this message in context: http://r.789695.n4.nabble.com/Advice-on-exploration-of-sub-clusters-in-hierarchical-dendrogram-tp4414277p4414277.html
Sent from the R help mailing list archive at Nabble.com.
ilai
2012-Feb-23 17:48 UTC
[R] Advice on exploration of sub-clusters in hierarchical dendrogram
See inline.

On Thu, Feb 23, 2012 at 8:54 AM, kosmo7 <dnicolgr at hotmail.com> wrote:
> Dear R user,
> In other words, I am trying to obtain/read the sub-clusters of a specific
> cluster in the dendrogram, by isolating a specific node and exploring
> locally its lower hierarchy.

To explore or "zoom in" on elements of z you had the first step right: create x<-as.dendrogram(z). But then you didn't use x anymore (except for the plot, which could have been done on z). Maybe you wanted:

> df=read.table('mydata.txt', head=T, row.names=1) #read file with distance matrix
> d=as.dist(df) #format table as distance matrix
> z<-hclust(d,method="complete", members=NULL)
> x<-as.dendrogram(z)
> plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
>
> From this point

clusters<-cut(x, h=1.6) #obtain clusters at cutoff height=1.6

# clusters is now (after cut on x, not cutree on z) a list of two components, upper and lower. upper is the dendrogram of the structure above 1.6, and lower is a list of dendrograms, one for each local cluster below it:

plot(clusters$upper) # the structure above 1.6
plot(clusters$lower[[1]]) # cluster 1

# To print the details of cluster 1 (this output may be very long, depending on how many members it has):
str(clusters$lower[[1]])

To extract specific details from the list and automate this for all or some of the clusters, ?dendrapply is your friend. I'm assuming your attempts at reclustering locally later in your post are no longer necessary, unless I'm missing something about what exactly you are trying to do.
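As a rough illustration of the cut() approach described above, here is a minimal sketch on simulated data. The toy points, the relative cutoffs (half of a merge height) and the variable names are assumptions made for the example; they are not taken from the original data, where the cutoff was an absolute h=1.6. It lists the members of every branch below the cutoff with labels(), then cuts one branch again at a lower height without recomputing any distances.

```r
# Toy data standing in for the poster's distance matrix (assumption: two
# well-separated groups of labelled points).
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
rownames(pts) <- paste0("p", seq_len(nrow(pts)))
d <- dist(pts)

z <- hclust(d, method = "complete")
x <- as.dendrogram(z)

# Hypothetical cutoff: half the height of the top merge, so the root is
# split into at least two branches.
h_top <- 0.5 * max(z$height)
parts <- cut(x, h = h_top)

# Leaf labels of every branch below the cutoff (one vector per cluster).
members <- lapply(parts$lower, labels)

# Zoom in on the largest branch and cut it again at a lower height.
sizes <- lengths(members)
big <- parts$lower[[which.max(sizes)]]
h_sub <- 0.5 * attr(big, "height")
sub_members <- lapply(cut(big, h = h_sub)$lower, labels)

print(members)      # clusters at the first cutoff
print(sub_members)  # sub-clusters of the largest cluster
```

Because each element of parts$lower is itself a dendrogram, the same cut() and labels() calls can be repeated at any depth, or wrapped in lapply() to process all branches at once.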
Hope this helps

Elai
kosmo7
2012-Feb-24 14:50 UTC
[R] Advice on exploration of sub-clusters in hierarchical dendrogram
Ok, I was finally able to work it out. As I have been helped numerous times myself by questions from other users who eventually posted the solution to their problem, I will put the code that worked for me here for future googlers. It is certainly not optimal, but it works:

# Initial clustering
library(cluster) #needed for clusplot() and silhouette()
df=read.table('mydata.txt', head=T, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map

# Local sub-clustering (actually re-clustering on a specific tree node/cluster)

# Transform the distance matrix to a simple matrix. We should ideally work with the initial data table, but it sometimes contains an "X" letter preceding labels, and there is a risk that labels aren't recognized when compared to name vectors. Distance matrices don't contain the preceding "X", so I transformed it back to a simple matrix (this step might not be required, depending on your initial data table format).
h<-as.matrix(d)

# A vector containing the numbers of the clusters of the initial clustering that you want to pick. Separate with commas if more than one cluster. Here we only want cluster 1.
clid<-c(1)

# Keep only the rows of h whose label belongs to a member of cluster 1
ysub<-h[names(clusters[clusters%in%clid]),]

# We want a square table to be used as a distance matrix later on, so we transpose the previous table ysub and again remove the unneeded rows.
ysub<-t(ysub)[names(clusters[clusters%in%clid]),]

hrsub<-hclust(as.dist(ysub),method="average") #Perform your preferred hierarchical method on just the initial clusters selected with clid
plot(hrsub)
ord2<-cmdscale(ysub, k=2)
plot(ord2) # Now we can visually "zoom" in on the data configuration of just the selected cluster by 2D MDS
aa<-silhouette(cutree(hrsub,h=1.2),as.dist(ysub)) #We can perform silhouette analysis locally on the selected cluster (by clid)
plot(aa)
clusplot(ord2,cutree(hrsub,h=1.2), color=TRUE, shade=TRUE,labels=4, lines=0) # clusterplot of the subclusters

Thanks for reading, and take care all.

PS. If anyone can write all these things in a more efficient way, please feel free to add a comment.
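In response to the PS, the two-step row/transpose subsetting above can probably be collapsed into a single index on both dimensions of the matrix, and the whole re-clustering wrapped in a function so that it runs over every cluster. The following is only a sketch on simulated data: the helper names sub_dist and recluster, the toy points and the relative cutoff rel_h are hypothetical, introduced for illustration (the post itself uses absolute cutoffs such as h=1.2).

```r
# Toy data standing in for 'mydata.txt' (assumption for this sketch).
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
rownames(pts) <- paste0("p", seq_len(nrow(pts)))
d <- dist(pts)
z <- hclust(d, method = "complete")
clusters <- cutree(z, h = 0.5 * max(z$height))

m <- as.matrix(d)

# Hypothetical helper: distances among 'members' only. Indexing rows and
# columns together replaces the subset / transpose / subset sequence.
sub_dist <- function(m, members) {
  as.dist(m[members, members, drop = FALSE])
}

# Hypothetical helper: re-cluster one group and cut it at a fraction of
# its own top merge height. hclust() needs at least two objects, so
# singletons are returned as their own sub-cluster.
recluster <- function(members, rel_h = 0.5) {
  if (length(members) < 2) {
    return(setNames(rep(1L, length(members)), members))
  }
  hr <- hclust(sub_dist(m, members), method = "average")
  cutree(hr, h = rel_h * max(hr$height))
}

# One named vector of sub-cluster ids per initial cluster.
sub_clusters <- lapply(split(names(clusters), clusters), recluster)
print(sub_clusters[[1]])
```

The same lapply() call scales to hundreds of clusters, and recluster could just as well return the hclust object when the plots or the silhouette analysis are needed for each group.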