Hi,

I have a gene expression experiment with 20 samples and 25000 genes each, and I'd like to perform clustering on these. The clustering turned out to be much faster when I transpose the underlying matrix with t(matrix). Unfortunately, I am then no longer able to use cutree to access the individual clusters.

In general I do something like this:

    hc <- hclust(dist(USArrests), "ave")
    library(RColorBrewer)
    library(gplots)
    clrno <- 3
    cols <- rainbow(clrno, alpha = 1)
    clstrs <- cutree(hc, k = clrno)
    ccols <- cols[as.vector(clstrs)]
    heatcol <- colorRampPalette(c(3, 1, 2), bias = 1.0)(32)
    heatmap.2(as.matrix(USArrests), Rowv = as.dendrogram(hc), col = heatcol,
              trace = "none", RowSideColors = ccols)

Nice, I can access the 3 main clusters with cutree. But what about the situation where I perform hclust like this:

    hc <- hclust(dist(t(USArrests)), "ave")

which I have to do in order to speed up the clustering process. This I can plot with:

    heatmap.2(as.matrix(USArrests), Colv = as.dendrogram(hc), col = heatcol,
              trace = "none")

But where do I find information about the clustering that was applied to the rows? cutree(hc, k = clrno) delivers the clustering of the columns, so what can I do to access the levels for the rows?

I guess the solution is easy, but after hours of playing around I thought it might be a good time to contact the mailing list!

Maxim
Don't you expect it to be a lot faster if you cluster 20 items instead of 25000?

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Maxim
Sent: Wednesday, March 02, 2011 4:08 PM
To: r-help at r-project.org
Subject: [R] clustering problem
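The point of the reply can be made concrete with a minimal sketch. It uses the built-in USArrests data (50 rows, 4 columns) as a stand-in for the expression matrix; the names x, hc_rows, hc_cols, row_clusters and col_clusters are illustrative and do not appear in the original post. dist() computes distances between the rows of its argument, so dist(t(x)) clusters the columns only, and cutree on that tree can only ever return column labels. Row clusters require a tree built on the rows themselves.

```r
# Stand-in data: 50 rows (states) and 4 columns (variables)
x <- as.matrix(USArrests)

# dist() works on rows, so this tree groups the 50 rows
hc_rows <- hclust(dist(x), method = "average")

# Transposing first means this tree groups the 4 columns instead
hc_cols <- hclust(dist(t(x)), method = "average")

# cutree returns one label per item in the tree it is given
row_clusters <- cutree(hc_rows, k = 3)  # 50 labels, named by row
col_clusters <- cutree(hc_cols, k = 3)  # 4 labels, named by column

length(row_clusters)
length(col_clusters)
```

Both trees can be passed to heatmap.2 together, as Rowv = as.dendrogram(hc_rows) and Colv = as.dendrogram(hc_cols). For the 25000-gene case, the row tree is the expensive one: dist() on 25000 rows stores about 3.1e8 pairwise distances, roughly 2.5 GB as doubles, which is likely why the transposed version seemed so much faster.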
Hi guys,

I want to apply a clustering algorithm to my dataset in order to find regions of points (X, Y) that have similar values of percent_GC and mean_phred_quality. Details below.

I have sampled 1% of the points from my main data set of 85 million points. The result is still somewhat large (800K points) and looks like the following:

          X    Y  percent_GC  mean_phred_quality
    1  4286  930        0.50                0.13
    2  4825  947        0.50               20.33
    3  8207  932        0.32               26.50
    4  8451  940        0.48               24.81
    5  9331  931        0.38               16.93
    6 11501  949        0.49               31.28

What I want to do is find local regions in which there are associations between these 4 values, i.e. regions where the points (X, Y) correlate closely with percent_GC and mean_phred_quality.

PS: I did calculate the overall Pearson correlation coefficient between percent_GC and mean_phred_quality, and it is not statistically significant, which got me interested in finding local regions where it may be.

I would really appreciate your help, as I am still a rookie in applying clustering algorithms.

Thanks!
-Abhi
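One possible starting point, offered only as a sketch and not as the answer to the question: standardise the four columns, run k-means, and then look at the correlation between percent_GC and mean_phred_quality within each cluster. The data below are simulated and purely illustrative (they are not the poster's data), the choice of k = 5 is arbitrary, and the names pts, scaled, km and cors are made up for this example. k-means is used here because hclust needs the full pairwise distance matrix, which is not feasible for 800K points.

```r
set.seed(1)

# Simulated stand-in with the same four columns; values are made up
n <- 1000
pts <- data.frame(
  X                  = runif(n, 0, 20000),
  Y                  = runif(n, 900, 1000),
  percent_GC         = runif(n, 0.2, 0.7),
  mean_phred_quality = runif(n, 0, 40)
)

# Standardise so that no single column dominates the Euclidean distance
scaled <- scale(pts)

# k = 5 is an arbitrary assumption for this sketch
km <- kmeans(scaled, centers = 5, nstart = 10, iter.max = 50)
pts$cluster <- km$cluster

# Pearson correlation between GC content and quality within each cluster
cors <- sapply(split(pts, pts$cluster),
               function(d) cor(d$percent_GC, d$mean_phred_quality))
cors
```

Whether clusters found this way correspond to meaningful local regions depends on the data; clustering on X and Y alone and then testing the correlation inside each region would be an equally reasonable variant.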