Hi,

I have a gene expression experiment with 20 samples and 25000 genes each. I'd like to perform clustering on these. It turned out to be much faster when I transpose the underlying matrix with t(matrix). Unfortunately, I am then no longer able to use cutree to access the individual clusters. In general I do something like this:

hc <- hclust(dist(USArrests), "ave")
library(RColorBrewer)
library(gplots)
clrno <- 3
cols <- rainbow(clrno, alpha = 1)
clstrs <- cutree(hc, k = clrno)
ccols <- cols[as.vector(clstrs)]
heatcol <- colorRampPalette(c(3, 1, 2), bias = 1.0)(32)
heatmap.2(as.matrix(USArrests), Rowv = as.dendrogram(hc), col = heatcol,
          trace = "none", RowSideColors = ccols)

Nice, I can access the 3 main clusters with cutree. But what about the situation where I run hclust like this:

hc <- hclust(dist(t(USArrests)), "ave")

which I have to do in order to speed up the clustering process. This I can plot with:

heatmap.2(as.matrix(USArrests), Colv = as.dendrogram(hc), col = heatcol,
          trace = "none")

But where do I find information about the clustering that was applied to the rows? cutree(hc, k = clrno) delivers the clustering of the columns, so what can I do to access the cluster levels for the rows?

I guess the solution is easy, but after hours of playing around I thought it might be a good time to contact the mailing list!

Maxim
Don't you expect it to be a lot faster if you cluster 20 items instead of
25000?
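To spell that out: dist() computes distances between the rows of its argument, so hclust(dist(t(m))) builds a tree over the columns (the 20 samples), and cutree() on that tree can only label columns. Row clusters need a tree built on the rows. A minimal sketch using USArrests as a stand-in for the expression matrix; the names x, hc_rows, hc_cols, row_clusters and col_clusters are illustrative and not from the original post:

```r
x <- as.matrix(USArrests)

# Tree over the rows (50 states here; the genes in the real data).
hc_rows <- hclust(dist(x), "ave")
# Tree over the columns (4 variables here; the samples in the real data).
hc_cols <- hclust(dist(t(x)), "ave")

k <- 3
row_clusters <- cutree(hc_rows, k = k)  # named vector, one entry per row
col_clusters <- cutree(hc_cols, k = k)  # named vector, one entry per column

table(row_clusters)  # cluster sizes for the rows
col_clusters         # cluster label for each column

# Optional plot, only if gplots is installed: each dendrogram is passed
# to the side it was computed for.
if (requireNamespace("gplots", quietly = TRUE)) {
  cols <- rainbow(k)
  gplots::heatmap.2(x,
                    Rowv = as.dendrogram(hc_rows),
                    Colv = as.dendrogram(hc_cols),
                    trace = "none",
                    RowSideColors = cols[row_clusters],
                    ColSideColors = cols[col_clusters])
}
```

The column tree is fast simply because it has few leaves. It cannot replace the row tree; if clustering all 25000 genes is too slow, filtering to a subset of genes first is one common workaround, though that choice is an assumption about the analysis and not something stated in the post.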
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Maxim
Sent: Wednesday, March 02, 2011 4:08 PM
To: r-help at r-project.org
Subject: [R] clustering problem
Hi Guys
I want to apply a clustering algorithm to my dataset in order to find
regions of points (X, Y) that have similar values of percent_GC and
mean_phred_quality. Details below.
I have sampled 1% of the points from my main data set of 85 million
points. The result is still fairly large, about 800K points, and looks
like the following.
       X    Y  percent_GC  mean_phred_quality
1   4286  930        0.50                0.13
2   4825  947        0.50               20.33
3   8207  932        0.32               26.50
4   8451  940        0.48               24.81
5   9331  931        0.38               16.93
6  11501  949        0.49               31.28
What I want to do is find local regions in which these four values are
associated, i.e. regions where the position (X, Y) is closely correlated
with percent_GC and mean_phred_quality.
PS: I calculated the overall Pearson correlation coefficient between
percent_GC and mean_phred_quality, and it is not statistically
significant, which got me interested in finding local regions where
it may be.
I would really appreciate your help, as I am still a rookie at applying
clustering algorithms.
Thanks!
-Abhi
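One possible way to look for such local regions, offered only as a sketch and not as an approach anyone in the thread proposed: bin the points into square tiles on (X, Y), compute the Pearson correlation between percent_GC and mean_phred_quality within each tile, and adjust the p-values for the number of tiles tested. The data below are synthetic stand-ins with the same column names, and tile_size and the minimum of 30 points per tile are hypothetical values chosen for illustration.

```r
set.seed(1)

# Synthetic stand-in for the sampled points (NOT the real data).
n <- 5000
pts <- data.frame(
  X = runif(n, 0, 20000),
  Y = runif(n, 0, 20000),
  percent_GC = runif(n, 0.2, 0.8),
  mean_phred_quality = rnorm(n, mean = 25, sd = 5)
)

# Assign each point to a square tile; tile_size is a hypothetical choice.
tile_size <- 5000
pts$tile <- interaction(pts$X %/% tile_size, pts$Y %/% tile_size, drop = TRUE)

# Pearson correlation test within each tile, skipping sparse tiles.
per_tile <- lapply(split(pts, pts$tile), function(d) {
  if (nrow(d) < 30) return(NULL)
  ct <- cor.test(d$percent_GC, d$mean_phred_quality)
  data.frame(tile = as.character(d$tile[1]),
             n = nrow(d),
             r = unname(ct$estimate),
             p = ct$p.value)
})
tile_cor <- do.call(rbind, Filter(Negate(is.null), per_tile))

# Many tiles are tested, so adjust the p-values (Benjamini-Hochberg).
tile_cor$p_adj <- p.adjust(tile_cor$p, method = "BH")

# Tiles ordered from strongest to weakest evidence of a local correlation.
tile_cor[order(tile_cor$p_adj), ]
```

The result depends on the tile size, so it is worth trying several sizes. Strictly speaking this is a tiling approach and not a clustering algorithm; it is swapped in here because the stated goal is locating regions with a local correlation.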