Wiener, Matthew
2001-Nov-29 17:13 UTC
[R] memory issue trying to solve too large a problem using hclust
Hi, all.

I'm trying to cluster 12,500 objects using hclust from package mva. The distance matrix takes up nearly 600 MB. The distance matrix also needs to be copied when being passed to the fortran routine that actually does the clustering (it's modified during the clustering), so that's 1200 MB. I'm actually on a machine with 2.5 GB of memory (and nothing else running), so I thought I could pull this off. The routine quits with the error "cannot allocate a vector of size 609131 KB", which by its size seems to be another copy of the distance matrix, I think the one needed by the fortran routine. As far as I can tell from looking at the code, no additional objects of the size of the distance matrix are used.

After the error gc() says that the garbage collection threshold is 1433 MB.

I'm wondering whether some additional copies of the distance matrix are being made, and whether I could somehow stop them from being made. Any other suggestions for how I could get around the memory problem would also be appreciated. (I know of clara in the "cluster" package, but would like to use hierarchical methods.)

The function hierclust in multiv seems to demand even more memory, even when bign = T.

I am running R-1.3.1 on Sun OS 5.6.

Thanks for any help.

Matthew Wiener
Applied Computer Science and Mathematics Department
Merck Research Labs
Rahway, NJ 07065-0900
732-594-5303

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
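The memory figures in the message above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original post. It assumes the distance matrix is stored as the lower triangle only, n(n-1)/2 values, each an 8-byte double; the function name is made up for illustration. Under those assumptions one copy for 12,500 objects comes to roughly 610,000 KB, which is close to the 609131 KB in the error message and consistent with the failed allocation being one more full copy.

```python
def dist_matrix_bytes(n, bytes_per_value=8):
    """Bytes for the lower triangle of an n x n distance matrix.

    Assumption: n*(n-1)/2 values are stored, each as an 8-byte double.
    """
    return n * (n - 1) // 2 * bytes_per_value


n = 12500
one_copy = dist_matrix_bytes(n)
mib = one_copy / 2**20

print(f"values stored: {n * (n - 1) // 2}")
print(f"one copy:      {mib:.0f} MiB")       # 596 MiB, i.e. "nearly 600 MB"
print(f"two copies:    {2 * mib:.0f} MiB")   # 1192 MiB, i.e. about 1200 MB
print(f"one copy:      {one_copy / 1024:.0f} KB")
```

A third copy would bring the total to about 1.8 GB, so how many copies are alive at once matters a great deal on a 2.5 GB machine.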
Agustin Lobo
2001-Nov-30 08:59 UTC
[R] memory issue trying to solve too large a problem using hclust
On Thu, 29 Nov 2001, Wiener, Matthew wrote:

> Hi, all.
>
> I'm trying to cluster 12,500 objects using hclust from package mva.

But does this make sense? I often use R for the statistical analysis of remotely sensed imagery, so I have much larger datasets. Might I suggest the following:

1. Study a subsample, applying many different methods (including hclust).
2. Define the centroids (both means and dispersions).
3. Use IDL, C, or R with C programs to assign all the objects to a centroid.
4. Select those objects with low maximum similarity and perform a dedicated analysis. Maybe there are rare classes that must be added to the set that was produced in 2., or maybe there are just rare objects that should be left as unclassified.

This procedure would have the advantage of spending more of your time on exploring the data than on system administration issues.

But this is just a suggestion.

Agus
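Steps 3 and 4 of the procedure above could be sketched as follows. This is a minimal Python illustration, not code from the thread. It assumes the centroids from step 2 are already available as plain coordinate tuples, uses Euclidean distance as a stand-in for "similarity" (so low maximum similarity becomes a large distance to the nearest centroid), and ignores the dispersions mentioned in step 2. The function names and the `max_dist` threshold are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_centroids(objects, centroids, max_dist):
    """Assign each object to its nearest centroid.

    Returns (labels, unclassified). An object whose nearest centroid is
    farther away than max_dist gets the label None and its index is
    collected in unclassified for a dedicated analysis later.
    """
    labels = []
    unclassified = []
    for i, obj in enumerate(objects):
        dists = [euclidean(obj, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        if dists[best] > max_dist:
            labels.append(None)
            unclassified.append(i)
        else:
            labels.append(best)
    return labels, unclassified


# Toy data: two centroids, one object that fits neither of them well.
centroids = [(0.0, 0.0), (10.0, 10.0)]
objects = [(0.5, -0.2), (9.0, 11.0), (5.0, 5.0), (0.1, 0.3)]

labels, leftovers = assign_to_centroids(objects, centroids, max_dist=3.0)
print(labels)     # [0, 1, None, 0]
print(leftovers)  # [2]
```

The assignment pass needs only one object and the centroids in memory at a time, so its memory use grows linearly with the number of objects instead of quadratically as a full distance matrix does.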
cstrato@EUnet.at
2001-Nov-30 20:04 UTC
[R] memory issue trying to solve too large a problem using hclust
Hi all, hi Matthew

I would like to extend this question and take the opportunity to ask all the famous statisticians in this group for advice.

First a personal comment :-) I am quite amused at how easy it sometimes is to find out which project someone writing to this group is working on: You mention that you want to cluster 12,500 objects. If I am correct, you are trying to cluster the 12,500 genes on the human Affymetrix GeneChip HgU95A, correct? (At least this is what I am just trying to do.)

Now to the questions, which I have wanted to ask for quite some time:

Since the time of the paper:
Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8.
most biologists working on gene expression use hierarchical clustering to cluster all genes they have on their DNA chips. Next year we will see chips containing more than 20,000 genes on one chip. Thus the questions are:

1. What is the best way to cluster this number of genes? Sometimes, I have heard, you should first use k-means to divide the genes into a few subclusters, and use hierarchical clustering for the subclusters only. Is this correct?

2. When you do hierarchical clustering, what metric would be best to use? M. Eisen's paper describes Pearson correlation as the metric. Is there a way to implement this metric for use in hclust? Unfortunately, hclust supports only euclid and manhattan.

3. R/S contain some other cluster algorithms such as CLARA, PAM, FANNY, AGNES. However, I have never seen any paper on expression profiling using these algorithms. Is there a special reason why these functions are not used?

4. Meanwhile, new methods for cluster analysis have been developed. For example, the book "Data Mining" by Han & Kamber mentions BIRCH, CURE, DBSCAN, OPTICS, DENCLUE, STING as some of these new algorithms. Would it make sense to use one of these methods? Does someone know if implementations of these functions exist?

5. As I understand it, there is no single "best" cluster algorithm for this purpose; you have to try different methods and find out which one describes the data best. This is often easy when you cluster samples, but hard when trying to cluster 20,000 or even more genes.

6. Are there better methods other than clustering which could group genes with similar behavior? PCA may be one method, but it is based on dimensionality reduction, which may not be applicable in many cases?

I know that in this group questions about clustering large datasets have partly been answered, but I have the feeling that many of these questions remain open, especially when applied to expression profiling. I also know that many people working in this field use R/S as their main tool, so any help would be appreciated, and not only by me.

Best regards
Christian Stratowa
----------------------------------
C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
V.i.e.n.n.a, A.u.s.t.r.i.a
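On question 2 in the message above, a correlation-based dissimilarity can be computed separately from the clustering step. The following is a minimal Python sketch, not code from the thread and not necessarily the exact measure used in the Eisen paper. It assumes the common choice d = 1 - r, where r is the Pearson correlation of two expression profiles; the function names and toy profiles are hypothetical. In R, hclust takes a dissimilarity object as input rather than a metric name, so something along the lines of as.dist(1 - cor(t(x))) should be usable in the same way, though that expression is untested here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance_matrix(profiles):
    """Symmetric matrix of 1 - r for every pair of profiles (assumed measure)."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d[i][j] = d[j][i] = 1.0 - pearson(profiles[i], profiles[j])
    return d


# Toy expression profiles: one row per gene, one column per sample.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same shape as gene 0, so distance is about 0
    [4.0, 3.0, 2.0, 1.0],  # mirror image of gene 0, so distance is about 2
]

d = correlation_distance_matrix(genes)
for row in d:
    print([round(v, 3) for v in row])
```

Genes 0 and 1 differ by a factor of two in magnitude but end up at distance 0, which is the usual argument for correlation over Euclidean distance when the shape of the expression pattern matters more than its scale. The memory problem from the start of the thread is unchanged by the choice of measure, since the matrix still holds n(n-1)/2 distinct values.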