thr3ads.net - R help - [R] memory issue trying to solve too large a problem using hclust [Nov 2001]

Wiener, Matthew

2001-Nov-29 17:13 UTC

[R] memory issue trying to solve too large a problem using hclust

Hi, all.

I'm trying to cluster 12,500 objects using hclust from package mva.  The
distance matrix takes up nearly 600 MB.  The distance matrix also needs to
be copied when being passed to the fortran routine that actually does the
clustering (it's modified during the clustering), so that's 1200 MB.  I'm
actually on a machine with 2.5 GB of memory (and nothing else running), so I
thought I could pull this off.  The routine quits with the error "cannot
allocate a vector of size 609131 KB", which by its size seems to be another
copy of the distance matrix, I think the one needed by the fortran routine.
As far as I can tell from looking at the code, no additional objects of the
size of the distance matrix are used.
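
Editorial aside: the sizes quoted above can be checked with a little arithmetic. The sketch below is in Python only because the thread itself contains no code. It assumes that the distance object stores just the lower triangle, n(n-1)/2 values, as 8-byte doubles. Under those assumptions 12,500 objects need about 596 MB, and the 609131 KB in the error message corresponds to about 12,488 objects, so the failed allocation does look like one more full copy of the distance matrix. The function names are made up for illustration.

```python
# Back-of-the-envelope size of a distance object.
# Assumptions (not stated in the thread): only the lower triangle is stored,
# i.e. n * (n - 1) / 2 values, each an 8-byte double.

def dist_bytes(n, bytes_per_value=8):
    """Bytes needed for the lower triangle of an n x n distance matrix."""
    return n * (n - 1) // 2 * bytes_per_value

def to_kb(n_bytes):
    """Convert bytes to kilobytes (1 KB = 1024 bytes)."""
    return n_bytes / 1024

def to_mb(n_bytes):
    """Convert bytes to megabytes (1 MB = 1024 * 1024 bytes)."""
    return n_bytes / 1024 ** 2

if __name__ == "__main__":
    for n in (12500, 12488):
        b = dist_bytes(n)
        print(f"n = {n}: {to_kb(b):.0f} KB, {to_mb(b):.1f} MB per copy")
```

Three such copies would come to roughly 1.75 GB, which is more than the 1433 MB reported by gc() but less than the 2.5 GB of physical memory.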

After the error gc() says that the garbage collection threshold is 1433 MB.

I'm wondering whether some additional copies of the distance matrix are
being made, and whether I could somehow stop them from being made.  Any
other suggestions for how I could get around the memory problem would also
be appreciated.  (I know of clara in the "cluster" package, but would like
to use hierarchical methods.)

The function hierclust in multiv seems to demand even more memory, even when
bign = T.

I am running R-1.3.1 on Sun OS 5.6.

Thanks for any help.

Matthew Wiener
Applied Computer Science and Mathematics Department
Merck Research Labs
Rahway, NJ  07065-0900
732-594-5303


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Agustin Lobo

2001-Nov-30 08:59 UTC


On Thu, 29 Nov 2001, Wiener, Matthew wrote:
> Hi, all.
> 
> I'm trying to cluster 12,500 objects using hclust from package mva.  The

But does this make sense? I often use R for the statistical analysis of
remotely sensed imagery, so I have much larger datasets. Might I suggest the
following:

1. Study a subsample, applying many different methods (including hclust).
2. Define the centroids (both means and dispersions).
3. Use IDL, C, or R with C programs to assign all the objects to a centroid.
4. Select those objects with low maximum similarity and perform a dedicated
   analysis. Maybe there are rare classes that must be added to the set
   produced in step 2, or maybe there are just rare objects that should be
   left unclassified.
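
Editorial aside: steps 3 and 4 above can be sketched as follows, in Python for illustration only. This is a minimal sketch, not the procedure Agustin actually uses. It assumes Euclidean distance, uses only the centroid means (the dispersions from step 2 are ignored), and replaces "low maximum similarity" with a fixed distance cut-off; `max_dist` and the function names are hypothetical. It never builds the full n x n distance matrix, so memory use grows with n times the number of centroids.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point (Euclidean)."""
    best_index, best_dist = -1, math.inf
    for i, c in enumerate(centroids):
        d = math.dist(point, c)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

def assign_all(points, centroids, max_dist):
    """Assign each point to its nearest centroid.

    Points farther than max_dist from every centroid get the label None,
    i.e. they are left unclassified for a separate, dedicated analysis.
    """
    labels = []
    for p in points:
        i, d = nearest_centroid(p, centroids)
        labels.append(i if d <= max_dist else None)
    return labels

# Toy data: two centroids found on a subsample, three objects to assign.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.0, 11.0), (5.0, 5.0)]
labels = assign_all(points, centroids, max_dist=3.0)
print(labels)
```

In the toy data the third point is equally far from both centroids and beyond the cut-off, so it is the kind of object step 4 would set aside.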

This procedure would have the advantage of letting you spend more of your
time exploring the data than on system administration issues.

But this is just a suggestion.

Agus


cstrato@EUnet.at

2001-Nov-30 20:04 UTC


Hi all, hi Matthew

I would like to extend this question and take the opportunity
to ask all the famous statisticians in this group for advice.

First a personal comment :-)
I am quite amused at how easy it is sometimes to find out which
project someone writing to this group is working on.
You mention that you want to cluster 12,500 objects. If I am not
mistaken, you are trying to cluster the 12,500 genes on the
human Affymetrix GeneChip HgU95A, correct?
(At least that is what I am trying to do right now.)

Now to the questions, which I wanted to ask for quite some time:

Since the time of the paper:
Eisen MB, Spellman PT, Brown PO, Botstein D.
Cluster analysis and display of genome-wide expression patterns.
Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8.
most biologists working on gene expression use hierarchical
clustering to cluster all genes they have on their DNA-chips.
Next year we will see chips containing more than 20,000 genes
on one chip.

Thus the question is:
1. What is the best way to cluster this many genes?
I have sometimes heard that you should first use k-means to
divide the genes into a few subclusters, and then use hierarchical
clustering within the subclusters only. Is this correct?

2. When you do hierarchical clustering, which metric is
best to use?
M. Eisen's paper describes Pearson correlation as the metric.
Is there a way to use this metric with hclust?
Unfortunately, hclust supports only euclidean and manhattan.

3. R/S contain some other clustering algorithms such as CLARA,
PAM, FANNY, and AGNES. However, I have never seen a paper on
expression profiling that uses these algorithms. Is there a particular
reason why these functions are not used?

4. Meanwhile, new methods for cluster analysis have been
developed. For example, the book "Data Mining" by Han & Kamber
mentions BIRCH, CURE, DBSCAN, OPTICS, DENCLUE, and STING
as some of these new algorithms.
Would it make sense to use one of these methods?
Does anyone know whether implementations of these algorithms
exist?

5. As I understand it, there is no single "best" clustering
algorithm for this purpose; you have to try different methods
and find out which one describes the data best.
This is often easy when you cluster samples, but hard
when you try to cluster 20,000 or more genes.

6. Are there better methods than clustering for grouping
genes with similar behavior?
PCA may be one method, but it is based on dimensionality reduction,
which may not be applicable in many cases.
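
Editorial aside on question 2: a hierarchical clustering routine that accepts an arbitrary dissimilarity matrix can use Pearson correlation by converting it to a dissimilarity, for example d = 1 - r. The Python sketch below shows only that conversion. It is an illustration under stated assumptions, not Eisen's implementation; the function names are made up. In R the same idea is commonly written as hclust(as.dist(1 - cor(t(x)))), although whether as.dist is available in the mva package of R 1.3.1 is an assumption to check.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(rows):
    """Lower-triangle dissimilarities d = 1 - r, keyed by (i, j) with i > j."""
    return {
        (i, j): 1.0 - pearson(rows[i], rows[j])
        for i in range(len(rows))
        for j in range(i)
    }

# Toy expression profiles: rows are genes, columns are samples.
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same shape as row 0, different scale
    [3.0, 2.0, 1.0],   # mirror image of row 0
]
d = correlation_distance(profiles)
for pair in sorted(d):
    print(pair, round(d[pair], 6))
```

With this choice, profiles of the same shape have distance 0 regardless of scale, and perfectly anti-correlated profiles have distance 2. If anti-correlated genes should count as similar, 1 - |r| is a common alternative.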

I know that questions about clustering large datasets have partly been
answered in this group, but I have the feeling that many of these
questions remain open, especially when applied to expression
profiling.

I also know that many people working in this field use R/S
as their main tool, so any help would be appreciated, and not only
by me.

Best regards
Christian Stratowa
----------------------------------
C.h.r.i.s.t.i.a.n  S.t.r.a.t.o.w.a
V.i.e.n.n.a,  A.u.s.t.r.i.a
