thr3ads.net - R help - [R] Hierarchical Cluster Analysis with large dataset [Nov 2013]

Petar Milin

2013-Nov-03 09:42 UTC

[R] Hierarchical Cluster Analysis with large dataset

Hello!
Can anyone give me advice on running Hierarchical Cluster Analysis on large
datasets? For example, 80000x10000. Calculating distances on such a
data frame seems impossible even on a very powerful computer.

Also, any other advice that would lead to a reduction of dimensionality,
i.e., clustering/grouping variables, would be more than welcome.

Many thanks,
PM

Ranjan Maitra

2013-Nov-03 14:01 UTC

You have two different issues here: the size of the dataset (the number
of observations, which prevents storing the distance matrix in memory)
and the number of variables (which does not, but probably hinders reading
in the dataset).
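
To put rough numbers on these two issues, here is a back-of-the-envelope
sketch. It is plain arithmetic written in Python rather than R, and it
assumes 8-byte double-precision values and that only the n*(n-1)/2 distinct
pairwise distances are stored (the lower triangle, as R's dist() keeps
them). The function names are made up for illustration.

```python
def dist_matrix_bytes(n_obs, bytes_per_value=8):
    """Bytes for the n*(n-1)/2 distinct pairwise distances among n_obs rows."""
    n_pairs = n_obs * (n_obs - 1) // 2
    return n_pairs * bytes_per_value


def raw_data_bytes(n_obs, n_vars, bytes_per_value=8):
    """Bytes for the raw n_obs x n_vars table itself (assumed all numeric)."""
    return n_obs * n_vars * bytes_per_value


n_obs, n_vars = 80_000, 10_000  # the sizes given in the question
print(f"distances: {dist_matrix_bytes(n_obs) / 1e9:.1f} GB")
print(f"raw data:  {raw_data_bytes(n_obs, n_vars) / 1e9:.1f} GB")
```

Under those assumptions the distances alone come to roughly 25.6 GB and the
raw table to roughly 6.4 GB, before any working copies are made.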

You need to provide more information here: why do you need or want
hierarchical clustering and, if you do, must you use only R? What
hardware do you have at your disposal? And so on.

Depending on your answers to the above, this may well be a research
problem in its own right.

HTH!

Best wishes,
Ranjan

-- 
Important Notice: This mailbox is ignored: e-mails are set to be
deleted on receipt. Please respond to the mailing list if appropriate.
For those needing to send personal or professional e-mail, please use
appropriate addresses.

Bert Gunter

2013-Nov-03 15:34 UTC

(Offlist, since this is just a personal comment).

I cannot help you -- but it sounds like the sort of thing that you
should look for on the Bioconductor list.

But I wonder how you could possibly interpret the results even if you
could get them. I would think they would be more noise than signal,
and making sense of such a mess would be hopeless. Maybe you need to
rethink your approach.

No need to respond to me, of course.

Cheers,
Bert

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

(650) 467-7374

Sarah Goslee

2013-Nov-03 22:01 UTC

Hi,

I think your dataset is too large to be interpretable, but in general
you should check out the cluster package, specifically clara(), which
is intended for use with large data.
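
The idea behind the CLARA method that clara() implements can be sketched as
follows: run k-medoids on a few small random subsamples, judge each
candidate set of medoids on the full data, and keep the best one. This is
only an illustrative sketch in plain Python, not the cluster package's code
or interface. The name clara_sketch, its arguments, and the toy data are all
made up, and an exhaustive search over the subsample stands in for the PAM
step that the real method uses.

```python
import itertools
import math
import random


def total_cost(points, medoids):
    """Sum over points of the Euclidean distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def clara_sketch(points, k, n_samples=5, sample_size=4, seed=0):
    """CLARA-style clustering: medoids come from subsamples, never from n x n."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        # Exhaustive k-medoids on the small sample (a stand-in for PAM).
        medoids = min(
            itertools.combinations(sample, k),
            key=lambda ms: total_cost(sample, ms),
        )
        # Judge the candidate medoids on the full data: n * k distances.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every observation to its nearest medoid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, best_medoids[j]))
        for p in points
    ]
    return list(best_medoids), labels


# Toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
medoids, labels = clara_sketch(points, k=2)
print(medoids, labels)
```

Only the subsample is ever searched for medoids, so the work per subsample
grows with n * k rather than with n^2.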

Sarah

Sarah Goslee
http://www.functionaldiversity.org

Thomas Lumley

2013-Nov-03 22:47 UTC

It's going to be slow: does it *have* to be hierarchical?

There are algorithms that don't require the whole distance matrix at once,
but when the number of dimensions is not small I don't think there are any
algorithms taking less than n^2 time even on average.

In applications where I have seen large-n clustering it has mostly been
variants of k-means, which take kn time and space, not n^2.
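
The kn cost can be seen in a minimal sketch of the basic k-means iteration
(Lloyd's algorithm): each pass compares every one of the n points with only
the k current centres, so no n x n distance matrix is ever formed. The
sketch is plain Python rather than R. The function name kmeans_sketch, the
toy data, and the evenly spaced choice of starting centres are illustrative
assumptions; real implementations usually use random or smarter starts.

```python
def kmeans_sketch(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns (centres, labels)."""
    # Deterministic start: k evenly spaced rows (an illustrative simplification).
    step = max(1, len(points) // k)
    centres = [list(points[j * step]) for j in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: n * k squared-distance evaluations.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])),
            )
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels


# Toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centres, labels = kmeans_sketch(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Memory use is the data itself plus k centres and n labels, which is why this
family of methods remains feasible when n is in the tens of thousands.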

Look at the Bioconductor flow-cytometry packages.

  -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland
