Dear Lilia,
I'm not sure whether this is particularly helpful in your situation, but
sometimes it is possible to emulate your distance measure (exactly or
approximately) as the Euclidean distance between suitably rescaled and
transformed points. In that case, you can rescale and transform the
original data from which you computed the distances and run clara, which
then implicitly computes Euclidean distances.
Of course whether this works depends on the nature of your data and the
distance measure that you want to use.
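For instance, something along these lines (a sketch only: the log/scale
transformation and k = 10 are hypothetical placeholders for whatever fits
your actual distance measure):

library(cluster)

## Hypothetical: suppose your distance equals Euclidean distance on
## log-transformed, standardized variables; adjust to your own measure.
x  <- matrix(rexp(50000 * 5), ncol = 5)  # stand-in for your data
xt <- scale(log(x))                      # rescale/transform first
cl <- clara(xt, k = 10, samples = 50)    # Euclidean on transformed data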
Another possibility is to draw a random subset of, say, 3,000
observations, run pam on it, and assign the remaining observations to
their closest medoid "manually". This is essentially what clara does
anyway.
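In code, roughly (again just a sketch: dist() below is only a stand-in
for your own distance computation, and k = 10 is made up):

library(cluster)

set.seed(1)                                # x: your 50,000 x p data matrix
k   <- 10
sub <- sample(nrow(x), 3000)               # random subset of 3,000 rows
fit <- pam(dist(x[sub, ]), k, diss = TRUE) # pam on the subset's distances
med <- x[sub[fit$id.med], , drop = FALSE]  # the k medoid observations

## assign every observation to its closest medoid "manually"
## (squared Euclidean shown; substitute your distance measure)
cluster <- apply(x, 1, function(row)
  which.min(colSums((t(med) - row)^2)))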
Best regards,
Christian
On Mon, 16 May 2011, Lilia Nedialkova wrote:
> Hello everyone,
>
> I need to do k-medoids clustering for data which consists of 50,000
> observations. I have computed distances between the observations
> separately and tried to use those with pam().
>
> I got the "cannot allocate vector of length" error and I realize this
> job is too memory intensive. I am at a bit of a loss on what to do at
> this point.
>
> I can't use clara(), because I want to use the already computed
> distances.
>
> What is it that people do to perform clustering for such large data sets?
>
> I would greatly appreciate any form of suggestions that people may have.
>
> Thank you very much in advance.
>
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche