Markus Preisetanz
2006-Jan-26 19:48 UTC
[R] cluster analysis: "error in vector("double", length): given vector size is too big {Fehler in vector("double", length) : angegebene Vektorgröße ist zu groß}
Dear R Specialists,

when trying to cluster a data.frame with about 80,000 rows and 25 columns I get the above error message. I tried hclust (using dist), agnes (entering the data.frame directly) and pam (entering the data.frame directly). What I actually do not want to do is generate a random sample from the data.

The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM.

Does anybody know what to do?

Sincerely
___________________
Markus Preisetanz
Consultant

Client Vela GmbH
Albert-Roßhaupter-Str. 32
81369 München
fon: +49 (0) 89 742 17-113
fax: +49 (0) 89 742 17-150
markus.preisetanz@clientvela.com
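[For reference, a minimal sketch of the kind of calls described above -- the data.frame name df and k = 10 are only illustrative:]

    library(cluster)

    ## df: a data.frame with about 80,000 rows and 25 columns (illustrative name)

    ## hierarchical clustering via an explicit distance matrix --
    ## dist() has to allocate n*(n-1)/2 doubles, which is where the error occurs
    d  <- dist(df)
    hc <- hclust(d)

    ## agnes() and pam() accept the data.frame directly, but both build the
    ## dissimilarity matrix internally, so they fail in the same way
    ag <- agnes(df)
    pm <- pam(df, k = 10)    # k = 10 is a placeholder, not a recommendation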
Liaw, Andy
2006-Jan-27 02:36 UTC
RE: [R] cluster analysis: "error in vector("double", length): given vector size is too big {Fehler in vector("double", length) : angegebene Vektorgröße ist zu groß}
Let's do a simple calculation: the dist object from a data set with 80000 cases would have 80000 * (80000 - 1) / 2 elements, each of which takes 8 bytes to store in double precision. That's over 24GB if my arithmetic isn't too flaky. You'd have a devil of a time trying to do this on a 64-bit machine with 32GB RAM, let alone what you are using. You'd have a much better chance sticking with algorithms that do not require storage of the (dis)similarity matrix.

Andy

From: Markus Preisetanz
> when trying to cluster a data.frame with about 80,000 rows and 25 columns
> I get the above error message. I tried hclust (using dist), agnes
> (entering the data.frame directly) and pam (entering the data.frame
> directly). What I actually do not want to do is generate a random sample
> from the data.
>
> The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of
> RAM.
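[A quick back-of-the-envelope check of that arithmetic in R; this counts only the storage for the dissimilarities themselves, not any working copies the algorithms make:]

    n       <- 80000
    n.pairs <- n * (n - 1) / 2    # number of pairwise dissimilarities in a dist object
    bytes   <- n.pairs * 8        # 8 bytes per double-precision value
    bytes / 1e9                   # ~25.6 GB  (decimal gigabytes)
    bytes / 2^30                  # ~23.8 GiB (binary gigabytes)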
>>>>> "Markus" == Markus Preisetanz <Markus.Preisetanz at clientvela.com> >>>>> on Thu, 26 Jan 2006 20:48:29 +0100 writes:Markus> Dear R Specialists, Markus> when trying to cluster a data.frame with about 80.000 rows and 25 columns I get the above error message. I tried hclust (using dist), agnes (entering the data.frame directly) and pam (entering the data.frame directly). What I actually do not want to do is generate a random sample from the data. Currently all the above mentioned cluster methods work with full distance / dissimilarity objects, even if only internally, i.e. they store all d_{i,j} for 1 <= i < j <= n, i.e. n(n-1)/2 values, also each of them in double precision, i.e. 8 bytes. So: no chance with the above functions and n=80'000 Markus> The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM. If you would run an machine with a 64-bit version of OS and R {typical case today: Linux on AMD Opteron}, you could go up quite a bit higher than on your Windoze box, {I vaguely remember I could do 'n = a few thousand' on our dual opteron with 16 GBytes}, but 80'000 is definitely too large. OTOH, there is clara() in the cluster package, which has been designed for such situations, CLARA:= [C]lustering [LAR]ge [A]pplications. It is similar in spirit to pam(), *does* cluster all 80'000 observations but does so by taking sub samples to construct the medoids. (and you can ask it to take many medium size subsamples, instead of just 5 small sized ones as it does by default). Martin Maechler, ETH Zurich maintainer of "cluster" package.