Markus Preisetanz
2006-Jan-26  19:48 UTC
[R] cluster analysis: "error in vector("double", length): given vector size is too big {Fehler in vector("double", length) : angegebene Vektorgröße ist zu groß}
Dear R Specialists,
 
when trying to cluster a data.frame with about 80,000 rows and 25 columns, I get
the above error message. I tried hclust (using dist), agnes (passing the
data.frame directly) and pam (passing the data.frame directly). I would prefer
not to fall back on clustering a random sample of the data.
 
The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM.
 
Does anybody know what to do?
 
Sincerely
___________________
Markus Preisetanz
Consultant
 
Client Vela GmbH
Albert-Roßhaupter-Str. 32
81369 München
fon:          +49 (0) 89 742 17-113
fax:          +49 (0) 89 742 17-150
mailto:markus.preisetanz@clientvela.com
Liaw, Andy
2006-Jan-27  02:36 UTC
RE: [R] cluster analysis: "error in vector("double", length): given vector size is too big {Fehler in vector("double", length) : angegebene Vektorgröße ist zu groß}
Let's do some simple calculation: the dist object from a data set with 80000 cases would have 80000 * (80000 - 1) / 2 elements, each of which takes 8 bytes to store in double precision. That's over 24GB if my arithmetic isn't too flaky. You'd have a devil of a time trying to do this on a 64-bit machine with 32GB RAM, let alone what you are using.

You'd have a much better chance sticking with algorithms that do not require storage of the (dis)similarity matrix.

Andy
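The estimate above can be checked in a few lines of R. This is only a sketch of the arithmetic: the variable names are made up for illustration, and it counts just the n(n-1)/2 stored distances, ignoring any working copies the clustering functions may make on top of that.

```r
# Size of a full dist object for n cases, stored in double precision.
n <- 80000                    # number of rows, taken from the original post
n_pairs <- n * (n - 1) / 2    # one distance per unordered pair (i, j), i < j
bytes <- n_pairs * 8          # 8 bytes per double

n_pairs                       # 3199960000 pairwise distances
bytes / 1e9                   # about 25.6 (decimal GB)
bytes / 2^30                  # about 23.8 (GiB)
```

Either way the figure is more than ten times the 2 GB of RAM on the machine in question.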
>>>>> "Markus" == Markus Preisetanz <Markus.Preisetanz at clientvela.com>
>>>>>     on Thu, 26 Jan 2006 20:48:29 +0100 writes:

    Markus> Dear R Specialists,
    Markus> when trying to cluster a data.frame with about 80.000 rows and
    Markus> 25 columns I get the above error message. I tried hclust (using
    Markus> dist), agnes (entering the data.frame directly) and pam (entering
    Markus> the data.frame directly). What I actually do not want to do is
    Markus> generate a random sample from the data.

Currently all the above-mentioned cluster methods work with full distance / dissimilarity objects, even if only internally, i.e. they store all d_{i,j} for 1 <= i < j <= n, that is n(n-1)/2 values, each of them in double precision, i.e. 8 bytes.

So: no chance with the above functions and n = 80'000.

    Markus> The machine I run R on is a Windows 2000 Server (Pentium 4) with
    Markus> 2 GB of RAM.

If you ran a machine with a 64-bit version of OS and R {typical case today: Linux on AMD Opteron}, you could go quite a bit higher than on your Windoze box {I vaguely remember I could do 'n = a few thousand' on our dual Opteron with 16 GBytes}, but 80'000 is definitely too large.

OTOH, there is clara() in the cluster package, which has been designed for such situations; CLARA := [C]lustering [LAR]ge [A]pplications. It is similar in spirit to pam() and *does* cluster all 80'000 observations, but does so by taking subsamples to construct the medoids (and you can ask it to take many medium-sized subsamples, instead of just the 5 small ones it takes by default).

Martin Maechler, ETH Zurich
maintainer of the "cluster" package.
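A call along the lines suggested above might look like the following. It is a minimal sketch, not a recommendation for the real data: the matrix x is simulated as a stand-in for the 80,000 x 25 data set, and the values of k, samples and sampsize are arbitrary choices for illustration.

```r
library(cluster)  # provides clara()

set.seed(1)
# Hypothetical stand-in for the real data: 80,000 rows, 25 numeric columns.
n <- 80000
p <- 25
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

k <- 4  # number of clusters; arbitrary here, choose it for your own data
fit <- clara(x, k = k,
             samples  = 50,    # many subsamples instead of the default 5
             sampsize = 500,   # medium-sized; the default is min(n, 40 + 2 * k)
             rngR     = TRUE)  # draw subsamples with R's RNG so set.seed() applies

length(fit$clustering)  # one cluster label for each of the n observations
fit$medoids             # the k medoids, one row each
```

Only a sampsize x sampsize dissimilarity matrix is needed for each subsample, so memory use stays small no matter how large n is.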