Dear all,

I'm trying to estimate clusters from a very large dataset using clara, but
the program stops with a memory error. The (very simple) code and the error:

    mydata <- read.dbf(file="fnorsel_4px.dbf")
    my.clara.7k <- clara(mydata, k=7)

    Error: cannot allocate vector of size 465108 Kb

The dataset contains >3,000,000 rows and 15 columns. I'm using a Windows
computer with 1.5G RAM; I also tried changing the memory limit to the
maximum possible (4000M).

Is there a way to calculate clara clusters from such large datasets?

Thanks a lot.

Nestor.-
From the help page:

    'clara' is fully described in chapter 3 of Kaufman and Rousseeuw
    (1990). Compared to other partitioning methods such as 'pam', it can
    deal with much larger datasets. Internally, this is achieved by
    considering sub-datasets of fixed size ('sampsize') such that the
    time and storage requirements become linear in n rather than
    quadratic.

and the default for 'sampsize' is apparently at least nrow(x). So you need
to set 'sampsize' (and perhaps 'samples') appropriately.

On Wed, 3 Aug 2005, Nestor Fernandez wrote:

> Dear all,
>
> I'm trying to estimate clusters from a very large dataset using clara,
> but the program stops with a memory error. The (very simple) code and
> the error:
>
> mydata <- read.dbf(file="fnorsel_4px.dbf")
> my.clara.7k <- clara(mydata, k=7)
>
>> Error: cannot allocate vector of size 465108 Kb
>
> The dataset contains >3,000,000 rows and 15 columns. I'm using a Windows
> computer with 1.5G RAM; I also tried changing the memory limit to the
> maximum possible (4000M)

Actually, the limit is probably 2048M: see the rw-FAQ Q on memory limits.

> Is there a way to calculate clara clusters from such large datasets?

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
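Concretely, this suggestion might look like the sketch below. Both
arguments are documented in ?clara, but the values chosen for 'samples'
and 'sampsize' here are illustrative assumptions, not recommendations:

    ## Sketch: fix the sub-sample size so clara's memory use stays
    ## bounded instead of growing with nrow(mydata).
    library(foreign)   # read.dbf()
    library(cluster)   # clara()

    mydata <- read.dbf(file = "fnorsel_4px.dbf")
    my.clara.7k <- clara(mydata, k = 7,
                         samples  = 50,    # number of sub-datasets drawn
                         sampsize = 1000)  # rows per sub-dataset; << nrow(mydata)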
I thought setting keep.data=FALSE might help, but running this on a 32-bit
Linux machine, the R process seems to use 1.2 GB until just before clara
returns, when it increases to 1.9 GB, regardless of whether keep.data is
FALSE or TRUE. Possibly it's the overhead of the .C() interface, but that's
mostly an uninformed guess.

You could sample your data (say half), remove the original, run clara, keep
the medoids, then read your data again and assign each observation to the
nearest medoid. This is what clara does anyway, with much smaller samples
by default.

Reid Huntsinger
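A sketch of this workaround, assuming all 15 columns are numeric so that
Euclidean distance makes sense (the variable names are illustrative):

    ## Cluster a half-sample, keep the medoids, then assign every row
    ## of the full data to its nearest medoid.
    library(foreign)
    library(cluster)

    mydata <- read.dbf(file = "fnorsel_4px.dbf")
    half <- mydata[sample(nrow(mydata), nrow(mydata) %/% 2), ]
    rm(mydata); gc()                 # free the full copy before clustering

    fit  <- clara(half, k = 7, keep.data = FALSE)
    meds <- fit$medoids              # 7 x 15 matrix of medoid rows
    rm(half); gc()

    ## Re-read the full data and assign each row to the closest medoid.
    mydata <- as.matrix(read.dbf(file = "fnorsel_4px.dbf"))
    d2 <- sapply(seq_len(nrow(meds)), function(j)
        rowSums(sweep(mydata, 2, meds[j, ])^2))  # squared distances, n x 7
    cluster.id <- max.col(-d2)       # column index of the smallest distance

Note that sweep() makes a temporary copy of the data for each medoid, so
this still needs a few hundred MB of headroom for 3,000,000 rows; it just
avoids ever forming an n x n dissimilarity structure.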
>>>>> "Nestor" == Nestor Fernandez <nestor.fernandez at ufz.de> >>>>> on Wed, 03 Aug 2005 18:44:38 +0200 writes:Nestor> I'm trying to estimate clusters from a Nestor> very large dataset using clara but the program stops Nestor> with a memory error. The (very simple) code and the Nestor> error: Nestor> mydata<-read.dbf(file="fnorsel_4px.dbf") Nestor> my.clara.7k<-clara(mydata,k=7) >> Error: cannot allocate vector of size 465108 Kb Nestor> The dataset contains >3,000,000 rows and 15 Nestor> columns. I'm using a windows computer with 1.5G RAM; Nestor> I also tried changing the memory limit to the Nestor> maximum possible (4000M) Is there a way to calculate Nestor> clara clusters from such large datasets? One way to start is reading the help ?clara more carefully and hence use clara(mydata, k=7, keep.data = FALSE) ^^^^^^^^^^^^^^^^^^^ But that might not be enough: You may need 64-bit CPU and an operating system (with system libraries and an R version) that uses 64-bit addressing, i.e., not any current version of M$ Windows. Nestor> Thanks a lot. you're welcome. Martin Maechler, ETH Zurich