Karen.Green@sanofi-aventis.com
2006-Jan-05 22:18 UTC
[Rd] Q: R 2.2.1: Memory Management Issues?
Dear Developers,

I have a question about memory management in R 2.2.1 and am wondering if you would be kind enough to help me understand what is going on. (It has been a few years since I have done software development on Windows, so I apologize in advance if these are easy questions.)

-------------
MY SYSTEM
-------------
I am currently using R (version 2.2.1) on a PC running Windows 2000 (Intel Pentium M) with 785,328 KB (a little over 766 MB) of physical RAM. The R executable resides on the C drive, which is NTFS-formatted, reports 15.08 GB of free space, and has recently been defragmented. The defragmentation report for that drive gives:

------------------------------------------------
Volume (C:):
    Volume size                 = 35,083 MB
    Cluster size                = 512 bytes
    Used space                  = 19,642 MB
    Free space                  = 15,440 MB
    Percent free space          = 44 %
Volume fragmentation
    Total fragmentation         = 1 %
    File fragmentation          = 2 %
    Free space fragmentation    = 0 %
File fragmentation
    Total files                 = 121,661
    Average file size           = 193 KB
    Total fragmented files      = 64
    Total excess fragments      = 146
    Average fragments per file  = 1.00
Pagefile fragmentation
    Pagefile size               = 768 MB
    Total fragments             = 1
Directory fragmentation
    Total directories           = 7,479
    Fragmented directories      = 2
    Excess directory fragments  = 3
Master File Table (MFT) fragmentation
    Total MFT size              = 126 MB
    MFT record count            = 129,640
    Percent MFT in use          = 99 %
    Total MFT fragments         = 4
------------------------------------------------

---------
PROBLEM
---------
I am trying to run an R script which makes use of the MCLUST package. The script can successfully read in the approximately 17000 data points, but then throws an error:

--------------------------------------------------------
Error: cannot allocate vector of size 1115070Kb
In addition: Warning messages:
1: Reached total allocation of # Mb: see help(memory.size)
2: Reached total allocation of # Mb: see help(memory.size)
Execution halted
--------------------------------------------------------

after attempting the line

    summary(EMclust(y), y)

which is computationally intensive (it performs a "deconvolution" of the data into a series of Gaussian peaks), and where # is either 766 or 2048 (depending on the maximum memory size I set). The call I make is to Rterm.exe (to try to avoid Windows overhead):

"C:\Program Files\R\R-2.2.1\bin\Rterm.exe" --no-save --no-restore --vanilla --silent --max-mem-size=766M < "C:\Program Files\R\R-2.2.1\dTest.R"

(I have also tried it with 2048M, but with the same lack of success.)

------------
QUESTIONS
------------
(1) I had initially thought that Windows 2000 should be able to allocate up to about 2 GB of memory. So why is there a problem allocating a little over 1 GB on a defragmented disk with over 15 GB free? (Is this a pagefile size issue?)

(2) Do you think the origin of the problem is
    (a) the R environment, or
    (b) the function in the MCLUST package using an in-memory instead of an on-disk approach?

(3) (a) If the problem originates in the R environment, would switching to the Linux version of R solve the problem?
    (b) If the problem originates in the function in the MCLUST package, whom do I need to contact to get more information about rewriting the source code to handle large datasets?

Information I have located on overcoming Windows 2000 memory allocation limits [http://www.rsinc.com/services/techtip.asp?ttid=3346; http://www.petri.co.il/pagefile_optimization.htm] does not seem to help me understand this any better.
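(As a back-of-envelope reading of the error message above, and nothing more than that: the failing allocation is almost exactly the size of a pairwise half-matrix of double-precision values for these ~17000 points, which would suggest an O(n^2) working object inside the fit rather than a disk or pagefile issue. That interpretation is an inference, not something R or MCLUST reports.)

    kb      <- 1115070                 # "cannot allocate vector of size 1115070Kb"
    doubles <- kb * 1024 / 8           # about 142.7 million 8-byte doubles
    (1 + sqrt(1 + 8 * doubles)) / 2    # n such that n*(n-1)/2 == doubles: 16896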
I had initially upgraded to R version 2.2.1 because I had read [https://svn.r-project.org/R/trunk/src/gnuwin32/CHANGES/]:

------------------------------------------------------------------------------------
R 2.2.1
=======

Using the latest binutils allows us to distribute RGui.exe and Rterm.exe as large-address-aware (see the rw-FAQ Q2.9).

The maximum C stack size for RGui.exe and Rterm.exe has been increased to 10Mb (from 2Mb); this is comparable with the default on Linux systems and may allow some larger programs to run without crashes.
...
------------------------------------------------------------------------------------

and also from the Windows FAQ [http://cran.r-project.org/bin/windows/base/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021]:

------------------------------------------------------------------------------------
2.9 There seems to be a limit on the memory it uses!

Indeed there is. It is set by the command-line flag --max-mem-size (see How do I install R for Windows?) and defaults to the smaller of the amount of physical RAM in the machine and 1Gb. It can be set to any amount over 16M. (R will not run in less.) Be aware though that Windows has (in most versions) a maximum amount of user virtual memory of 2Gb. Use ?Memory and ?memory.size for information about memory usage. The limit can be raised by calling memory.limit within a running R session.

R can be compiled to use a different memory manager which might be better at using large amounts of memory, but is substantially slower (making R several times slower on some tasks).

In this version of R, the executables support up to 3Gb per process under suitably enabled versions of Windows (see <http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx>).
------------------------------------------------------------------------------------

Thank you in advance for any help you might be able to provide,

Karen
---
Karen M. Green, Ph.D.
Karen.Green@sanofi-aventis.com
Research Investigator
Drug Design Group
Sanofi Aventis Pharmaceuticals
1580 E. Hanley Blvd.
Tucson, AZ 85737-9525
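As a concrete illustration of the FAQ paragraph quoted above (these functions are the ones documented in ?memory.size and ?memory.limit in R for Windows; the values they return are machine-dependent, and 2047 is just an example size):

    memory.size()              # Mb currently in use by R
    memory.size(max = TRUE)    # maximum Mb obtained from Windows so far
    memory.limit()             # current limit in Mb (set by --max-mem-size)
    memory.limit(size = 2047)  # raise the limit within a running session

The same limit can instead be set when R is started, as with the --max-mem-size flag in the Rterm.exe call above.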
Karen,

On Jan 5, 2006, at 5:18 PM, <Karen.Green at sanofi-aventis.com> wrote:

> I am trying to run an R script which makes use of the MCLUST package.
> The script can successfully read in the approximately 17000 data
> points, but then throws an error:
> --------------------------------------------------------
> Error: cannot allocate vector of size 1115070Kb

This is 1.1GB of RAM to allocate for one vector alone(!). As you stated yourself, the total upper limit is 2GB, so you cannot even fit two of those in memory anyway - not much you can do with it even if it is allocated.

> summary(EMclust(y), y)

I suspect that memory is your least problem. Did you even try to run EMclust on a small subsample? I suspect that if you did, you would figure out that what you are trying to do is not likely to terminate within days...

> (1) I had initially thought that Windows 2000 should be able to
> allocate up to about 2 GB of memory. So why is there a problem
> allocating a little over 1 GB on a defragmented disk with over
> 15 GB free? (Is this a pagefile size issue?)

Because that is not the only 1GB vector that is allocated. Your "15GB/defragmented" is irrelevant - if anything, look at how much virtual memory is set up in your system's preferences.

> (2) Do you think the origin of the problem is
>     (a) the R environment, or
>     (b) the function in the MCLUST package using an in-memory
>         instead of an on-disk approach?

Well, a toy example of 17000x2 needs 2.3GB and is unlikely to terminate anytime soon, so I'd rather call it shooting with the wrong gun. Maybe you should consider a different approach to your problem - possibly ask on the BioConductor list, because people there have more experience with large data, and this is not really a technical question about R, but rather a question of how to apply statistical methods.

> (3) (a) If the problem originates in the R environment, would
>         switching to the Linux version of R solve the problem?

Any reasonable unix will do - technically (64-bit versions preferably, but in your case even 32-bit would do). Again, I don't think memory is your only problem here, though.

Cheers,
Simon
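A minimal sketch of the subsample test suggested above, assuming y is the numeric vector the script reads in and that the installed MCLUST provides EMclust with the interface used in the original call; the subsample sizes are arbitrary choices:

    17000^2 * 8 / 2^30                 # ~2.15 GiB for a full 17000 x 17000 double
                                       # matrix, i.e. roughly the 2.3GB cited above
    set.seed(1)
    for (n in c(1000, 2000, 4000)) {   # time EMclust on growing subsamples
      y.sub <- sample(y, n)
      print(system.time(summary(EMclust(y.sub), y.sub)))
    }

Watching how the timings grow with n gives a rough idea of whether a fit on all ~17000 points would finish in reasonable time at all.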
Karen.Green@sanofi-aventis.com
2006-Jan-06 00:36 UTC
[Rd] Q: R 2.2.1: Memory Management Issues?
Dear Simon,

Thank you for taking time to address my questions.

> > summary(EMclust(y), y)
>
> I suspect that memory is your least problem. Did you even try to run
> EMclust on a small subsample? I suspect that if you did, you would
> figure out that what you are trying to do is not likely to terminate
> within days...

The empirically derived limit on my machine (under R 1.9.1) was approximately 7500 data points. I have been able to successfully run the script that uses package MCLUST on several hundred smaller data sets.

I had even written a work-around for the case of more than 9600 data points (the limit when using R 2.2.1). My work-around first orders the points by their value and then takes a sample (e.g. every other point, or 1 point in every n) in order to bring the number under 9600. No problems with the computations were observed, but you are correct that a deconvolution of that larger dataset of 9600 points takes almost 30 minutes. However, for our purposes we do not have many datasets over 9600 points, so the time is not a major constraint.

Unfortunately, my management does not like using a work-around and really wants to operate on the larger data sets. I was told to find a way to make it work on the larger data sets, or else to avoid using R and find another solution.

From previous programming projects in a different scientific field long ago, I recall making the trade-off of using temp files instead of holding data in memory in order to make working with larger data sets possible. I am wondering if something like that would be possible for this situation, but I don't have enough knowledge at this moment to make that decision.

Karen
---
Karen M. Green, Ph.D.
Karen.Green at sanofi-aventis.com
Research Investigator
Drug Design Group
Sanofi Aventis Pharmaceuticals
Tucson, AZ 85737
On Thu, 5 Jan 2006, Simon Urbanek wrote:

> Karen,
>
> On Jan 5, 2006, at 5:18 PM, <Karen.Green at sanofi-aventis.com> wrote:
>
>> I am trying to run an R script which makes use of the MCLUST package.
>> The script can successfully read in the approximately 17000 data
>> points, but then throws an error:
>> --------------------------------------------------------
>> Error: cannot allocate vector of size 1115070Kb
>
> This is 1.1GB of RAM to allocate for one vector alone(!). As you
> stated yourself, the total upper limit is 2GB, so you cannot even fit
> two of those in memory anyway - not much you can do with it even if
> it is allocated.

Just in case people missed this (Simon as a MacOS user has no reason to know it), the Windows limit is in fact 3Gb if you tell your OS to allow it. (How to do so is in the quoted rw-FAQ, Q2.9, and from 2.2.1 R will automatically notice this, whereas earlier versions needed to be told.)

However, there is another problem with a 32-bit OS: you can only fit two 1.1Gb objects in a 3Gb address space if they are in specific positions, and fragmentation is often a big problem. I believe a 64-bit OS with 4Gb of RAM would handle such problems much more comfortably.

The alternative is to find (or write) more efficient mixture-fitting software than mclust.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
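Purely as an illustration of what "more efficient mixture-fitting software" could look like for one-dimensional data, a minimal EM fit for a univariate Gaussian mixture is sketched below. This is not mclust's algorithm: every name is invented for the sketch, the number of components k is assumed known (there is no BIC-based model selection as in EMclust), and there are no safeguards against degenerate components. Its working memory is a few n-by-k matrices rather than an n-by-n pairwise matrix.

    ## Minimal EM for a univariate Gaussian mixture with k components.
    em.gmm1d <- function(y, k, max.iter = 200, tol = 1e-8) {
      n     <- length(y)
      mu    <- quantile(y, probs = (1:k - 0.5) / k, names = FALSE)  # crude start
      sigma <- rep(sd(y), k)
      prop  <- rep(1 / k, k)
      loglik.old <- -Inf
      for (iter in 1:max.iter) {
        ## E-step: weighted component densities (n x k) and responsibilities
        dens <- sapply(1:k, function(j) prop[j] * dnorm(y, mu[j], sigma[j]))
        tot  <- rowSums(dens)
        z    <- dens / tot
        loglik <- sum(log(tot))
        ## M-step: update mixing proportions, means and standard deviations
        nk    <- colSums(z)
        prop  <- nk / n
        mu    <- colSums(z * y) / nk
        sigma <- sqrt(sapply(1:k, function(j) sum(z[, j] * (y - mu[j])^2) / nk[j]))
        if (abs(loglik - loglik.old) < tol * (abs(loglik) + tol)) break
        loglik.old <- loglik
      }
      list(mean = mu, sd = sigma, prop = prop, loglik = loglik, iterations = iter)
    }

    ## e.g.  fit <- em.gmm1d(y, k = 5)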