Hi,

Here is what I want to do. I have a dataset containing 4.2 *million* rows and about 10 columns, and I want to do some statistics with it, mainly using it as a prediction set for GAM and GLM models. I tried to load it from a CSV file but, after filling up memory and part of the swap (1 GB each), I get a segmentation fault and R stops. I use R under Linux. Here are my questions:

1) Has anyone ever tried to use such a big dataset?
2) Do you think that it would be possible on a more powerful machine, such as a cluster of computers?
3) Finally, does R have some "memory limitation", or does it just depend on the machine I'm using?

Best wishes

Fabien Fivaz
A matrix of that size takes up just over 320 MB to store in memory. I'd imagine you can probably do it with 2 GB of physical RAM (assuming your `columns' are all numeric variables, i.e., no factors). However, perhaps a better way than the brute-force, one-shot approach is to read the data in chunks and do the prediction piece by piece. You can use scan(), or open()/readLines()/close(), to do this fairly easily.

My understanding of how (most) clusters work is that you need at least one node that can accommodate the memory load of the monolithic R process, so a cluster is probably not much help. (I could very well be wrong about this. If so, I'd be very grateful for correction.)

HTH,
Andy
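A minimal sketch of the chunk-wise approach described above, assuming an already-fitted model object called `fit` (GLM or GAM) and a comma-separated file "big.csv" with a plain, unquoted header line; the file name, model object, and chunk size are all made up for illustration:

    ## Chunk-wise prediction: read a block of lines, turn it into a small
    ## data frame, predict, and move on -- the full 4.2-million-row table
    ## is never held in memory at once.
    con <- file("big.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]  # column names from the header line

    chunk.size <- 100000          # rows per chunk; tune to available RAM
    preds <- list()
    repeat {
      lines <- readLines(con, n = chunk.size)
      if (length(lines) == 0) break
      tc <- textConnection(lines)
      chunk <- read.csv(tc, header = FALSE, col.names = header)
      close(tc)
      preds[[length(preds) + 1]] <- predict(fit, newdata = chunk,
                                            type = "response")
    }
    close(con)
    all.preds <- unlist(preds)    # predictions for all rows, in file order

If some of the predictors are factors, each chunk's factor levels should be forced to match the levels used when the model was fitted, otherwise predict() may complain or mis-code them.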
By my calculation, your dataset should occupy less than 400 MB of RAM, so this is not a terribly large dataset (these days). But that does not include any attributes (such as row names), which often also take up a lot of memory. Considering that a function like read.csv() makes a copy of the dataset, your actual requirement is about 800 MB, which may be too much for a 1 GB machine depending on what else the computer is doing. I have successfully loaded *much* bigger datasets (2-4 GB) into R without a problem.

Some possible solutions are:

1. Buy more RAM.
2. Use scan(), which doesn't make a copy of the dataset (see the sketch after this message).
3. Use a 64-bit machine and buy even more RAM.

Using a cluster of computers doesn't really help in this situation, because there's no easy way to spread a dataset across multiple machines; you would still be limited by the memory on a single machine. As far as I know, R does not have a "memory limitation" -- the only limit is the memory installed on your computer.

-roger
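A minimal sketch of option 2 above, assuming the file is called "big.csv", has a header line, and contains ten purely numeric columns (all of these are assumptions for illustration; adjust to the real file):

    ## Read the whole file with scan(), which avoids the extra copy that
    ## read.csv() makes, then reshape the numeric vector into a matrix.
    x <- scan("big.csv", what = double(), sep = ",", skip = 1)
    dat <- matrix(x, ncol = 10, byrow = TRUE)
    object.size(dat)   # 4.2e6 rows x 10 cols x 8 bytes, roughly 320 MB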
You were all right. My data, when I load it with scan(), takes only about 300 MB of memory and I have no problem with it. When loaded with scan() it is not yet a matrix, but I can easily convert it to one with matrix(). The problem is that I then have to convert it to a data frame (I have a mix of numbers and factors). That conversion takes some time, but it's OK. However, I cannot read or work with the resulting data frame; it always ends in a seg fault! I *just* did variable[1] (where variable is the name of my variable :-)), and it returned a seg fault. Why is there such a difference between matrices and data frames? Is it because data frames store much more information?

Best wishes,
Fabien
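One way to sidestep the matrix-to-data-frame conversion entirely is to have scan() return the columns as a list in the first place, so the data frame is assembled column by column and no large intermediate matrix is ever created. A minimal sketch, with made-up column names and types (replace them with the real ten columns):

    ## Describe each column's type in the `what' list; scan() then returns
    ## a list with one vector per column.
    template <- list(x1 = double(), x2 = double(), site = character(),
                     habitat = character())      # ...ten columns in all
    cols <- scan("big.csv", what = template, sep = ",", skip = 1)

    ## Turn the character columns into factors, then build the data frame.
    cols$site    <- factor(cols$site)
    cols$habitat <- factor(cols$habitat)
    dat <- as.data.frame(cols)
    str(dat)   # check the column types before handing dat to predict()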