Hi,

Here is what I want to do. I have a dataset containing 4.2 *million* rows and about 10 columns, and I want to do some statistics with it, mainly using it as a prediction set for GAM and GLM models. I tried to load it from a CSV file but, after filling up memory and part of the swap (1 GB each), I get a segmentation fault and R stops. I use R under Linux. Here are my questions:

1) Has anyone ever tried to use such a big dataset?
2) Do you think that it would be possible on a more powerful machine, such as a cluster of computers?
3) Finally, does R have some "memory limitation", or does it just depend on the machine I'm using?

Best wishes

Fabien Fivaz
A matrix of that size takes up just over 320 MB to store in memory. I'd imagine you can probably do it with 2 GB of physical RAM (assuming your `columns' are all numeric variables, i.e., no factors). However, perhaps a better way than the brute-force, one-shot approach is to read the data in chunks and do the prediction piece by piece. You can use scan(), or open()/readLines()/close(), to do this fairly easily.

My understanding of how (most) clusters work is that you need at least one node that can accommodate the memory load of the monolithic R process, so a cluster is probably not much help. (I could very well be wrong about this. If so, I'd be very grateful for correction.)

HTH,
Andy
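A minimal sketch of the chunk-wise approach described above, assuming an already-fitted model object called `fit` (GLM or GAM) and a comma-separated file "big.csv" with a plain, unquoted header line; the file name, model object, and chunk size are all made up for illustration:

    ## Chunk-wise prediction: read a block of lines, turn it into a small
    ## data frame, predict, and move on -- the full 4.2-million-row table
    ## is never held in memory at once.
    con <- file("big.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]  # column names from the header line

    chunk.size <- 100000          # rows per chunk; tune to available RAM
    preds <- list()
    repeat {
      lines <- readLines(con, n = chunk.size)
      if (length(lines) == 0) break
      tc <- textConnection(lines)
      chunk <- read.csv(tc, header = FALSE, col.names = header)
      close(tc)
      preds[[length(preds) + 1]] <- predict(fit, newdata = chunk,
                                            type = "response")
    }
    close(con)
    all.preds <- unlist(preds)    # predictions for all rows, in file order

If some of the predictors are factors, each chunk's factor levels should be forced to match the levels used when the model was fitted, otherwise predict() may complain or mis-code them.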
By my calculation, your dataset should occupy less than 400 MB of RAM, so this is not a terribly large dataset (these days). But that does not include any attributes (such as row names), which often also take up a lot of memory. Considering that a function like read.csv() makes a copy of the dataset, your actual requirement is about 800 MB, which may be too much for a 1 GB machine depending on what else the computer is doing. I have successfully loaded *much* bigger datasets (2-4 GB) into R without a problem.

Some possible solutions are:

1. Buy more RAM.
2. Use scan(), which doesn't make a copy of the dataset (see the sketch after this message).
3. Use a 64-bit machine and buy even more RAM.

Using a cluster of computers doesn't really help in this situation, because there's no easy way to spread a dataset across multiple machines; you would still be limited by the memory on a single machine. As far as I know, R does not have a "memory limitation" -- the only limit is the memory installed on your computer.

-roger
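A minimal sketch of option 2 above, assuming the file is called "big.csv", has a header line, and contains ten purely numeric columns (all of these are assumptions for illustration; adjust to the real file):

    ## Read the whole file with scan(), which avoids the extra copy that
    ## read.csv() makes, then reshape the numeric vector into a matrix.
    x <- scan("big.csv", what = double(), sep = ",", skip = 1)
    dat <- matrix(x, ncol = 10, byrow = TRUE)
    object.size(dat)   # 4.2e6 rows x 10 cols x 8 bytes, roughly 320 MB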
You were all right. My data, when I load it with scan(), takes only about 300 MB of memory and I have no problem with it. When loaded with scan() it is not yet a matrix, but I can easily convert it to one with matrix(). The problem is that I then have to convert it to a data frame (I have a mix of numbers and factors). That conversion takes some time, but it's OK. However, I cannot read or work with the resulting data frame; it always ends in a seg fault! I *just* did variable[1] (where variable is the name of my variable :-)), and it returned a seg fault. Why is there such a difference between matrices and data frames? Is it because data frames store much more information?

Best wishes,
Fabien
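One way to sidestep the matrix-to-data-frame conversion entirely is to have scan() return the columns as a list in the first place, so the data frame is assembled column by column and no large intermediate matrix is ever created. A minimal sketch, with made-up column names and types (replace them with the real ten columns):

    ## Describe each column's type in the `what' list; scan() then returns
    ## a list with one vector per column.
    template <- list(x1 = double(), x2 = double(), site = character(),
                     habitat = character())      # ...ten columns in all
    cols <- scan("big.csv", what = template, sep = ",", skip = 1)

    ## Turn the character columns into factors, then build the data frame.
    cols$site    <- factor(cols$site)
    cols$habitat <- factor(cols$habitat)
    dat <- as.data.frame(cols)
    str(dat)   # check the column types before handing dat to predict()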