Dear Listers:

I have a question on handling a large dataset. I searched R-Search, and I
hope I can get more information as to my specific case.

First, my dataset has 1.7 billion observations and 350 variables, of which
300 are floats and 50 are integers. My system is a Linux box with 8 GB of
memory and a 64-bit CPU (currently, we don't plan to buy more memory).

> R.version
         _
platform i686-redhat-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    2
minor    1.1
year     2005
month    06
day      20
language R

If I want to do some analysis, for example randomForest, on a dataset, what
is the maximum number of observations I can load and still have the machine
run smoothly?

After figuring out that number, I want to do some sampling first, but I did
not find that read.table or scan can do this. I guess I could load the data
into MySQL and then use RMySQL to do the sampling, or use Python to subset
the data first. My question is: is there a way I can subsample directly
from the file using just R?

Thanks,
--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
I think the general advice is that around 1/4 to 1/3 of your available
memory is about the largest data set that R can handle -- and often
considerably less, depending on what you do and how you do it (because R's
semantics require explicitly copying objects rather than passing pointers).
Fancy tricks using environments might enable you to do better, but that
requires advice from a true guru, which I ain't.

See ?connections, ?scan, and ?seek for reading a file a chunk at a time
from a connection, thus enabling you to sample, say, one line of data from
each chunk.

I suppose you could do this directly with repeated calls to scan() or
read.table() by skipping more and more lines at the beginning of each call,
but I assume that is horridly inefficient and would take forever.

HTH.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process." - George E. P. Box
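A minimal sketch of the chunk-and-sample idea above, reading through an
open connection with scan() and keeping one randomly chosen row per chunk.
The file name "bigdata.txt", the chunk size, and the assumption of a
whitespace-delimited, header-less numeric file are illustrative, not taken
from the original post:

    ## read the file through an open connection, one chunk at a time,
    ## and keep one randomly chosen row from each chunk
    con <- file("bigdata.txt", open = "r")
    chunk.size <- 10000          # rows read per chunk (illustrative)
    n.col      <- 350            # columns per row
    sampled    <- list()
    i          <- 0

    repeat {
      chunk <- scan(con, what = double(), nlines = chunk.size,
                    quiet = TRUE)
      if (length(chunk) == 0) break           # end of file
      m <- matrix(chunk, ncol = n.col, byrow = TRUE)
      i <- i + 1
      sampled[[i]] <- m[sample(nrow(m), 1), ] # one random row per chunk
    }
    close(con)

    subsample <- do.call(rbind, sampled)

Because the connection stays open, each call to scan() picks up where the
previous one stopped, so the file is read exactly once.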
If my calculation is correct (very doubtful, sometimes), that's

> 1.7e9 * (300 * 8 + 50 * 4) / 1024^3
[1] 4116.446

or over 4 terabytes, just to store the data in memory.

To sample rows and read them into R, Bert's suggestion of using
connections, perhaps along with seek() for skipping ahead, would be what
I'd try. I had tried to do such things in Python as a chance to learn that
language, but I found that operationally it's easier to maintain the
project by doing everything in one language, namely R, if possible.

Andy
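A rough sketch of the seek() idea: jump to a random byte offset, discard
the (likely partial) line found there, and keep the next complete line.
The file name and sample size are illustrative, and rows are picked with
probability roughly proportional to the length of the preceding line, so
this is only approximately uniform unless the records are fixed-width:

    fname  <- "bigdata.txt"           # illustrative file name
    fsize  <- file.info(fname)$size   # total file size in bytes
    n.samp <- 1000                    # illustrative sample size
    lines  <- character(n.samp)

    con <- file(fname, open = "r")
    for (i in seq_len(n.samp)) {
      seek(con, where = floor(runif(1, 0, fsize)))  # random offset
      readLines(con, n = 1)                         # discard partial line
      ln <- readLines(con, n = 1)                   # next complete line
      lines[i] <- if (length(ln)) ln else NA
    }
    close(con)

    ## drop misses near end-of-file, then split on whitespace
    lines <- lines[!is.na(lines)]
    subsample <- do.call(rbind,
                         lapply(strsplit(lines, "[[:space:]]+"), as.numeric))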
Hi, Jim:

Thanks for the calculation. I think you won't mind if I cc the reply to
r-help too, so that I can get more info.

I assume you use 4 bytes for an integer and 8 bytes for a float, so
300 x 8 + 50 x 4 = 2600 bytes for each observation, right? I wish I could
have 500 x 8 GB of memory :) just kidding...

Definitely, sampling will be done as the first step. Some feature selection
(filtering, mainly) will be applied as well. Accepting Berton's suggestion,
I will probably use Python to do the sampling, since whenever I have "slow"
situations like this, Python never fails me. (I am not saying R is bad,
though.) I understand "I get what I pay for" here. But more information or
experience on R's handling of large datasets (like using RMySQL) would be
appreciated.

regards,
Weiwei

On 10/27/05, jim holtman <jholtman at gmail.com> wrote:
> Based on the numbers that you gave, if you wanted all the data in memory
> at once, you would need 4.4 TB of memory, about 500X what you currently
> have. Each of your observations will require about 2,600 bytes of memory.
> You probably don't want to have more than 25% of memory in a single
> object, since many of the algorithms make copies. This would limit you to
> about 700,000 observations at a time for processing.
>
> The real question is what you are trying to do with the data. Can you
> partition the data and do analysis on the subsets?
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 247 0281
>
> What is the problem you are trying to solve?

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
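Since RMySQL is mentioned as an option, a hedged sketch of sampling through
MySQL once the data have been loaded into a table. The table name,
connection details, sample size, and sampling fraction are all
illustrative; note that ORDER BY RAND() sorts the entire table, so on a
1.7-billion-row table a per-row RAND() filter (or sampling on an indexed id
column) may be more practical:

    library(RMySQL)

    ## connection details and table name are illustrative
    con <- dbConnect(MySQL(), user = "me", password = "secret",
                     dbname = "mydb", host = "localhost")

    ## simple but expensive: ORDER BY RAND() sorts the whole table
    samp <- dbGetQuery(con,
        "SELECT * FROM bigdata ORDER BY RAND() LIMIT 100000")

    ## cheaper: keep each row with (approximate) probability 0.0001
    samp2 <- dbGetQuery(con,
        "SELECT * FROM bigdata WHERE RAND() < 0.0001")

    dbDisconnect(con)

Either query returns an ordinary data frame that fits comfortably within
the ~700,000-row working limit discussed above.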