I'm trying to read in datasets with roughly 150,000 rows and 600 features. I wrote a function using scan() to read it in (I have a 4GB linux machine) and it works like a charm. Unfortunately, converting the scanned list into a data.frame using as.data.frame() causes the memory usage to explode (it can go from 300MB for the scanned list to 1.4GB for a data.frame of 30,000 rows) and it fails claiming it cannot allocate memory (though it is still not close to the 3GB limit per process on my linux box - the message is "unable to allocate vector of size 522K").

So I have three questions:

1) Why is it failing even though there seems to be enough memory available?

2) Why is converting it into a data.frame causing the memory usage to explode? Am I using as.data.frame() wrongly? Should I be using some other command?

3) All the model fitting packages seem to want to use data.frames as their input. If I cannot convert my list into a data.frame, what can I do? Is there any way of getting around this?

Much thanks!
Nawaaz
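[For reference, a minimal sketch of the kind of scan()-based reader and conversion described above; the file name, separator, and column layout are assumptions, not details from the original post.]

    ## Hypothetical reconstruction of the approach described above: read
    ## each column with scan(), then convert the resulting list to a
    ## data.frame. File name, separator and column types are assumptions.
    read.features <- function(file, ncols = 600) {
        dat <- scan(file, what = rep(list(numeric(0)), ncols),
                    sep = ",", quiet = TRUE)
        names(dat) <- paste("V", seq_len(ncols), sep = "")
        as.data.frame(dat)   # this conversion is where memory usage blows up
    }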
I'm sure others with more experience will answer this, but for what it is worth, my experience suggests that memory issues are more often with the user than with the machine. I don't use Linux, so I can't make specific comments about the capacity of your machine. However, there is often a need for a copy of an object to be in memory while you are creating a new version of it. So if a data.frame can reach 1.4GB, there wouldn't be much space left if an original and a copy both had to exist for any reason. (I speculate that this may be the case rather than asserting it is the case.)

From a practical point of view, I assume that when you say you have 600 features, you are not going to use each and every one in the models that you may generate. So is it practical to limit the features to those that you wish to use before creating a data.frame? (A sketch of that idea follows below.)

In short, if you really do need to work this way, I suggest that you read as many of the frequent posts on memory issues as it takes until you are either fully conversant with the memory behaviour of the machine you have, or you have found one of the many suggested workarounds for this issue, such as working with a database and SQL. Using "large dataset" as a query on Jonathon Baron's website gave over 400 hits: http://finzi.psych.upenn.edu/nmz.html

Tom
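[A minimal sketch of the "limit the features first" idea, assuming the scanned object is a named list called `dat' and that `wanted' holds the names of the columns the model actually needs; both names are made up for illustration.]

    wanted <- c("y", "x1", "x2", "x17")   # placeholder column names
    dat.small <- dat[wanted]              # subset the scanned list first
    df <- as.data.frame(dat.small)        # a much smaller object to convert
    rm(dat.small); gc()                   # release the intermediate copy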
Does it solve your problem, at least in part, to use read.table() instead of scan(), since it imports the data directly into a data.frame? Let me know if it helps.
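[Something along these lines, where the file name, header and separator are guesses.]

    ## read.table() builds the data.frame directly, avoiding the
    ## separate list-to-data.frame conversion step
    dat <- read.table("features.dat", header = TRUE, sep = ",")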
I can usually read in large tables by very careful use of read.table(), without having to resort to scan(). In particular, using the `colClasses', `nrows', and `comment.char' arguments correctly can greatly reduce memory usage (and increase speed) when reading in data. Converting from a list to a data frame likely requires at least two copies of the data to be stored in memory. Also, are you using a 64-bit operating system? (A sketch of such a read.table() call follows below.)

-roger

--
Roger D. Peng
http://www.biostat.jhsph.edu/~rpeng/
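[For concreteness, a sketch of such a call, assuming roughly 150,000 rows of 600 numeric columns in a comma-separated file with a header; the file name and layout are assumptions.]

    ## colClasses avoids type guessing and intermediate copies,
    ## nrows lets R allocate the right amount of space up front,
    ## and comment.char = "" switches off comment scanning
    dat <- read.table("features.dat", header = TRUE, sep = ",",
                      colClasses = rep("numeric", 600),
                      nrows = 150000, comment.char = "")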