I'm trying to read in datasets with roughly 150,000 rows and 600 features. I wrote a function using scan() to read it in (I have a 4GB linux machine) and it works like a charm. Unfortunately, converting the scanned list into a data.frame using as.data.frame() causes the memory usage to explode (it can go from 300MB for the scanned list to 1.4GB for a data.frame of 30,000 rows) and it fails claiming it cannot allocate memory (though it is still not close to the 3GB limit per process on my linux box - the message is "unable to allocate vector of size 522K").

So I have three questions:

1) Why is it failing even though there seems to be enough memory available?

2) Why is converting it into a data.frame causing the memory usage to explode? Am I using as.data.frame() wrongly? Should I be using some other command?

3) All the model fitting packages seem to want to use data.frames as their input. If I cannot convert my list into a data.frame, what can I do? Is there any way of getting around this?

Much thanks!
Nawaaz
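[For reference, a minimal sketch of the kind of scan()-based reader and conversion described above; the file name, separator, and column layout are assumptions, not details from the original post.]

    ## Hypothetical reconstruction of the approach described above: read
    ## each column with scan(), then convert the resulting list to a
    ## data.frame. File name, separator and column types are assumptions.
    read.features <- function(file, ncols = 600) {
        dat <- scan(file, what = rep(list(numeric(0)), ncols),
                    sep = ",", quiet = TRUE)
        names(dat) <- paste("V", seq_len(ncols), sep = "")
        as.data.frame(dat)   # this conversion is where memory usage blows up
    }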
I'm sure others with more experience will answer this, but for what it is worth, my experience suggests that memory issues are more often with the user than with the machine. I don't use Linux, so I can't make specific comments about the capacity of your machine. However, there is often a need for a copy of an object to be in memory while you are creating a new version of it. So if a data.frame can reach 1.4GB, there wouldn't be much space left if an original and a copy both had to exist for any reason. (I speculate that this may be the case rather than asserting it is the case.)

From a practical point of view, I assume that when you say you have 600 features, you are not going to use each and every one in the models that you may generate. So is it practical to limit the features to those that you wish to use before creating a data.frame? (A sketch of that idea follows below.)

In short, if you really do need to work this way, I suggest that you read as many of the frequent posts on memory issues as it takes until you are either fully conversant with the memory behaviour of the machine you have, or you have found one of the many suggested workarounds for this issue, such as working with a database and SQL. Using "large dataset" as a query on Jonathon Baron's website gave over 400 hits: http://finzi.psych.upenn.edu/nmz.html

Tom
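[A minimal sketch of the "limit the features first" idea, assuming the scanned object is a named list called `dat' and that `wanted' holds the names of the columns the model actually needs; both names are made up for illustration.]

    wanted <- c("y", "x1", "x2", "x17")   # placeholder column names
    dat.small <- dat[wanted]              # subset the scanned list first
    df <- as.data.frame(dat.small)        # a much smaller object to convert
    rm(dat.small); gc()                   # release the intermediate copy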
Does it solve your problem, at least in part, to use read.table() instead of scan(), since it imports the data directly into a data.frame? Let me know if it helps.
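[Something along these lines, where the file name, header and separator are guesses.]

    ## read.table() builds the data.frame directly, avoiding the
    ## separate list-to-data.frame conversion step
    dat <- read.table("features.dat", header = TRUE, sep = ",")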
I can usually read in large tables by very careful use of read.table(), without having to resort to scan(). In particular, using the `colClasses', `nrows', and `comment.char' arguments correctly can greatly reduce memory usage (and increase speed) when reading in data. Converting from a list to a data frame likely requires at least two copies of the data to be stored in memory. Also, are you using a 64-bit operating system? (A sketch of such a read.table() call follows below.)

-roger

--
Roger D. Peng
http://www.biostat.jhsph.edu/~rpeng/
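[For concreteness, a sketch of such a call, assuming roughly 150,000 rows of 600 numeric columns in a comma-separated file with a header; the file name and layout are assumptions.]

    ## colClasses avoids type guessing and intermediate copies,
    ## nrows lets R allocate the right amount of space up front,
    ## and comment.char = "" switches off comment scanning
    dat <- read.table("features.dat", header = TRUE, sep = ",",
                      colClasses = rep("numeric", 600),
                      nrows = 150000, comment.char = "")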