Dear useRs,

I recently began a job at a very large and heavily bureaucratic organization. We're setting up a research office, and statistical analysis will form the backbone of our work. We'll be working with large datasets such as the SIPP as well as our own administrative data.

Due to the bureaucracy, it will take some time to get the licenses for proprietary software like Stata. Right now, R is the only statistical software package on my computer. This, of course, is a huge limitation because R loads data directly into RAM, making it difficult (if not impossible) to work with large datasets. My computer has only 1000 MB of RAM, of which Microsucks Winblows devours 400 MB. To make my memory issues even worse, my computer has a virus scanner that runs every day, and I do not have the administrative rights to turn the damn thing off.

I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?

I've read that I should "carefully vectorize my code." What does that mean?

The "Introduction to R" manual suggests modifying input files with Perl. Any tips on how to get started? Would Perl Data Language (PDL) be a good choice? http://pdl.perl.org/index_en.html

I wrote a script that loads large datasets a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines. I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, I'll have to run the whole script all over again. Any suggestions?

Thanks,
- Eric
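P.S. Roughly, the loop looks like the sketch below. The file name, column positions and chunk size are just placeholders, but it shows the structure of what I'm doing:

infile <- "sipp_core.dat"    # placeholder file name
keep   <- c(1, 5, 12)        # placeholder positions of the variables I want
chunk  <- 5000               # placeholder number of lines per pass
skip   <- 0
first  <- TRUE

repeat {
  ## each pass re-reads (and skips) everything before the current chunk,
  ## which I suspect is part of why it is so slow
  block <- try(read.table(infile, skip = skip, nrows = chunk), silent = TRUE)
  if (inherits(block, "try-error") || nrow(block) == 0) break
  write.table(block[, keep], "extract.csv", sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
  skip  <- skip + chunk
  rm(block)                  # drop the chunk before reading the next one
}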
Hi Eric,

I'm facing a similar problem. Looking over the list of packages I came across:

R.huge: Methods for accessing huge amounts of data
http://cran.r-project.org/src/contrib/Descriptions/R.huge.html

I haven't installed it yet so I don't know how well it works. I probably won't have time until next week at the earliest to look at it. Would be interested in hearing your feedback if you do try it.

- Bruce
Eric Doviak <edoviak <at> earthlink.net> writes:

> I recently began a job at a very large and heavily bureaucratic
> organization. We're setting up a research office and statistical
> analysis will form the backbone of our work. We'll be working with
> large datasets such as the SIPP as well as our own administrative data.

We need to know more about what you need to do with those large data sets in order to help -- giving some specific examples would be useful. In many situations you can set up a database connection, or use Perl to select carefully and only load the observations/variables you need into R, but it's hard to make completely general suggestions.

I'm not sure what the purpose of your code to read a few lines of a data file and write it to a CSV file is ... ?

"Vectorizing" your code is figuring out a way to tell R how to do what you want as a single 'vector' operation -- for example, to remove NAs from a vector you could do this:

newvec = numeric(0)
for (i in seq(along = oldvec)) {
  if (!is.na(oldvec[i])) newvec = c(newvec, oldvec[i])
}

but this would be incredibly slow --

newvec = oldvec[!is.na(oldvec)]

or

newvec = na.omit(oldvec)

would be far faster.
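If you want to see how big the difference is, you can time both versions yourself on a made-up vector (a small self-contained example, nothing specific to your data):

oldvec <- rnorm(1e5)
oldvec[sample(1e5, 1e4)] <- NA     # sprinkle in some NAs

## loop version: grows the result one element at a time
system.time({
  newvec <- numeric(0)
  for (i in seq(along = oldvec)) {
    if (!is.na(oldvec[i])) newvec <- c(newvec, oldvec[i])
  }
})

## vectorized version: a single subsetting operation
system.time(newvec2 <- oldvec[!is.na(oldvec)])

identical(newvec, newvec2)         # same answer, a tiny fraction of the time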
Check out the biglm package for some tools that may be useful.
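For example, something along these lines will fit a linear regression without ever holding the whole file in memory. This is just an untested sketch -- the file name, chunk size and variable names are made up:

library(biglm)

chunk.size <- 10000                           # made-up chunk size
con <- file("sipp_extract.csv", open = "r")   # made-up file name

## the first chunk (with the header) starts the model
first <- read.csv(con, nrows = chunk.size)
fit <- biglm(totinc ~ age + educ, data = first)   # made-up variable names

## each remaining chunk updates the fit, then gets thrown away
repeat {
  nxt <- try(read.csv(con, header = FALSE, nrows = chunk.size,
                      col.names = names(first)), silent = TRUE)
  if (inherits(nxt, "try-error") || nrow(nxt) == 0) break
  fit <- update(fit, nxt)
}
close(con)
summary(fit)

biglm only keeps the cross-product matrix, so memory use stays roughly constant no matter how many rows you feed it.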
Just a note of thanks for all the help I have received. I haven't had a chance to implement any of your suggestions yet because I'm still trying to catalog all of them! Thank you so much!

Just to recap (for my own benefit and to create a summary for others):

Bruce Bernzweig suggested using the R.huge package.

Ben Bolker pointed out that my original message wasn't clear and asked what I want to do with the data. At this point, just getting a dataset loaded would be wonderful, so I'm trying to trim variables (and, if possible, observations as well). He also provided an example of "vectorizing."

Ted Harding suggested that I use AWK to process the data and provided the necessary code. He also tested his code on older hardware running GNU/Linux (or Unix?) and showed that AWK can process the data even when the computer has very little memory and processing power. Jim Holtman had similar success when he used Cygwin's Unix utilities on a machine running MS Windows. They both used the following code:

gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' < tempxx.txt > newdata.csv

Fortunately, there is a version of GAWK for MS Windows. Not that I like MS Windows. It's just that I'm forced to use that 19th-century operating system on the job. (After using Debian at home and happily running RKWard for my dissertation, returning to Windows World is downright depressing.)

Roland Rau suggested that I use a database with RSQLite and pointed out that RODBC can work with MS Access. He also pointed me to a sub-chapter in Venables and Ripley's _S Programming_ and to the "Whole-Object View" pages in John Chambers's _Programming with Data_.

Greg Snow recommended biglm for regression analysis with data that is too large to fit into memory.

Last, but not least, Peter Dalgaard pointed out that there are options within R itself. He suggests using the colClasses= argument when "reading" data and the what= argument when "scanning" data, so that you don't load more columns than necessary. He also provided the following script:

dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
D.lines <- grep("^D ", dict)
vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
head(vdict)

I'll try these solutions and report back on my success. Thanks again!

- Eric
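P.S. For my own notes, Peter's colClasses= suggestion would look something like the following. The file name and column types are made up; the point is that columns marked "NULL" are skipped and never stored in memory:

## keep only columns 1 and 3 of a (made-up) five-column CSV file
cc  <- c("integer", "NULL", "numeric", "NULL", "NULL")
dat <- read.csv("mydata.csv", colClasses = cc)

## the scan() equivalent: fields matched to NULL in 'what' are skipped
dat2 <- scan("mydata.csv", sep = ",", skip = 1,
             what = list(id = integer(0), NULL, x = numeric(0), NULL, NULL))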