Dear useRs,

I recently began a job at a very large and heavily bureaucratic organization. We're setting up a research office, and statistical analysis will form the backbone of our work. We'll be working with large datasets such as the SIPP as well as our own administrative data.

Due to the bureaucracy, it will take some time to get the licenses for proprietary software like Stata. Right now, R is the only statistical software package on my computer. This, of course, is a huge limitation because R loads data directly into RAM, making it difficult (if not impossible) to work with large datasets. My computer has only 1000 MB of RAM, of which Microsucks Winblows devours 400 MB. To make my memory issues even worse, my computer has a virus scanner that runs every day, and I do not have the administrative rights to turn the damn thing off.

I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?

I've read that I should "carefully vectorize my code." What does that mean?

The "Introduction to R" manual suggests modifying input files with Perl. Any tips on how to get started? Would Perl Data Language (PDL) be a good choice? http://pdl.perl.org/index_en.html

I wrote a script that loads large datasets a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines. I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, I'll have to run the whole script all over again. Any suggestions?

Thanks,
- Eric
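P.S. Roughly, the loop looks like the sketch below. The file name, column positions and chunk size are just placeholders, but it shows the structure of what I'm doing:

infile <- "sipp_core.dat"    # placeholder file name
keep   <- c(1, 5, 12)        # placeholder positions of the variables I want
chunk  <- 5000               # placeholder number of lines per pass
skip   <- 0
first  <- TRUE

repeat {
  ## each pass re-reads (and skips) everything before the current chunk,
  ## which I suspect is part of why it is so slow
  block <- try(read.table(infile, skip = skip, nrows = chunk), silent = TRUE)
  if (inherits(block, "try-error") || nrow(block) == 0) break
  write.table(block[, keep], "extract.csv", sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
  skip  <- skip + chunk
  rm(block)                  # drop the chunk before reading the next one
}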
Hi Eric,

I'm facing a similar problem. Looking over the list of packages I came across:

R.huge: Methods for accessing huge amounts of data
http://cran.r-project.org/src/contrib/Descriptions/R.huge.html

I haven't installed it yet so I don't know how well it works. I probably won't have time until next week at the earliest to look at it. Would be interested in hearing your feedback if you do try it.

- Bruce
Eric Doviak <edoviak <at> earthlink.net> writes:

> I recently began a job at a very large and heavily bureaucratic
> organization. We're setting up a research office and statistical
> analysis will form the backbone of our work. We'll be working with
> large datasets such as the SIPP as well as our own administrative data.

We need to know more about what you need to do with those large data sets in order to help -- giving some specific examples would be useful. In many situations you can set up a database connection, or use Perl to select carefully and only load the observations/variables you need into R, but it's hard to make completely general suggestions.

I'm not sure what the purpose of your code to read a few lines of a data file and write it to a CSV file is ... ?

"Vectorizing" your code is figuring out a way to tell R how to do what you want as a single 'vector' operation -- for example, to remove NAs from a vector you could do this:

newvec = numeric(0)
for (i in seq(along = oldvec)) {
  if (!is.na(oldvec[i])) newvec = c(newvec, oldvec[i])
}

but this would be incredibly slow --

newvec = oldvec[!is.na(oldvec)]

or

newvec = na.omit(oldvec)

would be far faster.
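If you want to see how big the difference is, you can time both versions yourself on a made-up vector (a small self-contained example, nothing specific to your data):

oldvec <- rnorm(1e5)
oldvec[sample(1e5, 1e4)] <- NA     # sprinkle in some NAs

## loop version: grows the result one element at a time
system.time({
  newvec <- numeric(0)
  for (i in seq(along = oldvec)) {
    if (!is.na(oldvec[i])) newvec <- c(newvec, oldvec[i])
  }
})

## vectorized version: a single subsetting operation
system.time(newvec2 <- oldvec[!is.na(oldvec)])

identical(newvec, newvec2)         # same answer, a tiny fraction of the time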
Check out the biglm package for some tools that may be useful.
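For example, something along these lines will fit a linear regression without ever holding the whole file in memory. This is just an untested sketch -- the file name, chunk size and variable names are made up:

library(biglm)

chunk.size <- 10000                           # made-up chunk size
con <- file("sipp_extract.csv", open = "r")   # made-up file name

## the first chunk (with the header) starts the model
first <- read.csv(con, nrows = chunk.size)
fit <- biglm(totinc ~ age + educ, data = first)   # made-up variable names

## each remaining chunk updates the fit, then gets thrown away
repeat {
  nxt <- try(read.csv(con, header = FALSE, nrows = chunk.size,
                      col.names = names(first)), silent = TRUE)
  if (inherits(nxt, "try-error") || nrow(nxt) == 0) break
  fit <- update(fit, nxt)
}
close(con)
summary(fit)

biglm only keeps the cross-product matrix, so memory use stays roughly constant no matter how many rows you feed it.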
Just a note of thanks for all the help I have received. I haven't had a chance to implement any of your suggestions yet because I'm still trying to catalog all of them! Thank you so much!

Just to recap (for my own benefit and to create a summary for others):

Bruce Bernzweig suggested using the R.huge package.

Ben Bolker pointed out that my original message wasn't clear and asked what I want to do with the data. At this point, just getting a dataset loaded would be wonderful, so I'm trying to trim variables (and, if possible, observations as well). He also provided an example of "vectorizing."

Ted Harding suggested that I use AWK to process the data and provided the necessary code. He also tested his code on older hardware running GNU/Linux (or Unix?) and showed that AWK can process the data even when the computer has very little memory and processing power. Jim Holtman had similar success when he used Cygwin's Unix utilities on a machine running MS Windows. They both used the following code:

gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' < tempxx.txt > newdata.csv

Fortunately, there is a version of GAWK for MS Windows. Not that I like MS Windows. It's just that I'm forced to use that 19th-century operating system on the job. (After using Debian at home and happily running RKWard for my dissertation, returning to Windows World is downright depressing.)

Roland Rau suggested that I use a database with RSQLite and pointed out that RODBC can work with MS Access. He also pointed me to a sub-chapter in Venables and Ripley's _S Programming_ and to the "Whole-Object View" pages in John Chambers's _Programming with Data_.

Greg Snow recommended biglm for regression analysis with data that is too large to fit into memory.

Last, but not least, Peter Dalgaard pointed out that there are options within R itself. He suggests using the colClasses= argument when "reading" data and the what= argument when "scanning" data, so that you don't load more columns than necessary. He also provided the following script:

dict <- readLines("ftp://www.sipp.census.gov/pub/sipp/2004/l04puw1d.txt")
D.lines <- grep("^D ", dict)
vdict <- read.table(con <- textConnection(dict[D.lines])); close(con)
head(vdict)

I'll try these solutions and report back on my success. Thanks again!

- Eric
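P.S. For my own notes, Peter's colClasses= suggestion would look something like the following. The file name and column types are made up; the point is that columns marked "NULL" are skipped and never stored in memory:

## keep only columns 1 and 3 of a (made-up) five-column CSV file
cc  <- c("integer", "NULL", "numeric", "NULL", "NULL")
dat <- read.csv("mydata.csv", colClasses = cc)

## the scan() equivalent: fields matched to NULL in 'what' are skipped
dat2 <- scan("mydata.csv", sep = ",", skip = 1,
             what = list(id = integer(0), NULL, x = numeric(0), NULL, NULL))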