I'm working with a 350MB CSV file on a server that has 3GB of RAM, yet I'm hitting a memory error when I try to store the data frame into a survey design object, the R object that stores data for complex sample survey data.

When I launch R, I execute the following line from Windows:

"C:\Program Files\R\R-2.9.1\bin\Rgui.exe" --max-mem-size=2047M

Anything higher, and I get an error message saying the maximum has been set to 2047M.

Here are the commands:

> library(survey)

#this step takes more than five minutes
> data08<-read.csv("data08.csv",header=TRUE,nrows=210437)

> object.size(data08)
#329877112 bytes

#Looking at Windows Task Manager, Mem Usage for Rgui.exe is already 659,632K

> brr.dsgn <-svrepdesign( data = data08 , repweights = data08[, grep("^repwgt" , colnames( data08)) ], type = "BRR" , combined.weights = TRUE , weights = data08$mainwgt )
#Error: cannot allocate vector of size 254.5 Mb

#The survey design object does not get created.

#This also causes Windows Task Manager, Mem Usage to spike to 1,748,136K

#And here are some memory diagnostics
> memory.limit()
[1] 2047
> memory.size()
[1] 1449.06
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   131148   3.6     593642   15.9  15680924  418.8
Vcells 45479988 347.0  173526492 1324.0 220358611 1681.3

A description of the survey package can be found here:
http://faculty.washington.edu/tlumley/survey/

I tried creating a work-around by using the database-backed survey objects (DB SO), included in the survey package to conserve memory on larger datasets like this one. Unfortunately, I don't think the survey package supports database connections for replicate weight designs yet, since I've only been able to get a database connection working after creating a svydesign object and not a svrepdesign object, and also because neither the DB SO website nor the svrepdesign help page makes any mention of those parameters.

The DB SOs are described in detail here:
http://faculty.washington.edu/tlumley/survey/svy-dbi.html

Any advice would be truly appreciated.

Thanks,
Anthony Damico
tlumley at u.washington.edu
2009-Oct-23 17:24 UTC
[R] Memory Problems with CSV and Survey Objects
Yes, a 350Mb data frame is a bit big for 32-bit R to handle conveniently.

As you note, the survey package doesn't yet do database-backed replicate-weight designs. You can get the same effect yourself without too much work.

First, put the data into a database, such as SQLite. If you have the data frame read in then dbWriteTable will do it.

Now, drop most of the variables, keeping the sampling weights, replicate weights, and a couple of other variables. Create a svrepdesign() with the reduced data set.

When you want to do an analysis, use dbGetQuery() to load the variables you need for the analysis, and put them in the $variables component of the svrepdesign. That's exactly what the database-backed functions do for svydesign objects.

[If you only ever want to use a small subset of the variables, it's even easier: drop all the extraneous variables and create a svrepdesign with the variables you want]

-thomas

Thomas Lumley, Assoc. Professor, Biostatistics
tlumley at u.washington.edu, University of Washington, Seattle
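The steps in Thomas's reply can be sketched roughly as follows. This is a minimal illustration, not code from the thread: it uses a small synthetic stand-in for data08, and the columns id, income and age are hypothetical names chosen for the example (only mainwgt and the repwgt prefix come from the original post). It assumes the survey, DBI and RSQLite packages are installed, and it assumes the data has a unique row identifier so that rows can be pulled back from the database in the same order as the design.

```r
library(survey)
library(DBI)
library(RSQLite)

# Synthetic stand-in for data08 (hypothetical columns: id, income, age).
set.seed(1)
n <- 200
data08 <- data.frame(
  id      = seq_len(n),
  mainwgt = runif(n, 50, 150),
  income  = rnorm(n, 50000, 10000),
  age     = sample(18:80, n, replace = TRUE)
)
# Four made-up replicate weights, just so the design can be built.
for (r in 1:4) {
  data08[[paste0("repwgt", r)]] <-
    data08$mainwgt * sample(c(0.5, 1.5), n, replace = TRUE)
}

# 1. Put the full data into SQLite (use a file path instead of
#    ":memory:" for real data, so it lives outside R's memory).
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "data08", data08)

# 2. Keep only the row id and the weights in memory.
keep  <- c("id", "mainwgt", grep("^repwgt", colnames(data08), value = TRUE))
small <- data08[, keep]
rm(data08)

# 3. Create the replicate-weight design from the reduced data set.
brr.dsgn <- svrepdesign(
  data = small,
  repweights = small[, grep("^repwgt", colnames(small))],
  type = "BRR",
  combined.weights = TRUE,
  weights = small$mainwgt
)

# 4. Load the analysis variables in the same row order and place them
#    in the $variables component of the design.
vars <- dbGetQuery(con, "SELECT id, income, age FROM data08 ORDER BY id")
stopifnot(all(vars$id == small$id))
brr.dsgn$variables <- vars

est <- svymean(~income, brr.dsgn)
print(est)

dbDisconnect(con)
```

The row-order check in step 4 matters: the weights stay in the design object while the variables are swapped in from the database, so a mismatch in ordering would silently pair observations with the wrong weights.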
Carlos J. Gil Bellosta
2009-Oct-24 13:37 UTC
[R] Memory Problems with CSV and Survey Objects
Hello,

Adding to Thomas' email, you could also use the package colbycol, which lets you load into R files that a simple read.table cannot cope with, study columns independently, select those you are most interested in and, finally, set up a data frame with just those columns.

It is the same strategy Thomas suggested, only without the requirement of an external tool, and using almost the same syntax you would use if you had no memory problems.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
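The column-selection idea can also be illustrated in base R, without any extra package. The sketch below does not use colbycol's own functions; it swaps in the colClasses argument of read.csv, where a class of "NULL" tells R to skip a column entirely. The tiny CSV and the income and notes columns are made up for the example; only mainwgt and the repwgt prefix come from the original post.

```r
# Write a small CSV to a temporary file so the sketch runs on its own.
csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    mainwgt = c(1.5, 2.0, 2.5),
    repwgt1 = c(3, 0, 5),
    repwgt2 = c(0, 4, 5),
    income  = c(10, 20, 30),     # hypothetical analysis variable
    notes   = c("a", "b", "c")   # hypothetical column we do not need
  ),
  csv,
  row.names = FALSE
)

# Read a single row just to learn the column names.
hdr <- names(read.csv(csv, header = TRUE, nrows = 1))

# Keep the weights and the variables of interest; skip everything else.
want    <- hdr == "mainwgt" | grepl("^repwgt", hdr) | hdr %in% c("income")
classes <- ifelse(want, NA_character_, "NULL")  # NA = let R guess the type

data08.small <- read.csv(csv, header = TRUE, colClasses = classes)
str(data08.small)
```

Skipped columns are never stored in the resulting data frame, so the object passed to svrepdesign() only carries the weights and the variables actually needed.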