Jessica Z
2007-Aug-22 01:25 UTC
[R] tackle memory insufficiency for large dataset using save() & load()??
Hello List,

I have been agonizing over this for days; any reply would be greatly appreciated!

Situation:___________________________________
My original dataset is a .csv file (2M records) with 4 variables:

  job_id (primary key; won't be used for analysis, only for joining tables),
  sector_id (categorical variable, for 19 industry sectors),
  sqft (continuous variable for square footage),
  building_type (categorical, for 2 building types)

Some values of sqft were entered incorrectly, so I'd like to set sqft<1 to NA and then use aregImpute() to impute those NAs.

Problem: the original dataset (.csv format) is too large. Although I could read the dataset into R, I could not get aregImpute() to run even after I set the memory limit to 3G! (Yes, I did the switch in Windows to reach 3G rather than 2G.)

Goal: find a way to slim down my dataset so that aregImpute() will run.

What I did:________________________________
I searched the archive and found someone saying that, as R tends to inflate memory, it is a good idea to first read the original dataset into R, then save it as a more compact binary file using save(), and then reload the compact binary file into R using load(). This would supposedly reduce the memory allocation.

HOWEVER, after I saved my original dataset into a compact binary file using save() and used load("filename.Rdata") to reload it into R, I could not figure out how to retrieve my variables! R shows the new dataset is not a list, a matrix, or a data frame, but just a character vector of length 1, and there is no way I could attach() it.

I generated a 1K-row subset of my original dataset to illustrate the problem (does anyone know how to get my four variables back from this "compact binary" dataset? What did I do wrong?):

> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft         building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
>
> attach(data)
> sqft[sqft<1] <- NA
> sector.f <- as.factor(sector_id)
> building_type.f <- as.factor (building_type)
> d <- data.frame(job_id,sector.f,sqft, building_type.f)
> summary (d)
     job_id       sector.f      sqft         building_type.f
 Min.   :   1.0   6 :340   Min.   :  3.00   1:  4
 1st Qu.: 250.8   11:505   1st Qu.:  4.00   2:996
 Median : 500.5   12:155   Median :  4.00
 Mean   : 500.5            Mean   : 14.16
 3rd Qu.: 750.3            3rd Qu.: 17.00
 Max.   :1000.0            Max.   :192.00
                           NA's   :118.00
> save (d, file="compact_d.Rdata", ascii=FALSE)
>
> newdata <- load ("compact_d.Rdata")
>
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error in attach(newdata) : file 'd' not found
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>
_________________________________
BTW, I also tried just saving (into compact binary) and reloading (the new compact binary format), skipping the recoding (as I could do the "NA" step in SQL anyhow). However, I still got stuck at the same spot:

> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft         building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
> save (data, file="compact_data.Rdata", ascii=FALSE)
> newdata <- load ("compact_data.Rdata")
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error: restore file may be empty -- no data loaded
In addition: Warning message:
file 'data' has magic number ''
   Use of save versions prior to 2 is deprecated
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>
Gabor Grothendieck
2007-Aug-22 01:48 UTC
[R] tackle memory insufficiency for large dataset using save() & load()?
See ?save . The ... arguments are the ***names*** of the objects, not the objects so you want save("d", ...whatever...) not save(d, ...whatever...) . Also don't use attach and detach and read this about factors which applies if your factor has many levels but can be ignored if not: http://www.mail-archive.com/r-help at stat.math.ethz.ch/msg92970.html

On 8/21/07, Jessica Z <jessica_uw2000 at yahoo.com> wrote:
[snip]
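
The advice to avoid attach() can be illustrated with a minimal sketch. It assumes a small made-up data frame in place of the poster's CSV: only the column names come from the original post, while the object name dat and all values are invented for illustration. Indexing the columns directly writes the changes into the data frame itself, whereas after attach(data) an assignment such as sqft[sqft<1] <- NA creates a separate copy of sqft in the workspace and leaves data$sqft unchanged.

```r
# Hypothetical stand-in for the poster's CSV; all values are made up.
dat <- data.frame(
  job_id        = 1:6,
  sector_id     = c(6, 11, 12, 6, 11, 11),
  sqft          = c(0, 3, 4, 0.5, 192, 17),
  building_type = c(1, 2, 2, 2, 1, 2)
)

# Recode implausible square footage to NA by indexing the column directly,
# so the change lands in the data frame rather than in a detached copy.
dat$sqft[dat$sqft < 1] <- NA

# Convert the categorical columns to factors in place.
dat$sector_id     <- factor(dat$sector_id)
dat$building_type <- factor(dat$building_type)

str(dat)
```

Working on the columns in place also avoids building a second data frame from loose vectors, which may matter when the full 2M-row dataset is already close to the memory limit.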
Rolf Turner
2007-Aug-22 02:40 UTC
[R] tackle memory insufficiency for large dataset using save() & load()?
On 22/08/2007, at 1:48 PM, Gabor Grothendieck wrote:

> See ?save . The ... arguments are the ***names*** of the objects, not
> the objects
> so you want save("d", ...whatever...) not save(d, ...whatever...) .

I think this is wrong. You want the objects not their names. If you want to make use of object names, use the list argument. I.e.

    save(melvin,clyde,file="irving")

and

    save(list=c("melvin","clyde"),file="irving")

accomplish the same thing.

cheers,

Rolf Turner
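
The equivalence described above can be checked with a short sketch. The objects melvin and clyde take their names from the message, but their values, the temporary file names, and the environments e1, e2, e3 are made up for illustration. The third call is an addition not discussed in the message: per ?save, the ... arguments may be given as symbols or as character strings, so quoting the names there should behave the same way.

```r
# Made-up objects; only the names 'melvin' and 'clyde' come from the thread.
melvin <- 1:3
clyde  <- letters[1:2]

f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")
f3 <- tempfile(fileext = ".RData")

save(melvin, clyde, file = f1)                 # objects given as symbols
save(list = c("melvin", "clyde"), file = f2)   # names given via 'list'
save("melvin", "clyde", file = f3)             # names given as strings in '...'

# Load each file into its own environment so the results can be compared.
e1 <- new.env(); n1 <- load(f1, envir = e1)
e2 <- new.env(); n2 <- load(f2, envir = e2)
e3 <- new.env(); n3 <- load(f3, envir = e3)

# Same object names restored from every file ...
identical(sort(n1), sort(n2))
identical(sort(n1), sort(n3))

# ... and the same contents.
identical(mget(sort(n1), envir = e1), mget(sort(n2), envir = e2))
identical(mget(sort(n1), envir = e1), mget(sort(n3), envir = e3))
```

If all four comparisons come back TRUE, the way the objects were named in the save() call was not the cause of the original problem.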
Charles C. Berry
2007-Aug-22 18:31 UTC
[R] tackle memory insufficiency for large dataset using save() & load()??
On Tue, 21 Aug 2007, Jessica Z wrote:

[snip]

I did not notice a comment on this bit in the other replies:

>> newdata <- load ("compact_d.Rdata")
>> summary(newdata)
>    Length     Class      Mode
>         1 character character

newdata is a string whose value is 'd'

try

    print( newdata )

ls() should tell you there are two objects - 'd' and 'newdata'

So just continue using 'd', e.g.

    summary( d )

HTH,

Chuck

[snip]

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
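
A minimal sketch of the behaviour described above: load() restores the saved objects under their original names as a side effect and returns only a character vector of those names. The data frame contents, the temporary file, and the names e, nm, and restored are made up for illustration.

```r
# Made-up stand-in for the poster's data frame 'd'.
d <- data.frame(job_id = 1:3, sqft = c(NA, 3, 4))

f <- tempfile(fileext = ".RData")
save(d, file = f)
rm(d)                     # remove it so we can see load() bring it back

newdata <- load(f)        # restores 'd' into the workspace as a side effect
print(newdata)            # the name(s) of the restored objects, not the data
summary(d)                # the data frame is available under its old name

# To pick a different name, load into a separate environment
# and fetch the object by the name that load() reported.
e  <- new.env()
nm <- load(f, envir = e)
restored <- get(nm[1], envir = e)
```

This may also explain the attach() errors in the original post: when attach() is given a character string, it treats the string as the path of a file created with save(), so attach(newdata) went looking for a file called 'd' (or 'data') rather than for the data frame.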