Jessica Z
2007-Aug-22 01:25 UTC
[R] tackle memory insufficiency for large dataset using save() & load()??
Hello List, i have been agonizing over this for days, any reply would be greatly
appreciated!
Situation:___________________________________
My original dataset is a .csv dataset (w/ 2M records) with 4 variables:
job_id (Primary key, won't be used for analysis, just used for join tables),
sector_id (categorical variable, for 19 industry sectors),
sqft (continuous variable for square footage),
building_type (categorical, for 2 building types)
some values of sqft were entered wrong, so i'd like to set sqft<1 to
NA and then use aregImpute() to impute those NAs.
Problem: the original dataset (.csv format) is too large. though i could read
that dataset into R, i could not get aregImpute() to run even after i set the
memory limit to 3G! (yes, i did the switch in windows to reach 3G rather than 2G)
Goal: try to find a way to slim down my dataset so as to get aregImpute()
running.
What i did:________________________________
i searched in the archive, and found someone said, as R tends to inflate
memory, it is a good idea to first read the original dataset into R--> then
save it as a more compact binary file using save() --> and then reload the
compact binary file back into R using load(). this way would reduce the memory
allocation.
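
One way to check the idea above is to compare object.size() before and after a save()/load() round trip. The following is only a minimal sketch on made-up data: the data frame `toy` and its values are assumptions for illustration, not the poster's file. It shows how to measure, not what the numbers will be on the real 2M-row dataset.

```r
# Hypothetical stand-in data; column names follow the post, values are made up.
set.seed(1)
toy <- data.frame(
  job_id        = seq_len(1000),
  sector_id     = sample(c(6, 11, 12), 1000, replace = TRUE),
  sqft          = round(runif(1000, 0, 200)),
  building_type = sample(1:2, 1000, replace = TRUE)
)

size_before <- object.size(toy)   # in-memory size before the round trip

f <- tempfile(fileext = ".RData")
save(toy, file = f)               # write a binary file to disk
rm(toy)
load(f)                           # restores 'toy' into the workspace

size_after <- object.size(toy)    # in-memory size after the round trip

print(size_before)
print(size_after)
print(file.size(f))               # size of the file on disk, in bytes
```

In current versions of R, save() compresses binary files by default, so the file on disk is usually smaller than the object in memory. That is a statement about disk space only; the comparison of size_before and size_after is what shows whether the reloaded object needs any less memory.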
HOWEVER, after i saved my original dataset into a compact binary file using
save(), and used load("filename.Rdata") to reload the new
compact data format into R, i could not figure out how to retrieve all my
variables!!! R shows the new dataset is not a list, nor a matrix, nor a
data frame, but just a character vector of length 1 !!! and there is no way i
could do attach().
i generated a 1K-row subset out of my original dataset to illustrate my
problem (does anyone know how to get my four variables back from this
"compact binary" new dataset? what did i do wrong?):
> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft        building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
>
> attach(data)
> sqft[sqft<1] <- NA
> sector.f <- as.factor(sector_id)
> building_type.f <- as.factor (building_type)
> d <- data.frame(job_id,sector.f,sqft, building_type.f)
> summary (d)
     job_id        sector.f       sqft        building_type.f
 Min.   :   1.0   6 :340    Min.   :  3.00   1:  4
 1st Qu.: 250.8   11:505    1st Qu.:  4.00   2:996
 Median : 500.5   12:155    Median :  4.00
 Mean   : 500.5             Mean   : 14.16
 3rd Qu.: 750.3             3rd Qu.: 17.00
 Max.   :1000.0             Max.   :192.00
                            NA's   :118.00
> save (d, file="compact_d.Rdata", ascii=FALSE)
>
> newdata <- load ("compact_d.Rdata")
>
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error in attach(newdata) : file 'd' not found
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>
_________________________________
btw, i also tried to just save (into compact binary) and reload (the new compact
binary data format) (as i could do the "NA" stuff in sql anyhow).
however, i still got stuck at the same spot:
> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft        building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
> save (data, file="compact_data.Rdata", ascii=FALSE)
> newdata <- load ("compact_data.Rdata")
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error: restore file may be empty -- no data loaded
In addition: Warning message:
file 'data' has magic number ''
   Use of save versions prior to 2 is deprecated
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>
Gabor Grothendieck
2007-Aug-22 01:48 UTC
[R] tackle memory insufficiency for large dataset using save() & load()?
See ?save . The ... arguments are the ***names*** of the objects, not
the objects, so you want save("d", ...whatever...) not save(d, ...whatever...) .
Also, don't use attach and detach. And read this about factors, which applies
if your factor has many levels but can be ignored if not:
http://www.mail-archive.com/r-help at stat.math.ethz.ch/msg92970.html
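
The advice to avoid attach() can be sketched as follows: work on the columns through the data frame itself, so the recoded sqft ends up inside the data frame rather than as a separate copy in the workspace. The data below are a made-up stand-in (an assumption for illustration); only the column names come from the original post.

```r
# Hypothetical six-row stand-in for the poster's data; values are made up.
data <- data.frame(
  job_id        = 1:6,
  sector_id     = c(6, 11, 12, 6, 11, 11),
  sqft          = c(0, 3, 4, 0, 17, 192),
  building_type = c(2, 2, 1, 2, 2, 2)
)

# Build the analysis data frame by referring to columns with `$`,
# so nothing depends on attach() and the search path.
d <- data.frame(
  job_id          = data$job_id,
  sector.f        = factor(data$sector_id),
  sqft            = data$sqft,
  building_type.f = factor(data$building_type)
)

# Recode implausible square footage as missing, inside the data frame.
d$sqft[d$sqft < 1] <- NA

summary(d)
```

Because every reference goes through `data$` or `d$`, it is always clear which copy of sqft is being modified.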
Rolf Turner
2007-Aug-22 02:40 UTC
[R] tackle memory insufficiency for large dataset using save() & load()?
On 22/08/2007, at 1:48 PM, Gabor Grothendieck wrote:

> See ?save . The ... arguments are the ***names*** of the objects, not
> the objects
> so you want save("d", ...whatever...) not save(d, ...whatever...) .

I think this is wrong. You want the objects not their names. If you
want to make use of object names, use the list argument. I.e.

	save(melvin,clyde,file="irving")

and

	save(list=c("melvin","clyde"),file="irving")

accomplish the same thing.

	cheers,

		Rolf Turner
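
The equivalence of the two calls can be checked with a short sketch. The object values and the temporary file names are assumptions made up for the illustration; melvin and clyde are the names used in the reply. The third call relies on the ?save documentation, which describes the ... arguments as names given "as symbols or character strings", so the quoted form should also work.

```r
# Made-up objects; only the names come from the reply above.
melvin <- 1:3
clyde  <- c("a", "b")

f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")
f3 <- tempfile(fileext = ".RData")

save(melvin, clyde, file = f1)                  # objects given as symbols
save(list = c("melvin", "clyde"), file = f2)    # names given via 'list'
save("melvin", "clyde", file = f3)              # names given as quoted strings

# Load each file into its own environment so the results can be compared.
e1 <- new.env(); e2 <- new.env(); e3 <- new.env()
n1 <- load(f1, envir = e1)
n2 <- load(f2, envir = e2)
n3 <- load(f3, envir = e3)

print(setequal(n1, n2) && setequal(n1, n3))     # same object names restored
print(identical(e1$melvin, e2$melvin))          # same contents
print(identical(e1$clyde, e3$clyde))
```

Loading into separate environments with the envir argument keeps the three restored copies from overwriting one another.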
Charles C. Berry
2007-Aug-22 18:31 UTC
[R] tackle memory insufficiency for large dataset using save() & load()??
On Tue, 21 Aug 2007, Jessica Z wrote:

[snip]

I did not notice a comment on this bit in the other replies:

> > newdata <- load ("compact_d.Rdata")
> > summary(newdata)
>    Length     Class      Mode
>         1 character character

newdata is a string whose value is 'd'

try

	print( newdata )

ls() should tell you there are two objects - 'd' and 'newdata'

So just continue using 'd', e.g.

	summary( d )

HTH,

Chuck

[snip]

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu             UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
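
The point above can be reproduced in a few lines. This is a minimal sketch with a made-up three-row data frame (an assumption for illustration). The last part uses saveRDS()/readRDS(), which is not mentioned in the thread and is only available in later versions of R; it is shown as an alternative for readers who want the `x <- read(...)` style the original post was reaching for.

```r
# Made-up data frame standing in for 'd' from the original post.
d <- data.frame(job_id = 1:3, sqft = c(NA, 4, 17))

f <- tempfile(fileext = ".RData")
save(d, file = f)
rm(d)

newdata <- load(f)     # restores 'd' as a side effect
print(newdata)         # [1] "d"   (the NAME of the restored object)
summary(d)             # the data frame itself is back under its old name

# If the name is only known at run time, fetch the object by name.
restored <- get(newdata)

# Alternative (assumes a version of R that provides saveRDS/readRDS):
# store a single object and assign it to any name on reading.
g <- tempfile(fileext = ".rds")
saveRDS(d, file = g)
newdata2 <- readRDS(g)
print(is.data.frame(newdata2))
```

load() returns the names of the objects it restored, not the objects, which is why summary(newdata) in the original post reported a character vector of length 1.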