thr3ads.net - R help - [R] tackle memory insufficiency for large dataset using save() & load()?? [Aug 2007]

Jessica Z

2007-Aug-22 01:25 UTC

[R] tackle memory insufficiency for large dataset using save() & load()??

Hello List, I have been agonizing over this for days; any reply would be greatly
appreciated!

  Situation:___________________________________
My original dataset is a .csv file (2M records) with 4 variables:
job_id (primary key; not used for analysis, only for joining tables),
sector_id (categorical variable, 19 industry sectors),
sqft (continuous variable, square footage),
building_type (categorical, 2 building types)
  Some values of sqft were entered incorrectly, so I'd like to set sqft<1 to
NA and then use aregImpute() to impute those NAs.

  Problem: the original dataset (.csv format) is too large. Although I could read
it into R, I could not get aregImpute() to run even after setting the memory
limit to 3G! (Yes, I set the Windows switch to reach 3G rather than 2G.)

  Goal: find a way to slim down my dataset so that aregImpute() will run.

  What I did:________________________________
  I searched the archive and found a suggestion that, because R tends to inflate
memory, it is a good idea to first read the original dataset into R, then
save it as a more compact binary file using save(), and then reload the
compact binary file using load(). This is supposed to reduce the memory
allocation.

  HOWEVER, after I saved my original dataset to a compact binary file using
save() and reloaded it with load("filename.Rdata"), I could not figure out
how to retrieve my variables! R shows the new dataset is not a list, a matrix,
or a data frame, but just a character vector of length 1, and attach() fails
on it.

  I generated a 1K-row subset of my original dataset to illustrate my
problem (does anyone know how to get my four variables back from this
"compact binary" dataset? What did I do wrong?):
> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft        building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
>
> attach(data)
> sqft[sqft<1] <- NA
> sector.f <- as.factor(sector_id)
> building_type.f <- as.factor (building_type)
> d <- data.frame(job_id,sector.f,sqft, building_type.f)
> summary (d)
     job_id       sector.f      sqft        building_type.f
 Min.   :   1.0   6 :340   Min.   :  3.00   1:  4
 1st Qu.: 250.8   11:505   1st Qu.:  4.00   2:996
 Median : 500.5   12:155   Median :  4.00
 Mean   : 500.5            Mean   : 14.16
 3rd Qu.: 750.3            3rd Qu.: 17.00
 Max.   :1000.0            Max.   :192.00
                           NA's   :118.00
> save (d, file="compact_d.Rdata", ascii=FALSE)
>
> newdata <- load ("compact_d.Rdata")
>
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error in attach(newdata) : file 'd' not found
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>
  _________________________________
BTW, I also tried just saving (to compact binary) and reloading (the new compact
binary format), skipping the NA step (I could do the "NA" step in SQL anyhow).
However, I still got stuck at the same spot:

> data <- read.table (file.choose(),header=T,sep=",")
> summary(data)
     job_id         sector_id           sqft        building_type
 Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
 1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
 Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
 Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
 3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
 Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
> save (data, file="compact_data.Rdata", ascii=FALSE)
> newdata <- load ("compact_data.Rdata")
> summary(newdata)
   Length     Class      Mode
        1 character character
> attach(newdata)
Error: restore file may be empty -- no data loaded
In addition: Warning message:
file 'data' has magic number ''
   Use of save versions prior to 2 is deprecated
> is.data.frame (newdata)
[1] FALSE
> is.list (newdata)
[1] FALSE
> is.matrix (newdata)
[1] FALSE
>

Gabor Grothendieck

2007-Aug-22 01:48 UTC

See ?save .  The ... arguments are the ***names*** of the objects, not
the objects, so you want save("d", ...whatever...) not save(d, ...whatever...) .
Also, don't use attach and detach, and read this about factors, which applies
if your factor has many levels but can be ignored if not:
http://www.mail-archive.com/r-help at stat.math.ethz.ch/msg92970.html
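
The advice to avoid attach can be illustrated with a minimal sketch. With attach(data) in effect, an assignment such as sqft[sqft<1] <- NA creates a modified copy of sqft in the workspace and leaves the column inside the data frame untouched, so indexing the data frame directly is less error-prone. The small data frame below is a hypothetical stand-in (only the column names come from the thread), and the cleaned columns keep their original names rather than the ".f" suffix used in the post.

```r
# Hypothetical stand-in for the poster's data; only the column names
# are taken from the thread, the values are made up for illustration.
data <- data.frame(
  job_id        = 1:6,
  sector_id     = c(6, 11, 12, 6, 11, 11),
  sqft          = c(0, 3, 4, 192, 0, 17),
  building_type = c(1, 2, 2, 2, 1, 2)
)

# Modify the columns in place instead of attach()-ing the data frame
# and editing free-floating copies of its columns.
data$sqft[data$sqft < 1] <- NA
data$sector_id     <- factor(data$sector_id)
data$building_type <- factor(data$building_type)

# Same four variables as in the post, now in one self-consistent object.
d <- data[, c("job_id", "sector_id", "sqft", "building_type")]
summary(d)
```

Because nothing is attached, there is only one copy of each variable to keep track of, and nothing needs to be detached afterwards.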


Rolf Turner

2007-Aug-22 02:40 UTC

On 22/08/2007, at 1:48 PM, Gabor Grothendieck wrote:
> See ?save .  The ... arguments are the ***names*** of the objects, not
> the objects
> so you want save("d", ...whatever...) not save(d, ...whatever...) .

	I think this is wrong.  You want the objects not their names.

	If you want to make use of object names, use the list argument.

	I.e.

		save(melvin,clyde,file="irving")

	and

		save(list=c("melvin","clyde"),file="irving")

	accomplish the same thing.
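
	A small check of this equivalence, as a sketch: the object names are the ones used above, while the temporary files and the two scratch environments are assumptions added so the example does not touch the working directory or the workspace. Per ?save, the ... arguments may be given as symbols or as character strings, so save("d", file=...) and save(d, file=...) both store the object d.

```r
melvin <- 1:3
clyde  <- c("a", "b")

# Temporary files stand in for "irving" so nothing is written to the
# working directory (an assumption for the sake of the example).
f1 <- tempfile(fileext = ".RData")
f2 <- tempfile(fileext = ".RData")

save(melvin, clyde, file = f1)                 # objects given as symbols
save(list = c("melvin", "clyde"), file = f2)   # objects given by name

# Load each file into its own environment and compare the contents.
e1 <- new.env()
e2 <- new.env()
names1 <- load(f1, envir = e1)
names2 <- load(f2, envir = e2)

setequal(names1, names2)
identical(e1$melvin, e2$melvin) && identical(e1$clyde, e2$clyde)
```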

				cheers,

					Rolf Turner

Charles C. Berry

2007-Aug-22 18:31 UTC

On Tue, 21 Aug 2007, Jessica Z wrote:


[snip]


I did not notice a comment on this bit in the other replies:

> > newdata <- load ("compact_d.Rdata")
> >
> > summary(newdata)
>   Length     Class      Mode
>        1 character character

newdata is a string whose value is 'd'

try print( newdata )

ls() should tell you there are two objects - 'd' and 'newdata'

So just continue using 'd', e.g.

 	summary( d )
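
The behaviour described above can be reproduced with a minimal sketch. load() restores the saved objects under their original names and returns only a character vector of those names, which is why newdata ended up as the string 'd'. The tiny data frame and the temporary files below are assumptions for illustration. The last part uses saveRDS()/readRDS(), which are available in later versions of R and return the object itself, closer to what the original post expected from load().

```r
# Tiny hypothetical data frame standing in for the poster's 'd'.
d <- data.frame(job_id = 1:3, sqft = c(NA, 3, 4))
f <- tempfile(fileext = ".RData")
save(d, file = f)
rm(d)

# load() recreates 'd' and returns the *names* of what it restored.
loaded <- load(f)
print(loaded)
summary(d)

# To bind the restored object to a name of your choosing, load into a
# scratch environment and fetch it by the returned name.
e <- new.env()
nm <- load(f, envir = e)
newdata <- get(nm[1], envir = e)
is.data.frame(newdata)

# Alternative for a single object: saveRDS()/readRDS() return the
# object directly, so the result can be assigned to any name.
f2 <- tempfile(fileext = ".rds")
saveRDS(d, f2)
newdata2 <- readRDS(f2)
```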

HTH,

Chuck

[snip]

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
