Hello,

For my present project I need to use the data stored in a ca. 100 MB Stata dataset. When I import the data in R using:

library("foreign")
x <- read.dta("mydata.dta")

I find that R needs a startling 665 MB of memory! (In Stata I can simply allocate, say, 128 MB of memory and go ahead.)

Is there any way around this, or should I forget R for analysis of datasets of this magnitude?

Thanks for your help with this,
Edwin
Edwin Leuven <e.leuven at uva.nl> writes:

> For my present project I need to use the data stored in a ca. 100 MB
> Stata dataset.
>
> When I import the data in R using:
>
> library("foreign")
> x <- read.dta("mydata.dta")
>
> I find that R needs a startling 665 MB of memory!
>
> (In Stata I can simply allocate, say, 128 MB of memory and go ahead.)
>
> Is there any way around this, or should I forget R for analysis of
> datasets of this magnitude?

What does the 665 MB represent? Did you try doing a garbage collection after you had done the import? I would suggest

library("foreign")
x <- read.dta("mydata.dta")
gc()            # possibly repeat gc() to lower the thresholds
object.size(x)  # the actual storage (in bytes) allocated to this object
save(x, file = "mydata.rda", compress = TRUE)

After that you can start a new session and use

load("mydata.rda")

to obtain a copy of the data set without the storage overhead incurred by the Stata -> R conversion.

P.S. As described in the help page for object.size, the returned value is more properly described as an estimate of the object size, because sometimes it is difficult to determine the object size accurately.
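To make the round trip concrete, here is a minimal sketch of the fresh-session step (the file and object names are taken from the thread; the units argument to format() assumes a reasonably recent version of R):

## In a new R session: reload the saved copy and check its size.
load("mydata.rda")                    # restores the object 'x'
gc()                                  # report memory use after loading
format(object.size(x), units = "MB")  # size estimate for the data frame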
Is the ca. 100 MB the size of the .dta file, or the size of the data when loaded into Stata? Or is there no difference? Have you checked the size of the .rda file created as Doug suggested? I'd be curious to see what that is...

Andy

> From: Edwin Leuven
>
> > What does the 665 MB represent? Did you try doing a garbage
> > collection after you had done the import?
>
> I didn't (sorry, R beginner).
>
> I followed your example and things look much better now;
> object.size(x) returns:
>
> 219,167,604
>
> which is about double the size of the same object in Stata, where it
> is:
>
> 104,882,604
>
> This leaves quite some room for improvement, but at least I can now
> handle the data on my laptop...
>
> Thanks for your quick response!
> Edwin
>
> > I would suggest
> >
> > library("foreign")
> > x <- read.dta("mydata.dta")
> > gc()            # possibly repeat gc() to lower the thresholds
> > object.size(x)  # the actual storage (in bytes) allocated to this object
> > save(x, file = "mydata.rda", compress = TRUE)
> >
> > After that you can start a new session and use
> >
> > load("mydata.rda")
> >
> > to obtain a copy of the data set without the storage overhead
> > incurred by the Stata -> R conversion.
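A rough sketch of one way to claw back part of that factor of two (assuming x is the data frame from the thread): find columns that read.dta promoted to double but that hold only whole numbers, and store them as integer, at 4 bytes per element instead of 8.

sizes <- sapply(x, object.size)       # bytes used by each column
head(sort(sizes, decreasing = TRUE))  # the largest columns
## Columns holding only whole numbers within integer range can be
## stored as integer without losing information.
whole <- sapply(x, function(v)
    is.double(v) &&
    all(v == trunc(v), na.rm = TRUE) &&
    all(abs(v) < .Machine$integer.max, na.rm = TRUE))
x[whole] <- lapply(x[whole], as.integer)
object.size(x)                        # should now be noticeably smaller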
On Mon, 29 Mar 2004, Edwin Leuven wrote:

> > Stata tends to store data as float, integer, or even byte where
> > appropriate for the precision. This is one source of space saving. A
> > factor of 2 is not atypical.
>
> I was suspecting something like this.
>
> What does R do? Default to double always (or something like that)?
>
> Is this a deliberate design choice made by the R people, or just a
> convenience of not having to worry about data types?

R has only numeric (C double) and integer (C int) types for storing numeric data. I think the reason for not having single precision is insufficient accuracy in computation. Shorter integers might be useful sometimes.

-thomas
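A self-contained illustration of what that means for storage (the printed sizes are approximate and include a small per-object overhead):

n <- 1e6
object.size(integer(n))  # about 4 MB: an R integer takes 4 bytes
object.size(double(n))   # about 8 MB: an R double takes 8 bytes
## Stata's 4-byte float and 1-byte byte types have no R counterpart,
## so read.dta stores such columns as 8-byte doubles -- which is why
## a factor of 2 is not atypical.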