Dear all,

A few weeks ago, I asked this list why small Stata files became huge R files. Thomas Lumley said it was because "Stata uses single-precision floating point by default and can use 1-byte and 2-byte integers. R uses double precision floating point and four-byte integers." And it seemed I couldn't do anything about it.

Is it true? I mean, isn't there a (more or less simple) way to change how R stores data (maybe by changing the source code and compiling it)?

The reason I insist on this point is that I am trying to work with a data frame with more than 820,000 observations and 80 variables. The Stata file is 150 MB. With my Pentium IV 2 GHz with 1 GB of RAM, running Windows XP, I couldn't do the import using the read.dta() function from package foreign. With Stat Transfer I managed to convert the Stata file to an S file of 350 MB, but my machine still didn't manage to import it using read.S().

I even tried to "increase" my memory with memory.limit(4000), but it still didn't work.

Regardless of the answer to my question, I'd appreciate hearing about your experience/suggestions for working with big files in R.

Thank you for youR-Help,

Dimitri Szerman
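For a sense of scale, a rough back-of-the-envelope sketch in R (assuming all 80 variables end up stored as 8-byte doubles, which is R's default for numeric data):

    n.obs  <- 820000                # observations
    n.vars <- 80                    # variables
    bytes  <- n.obs * n.vars * 8    # 8 bytes per double-precision value
    bytes / 1024^2                  # about 500 MB for the data alone

Import routines generally need the raw file contents, the finished data frame, and some intermediate copies in memory at the same time, so peak usage can easily be two or three times that figure -- more than a 1 GB machine can supply. And on 32-bit Windows a single process cannot address more than about 2-3 GB in any case, so memory.limit(4000) cannot actually deliver 4 GB.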
What you propose is not really a solution, as even if your data set didn't break the modified precision, another would. And of course, there is a price to be paid for reduced numerical precision.

The real issue is that R's current design is incapable of dealing with data sets larger than what can fit in physical memory (expert comment/correction?). My understanding is that there is no way to change this without a fundamental redesign of R. This means that you must either live with R's limitations or use other software for "large" data sets.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
From: Berton Gunter
> This means that you must either live with R's limitations or use
> other software for "large" data sets.

Or spend about $80 to buy a gig of RAM...

Andy
On Fri, 3 Mar 2006, Dimitri Joe wrote:
> Is it true? I mean, isn't there a (more or less simple) way to change
> how R stores data (maybe by changing the source code and compiling it)?

It's not impossible, but it really isn't as easy as you might think. It would be relatively easy to change the definition of REALSXPs and INTSXPs so that they stored 4-byte and 2-byte data respectively. It would be a lot harder to go through all the C and Fortran numerical, input/output, and other processing code to either translate from short to long data types or to make the code work for short data types. For example, the math functions would want to do computations in double (as Stata does), but the input/output functions would presumably want to use float.

Adding two more SEXP types to give, e.g., "single" and "shortint" might be easier (if there are enough bits left in the SEXPTYPE header), but would still require adding code to nearly every C function in R.

Single-precision floating point has been discussed for R in the past, and the extra effort and resulting larger code were always considered too high a price. Since the size of data set R can handle doubles every 18 months or so without any effort on our part, it is hard to motivate diverting effort away from problems that will not solve themselves. This doesn't help you, of course, but it may help explain why we can't.

Another thing that might be worth pointing out: Stata also keeps all its data in memory and so can handle only "small" data sets. One reason that Stata is so fast, and that Stata's small data sets can be larger than R's, is the more restrictive language. This is more important than the compression from smaller data types -- you can use a data set in Stata that is nearly as large as available memory (or address space), which is a factor of 3-10 better than R manages. On the other hand, for operations that do not fit well with the Stata language structure, it is quite slow. For example, the new Stata graphics in version 8 required some fairly significant extensions to the language and are still notably slower than the lattice graphics in R (a reasonably fair comparison, since both are interpreted code).

The terabyte-scale physics and astronomy data that other posters alluded to require a much more restrictive form of programming than R to get reasonable performance. R does not make you worry about how your data are stored and which data access patterns are fast or slow, but if your data are larger than memory you have to worry about these things. The difference between one-pass and multi-pass algorithms, between O(n) and O(n^2) time, even between sequential-access and random-access algorithms all matter, and the language can't hide them. Fortunately, most statistical problems are small enough to solve by throwing computing power at them, perhaps after an initial subsampling or aggregating phase.

The initial question was about read.dta. Now, read.dta() could almost certainly be improved a lot, especially for wide data sets. It uses very inefficient data frame operations to handle factors, for example. It used to be a lot faster than read.table, but that was before Brian Ripley improved read.table.

    -thomas
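To make the one-pass idea concrete, here is a rough sketch of processing a file that is too big to load all at once. The file name "bigdata.txt", the 10,000-row chunk size, and the 'income' column are invented for illustration, and it assumes a whitespace-delimited text file with a header line and all-numeric columns:

    con <- file("bigdata.txt", open = "r")                 # hypothetical file
    hdr <- scan(con, what = "", nlines = 1, quiet = TRUE)  # column names from the header line

    sum.x <- 0
    n.x   <- 0
    repeat {
        # read the next block of rows; at end of file read.table raises an
        # error, which try() turns into a signal to stop
        chunk <- try(read.table(con, header = FALSE, nrows = 10000,
                                col.names = hdr, colClasses = "numeric"),
                     silent = TRUE)
        if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
        sum.x <- sum.x + sum(chunk$income)   # accumulate a running total
        n.x   <- n.x + nrow(chunk)
    }
    close(con)
    sum.x / n.x                              # mean computed in a single pass

Only one 10,000-row chunk is ever held in memory, which is what makes this feasible for files larger than RAM -- but it also shows how the code has to be organised around the access pattern rather than around the statistics. Supplying colClasses and an nrows estimate tends to help read.table even for files that do fit in memory, since it avoids type guessing and over-allocation.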