Dear all,

A few weeks ago, I asked this list why small Stata files became huge R files. Thomas Lumley said it was because "Stata uses single-precision floating point by default and can use 1-byte and 2-byte integers. R uses double precision floating point and four-byte integers." And it seemed I couldn't do anything about it.

Is it true? I mean, isn't there a (more or less simple) way to change how R stores data (maybe by changing the source code and compiling it)?

The reason I insist on this point is that I am trying to work with a data frame with more than 820,000 observations and 80 variables. The Stata file is 150 MB. With my Pentium IV 2 GHz with 1 GB of RAM, running Windows XP, I couldn't do the import using the read.dta() function from package foreign. With Stat Transfer I managed to convert the Stata file to an S file of 350 MB, but my machine still didn't manage to import it using read.S().

I even tried to "increase" my memory with memory.limit(4000), but it still didn't work.

Regardless of the answer to my question, I'd appreciate hearing about your experience/suggestions for working with big files in R.

Thank you for youR-Help,

Dimitri Szerman
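For a sense of scale, a rough back-of-the-envelope sketch in R (assuming all 80 variables end up stored as 8-byte doubles, which is R's default for numeric data):

    n.obs  <- 820000                # observations
    n.vars <- 80                    # variables
    bytes  <- n.obs * n.vars * 8    # 8 bytes per double-precision value
    bytes / 1024^2                  # about 500 MB for the data alone

Import routines generally need the raw file contents, the finished data frame, and some intermediate copies in memory at the same time, so peak usage can easily be two or three times that figure -- more than a 1 GB machine can supply. And on 32-bit Windows a single process cannot address more than about 2-3 GB in any case, so memory.limit(4000) cannot actually deliver 4 GB.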
What you propose is not really a solution, as even if your data set didn't break the modified precision, another would. And of course, there is a price to be paid for reduced numerical precision.

The real issue is that R's current design is incapable of dealing with data sets larger than what can fit in physical memory (expert comment/correction?). My understanding is that there is no way to change this without a fundamental redesign of R. This means that you must either live with R's limitations or use other software for "large" data sets.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
From: Berton Gunter
> This means that you must either live with R's limitations or use
> other software for "large" data sets.

Or spend about $80 to buy a gig of RAM...

Andy
On Fri, 3 Mar 2006, Dimitri Joe wrote:
> Is it true? I mean, isn't there a (more or less simple) way to change
> how R stores data (maybe by changing the source code and compiling it)?

It's not impossible, but it really isn't as easy as you might think. It would be relatively easy to change the definition of REALSXPs and INTSXPs so that they stored 4-byte and 2-byte data respectively. It would be a lot harder to go through all the C and Fortran numerical, input/output, and other processing code to either translate from short to long data types or to make the code work for short data types. For example, the math functions would want to do computations in double (as Stata does), but the input/output functions would presumably want to use float.

Adding two more SEXP types to give, e.g., "single" and "shortint" might be easier (if there are enough bits left in the SEXPTYPE header), but would still require adding code to nearly every C function in R.

Single-precision floating point has been discussed for R in the past, and the extra effort and resulting larger code were always considered too high a price. Since the size of data set R can handle doubles every 18 months or so without any effort on our part, it is hard to motivate diverting effort away from problems that will not solve themselves. This doesn't help you, of course, but it may help explain why we can't.

Another thing that might be worth pointing out: Stata also keeps all its data in memory and so can handle only "small" data sets. One reason that Stata is so fast, and that Stata's small data sets can be larger than R's, is the more restrictive language. This is more important than the compression from smaller data types -- you can use a data set in Stata that is nearly as large as available memory (or address space), which is a factor of 3-10 better than R manages. On the other hand, for operations that do not fit well with the Stata language structure, it is quite slow. For example, the new Stata graphics in version 8 required some fairly significant extensions to the language and are still notably slower than the lattice graphics in R (a reasonably fair comparison, since both are interpreted code).

The terabyte-scale physics and astronomy data that other posters alluded to require a much more restrictive form of programming than R to get reasonable performance. R does not make you worry about how your data are stored and which data access patterns are fast or slow, but if your data are larger than memory you have to worry about these things. The difference between one-pass and multi-pass algorithms, between O(n) and O(n^2) time, even between sequential-access and random-access algorithms all matter, and the language can't hide them. Fortunately, most statistical problems are small enough to solve by throwing computing power at them, perhaps after an initial subsampling or aggregating phase.

The initial question was about read.dta. Now, read.dta() could almost certainly be improved a lot, especially for wide data sets. It uses very inefficient data frame operations to handle factors, for example. It used to be a lot faster than read.table, but that was before Brian Ripley improved read.table.

    -thomas
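To make the one-pass idea concrete, here is a rough sketch of processing a file that is too big to load all at once. The file name "bigdata.txt", the 10,000-row chunk size, and the 'income' column are invented for illustration, and it assumes a whitespace-delimited text file with a header line and all-numeric columns:

    con <- file("bigdata.txt", open = "r")                 # hypothetical file
    hdr <- scan(con, what = "", nlines = 1, quiet = TRUE)  # column names from the header line

    sum.x <- 0
    n.x   <- 0
    repeat {
        # read the next block of rows; at end of file read.table raises an
        # error, which try() turns into a signal to stop
        chunk <- try(read.table(con, header = FALSE, nrows = 10000,
                                col.names = hdr, colClasses = "numeric"),
                     silent = TRUE)
        if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
        sum.x <- sum.x + sum(chunk$income)   # accumulate a running total
        n.x   <- n.x + nrow(chunk)
    }
    close(con)
    sum.x / n.x                              # mean computed in a single pass

Only one 10,000-row chunk is ever held in memory, which is what makes this feasible for files larger than RAM -- but it also shows how the code has to be organised around the access pattern rather than around the statistics. Supplying colClasses and an nrows estimate tends to help read.table even for files that do fit in memory, since it avoids type guessing and over-allocation.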