Hi there,

I'm very new to R and am only in the beginning stages of investigating it for possible use. A document by John Maindonald at the r-project website entitled "Using R for Data Analysis and Graphics: Introduction, Code and Commentary" contains the following paragraph: "The R system may struggle to handle very large data sets. Depending on available computer memory, the processing of a data set containing one hundred thousand observations and perhaps twenty variables may press the limits of what R can easily handle." This document was written in 2004.

My questions are:

Is this still the case? If so, has anyone come up with creative solutions to mitigate these limitations? If you work with large data sets in R, what have your experiences been?

From what I've seen so far, R seems to have enormous potential and capabilities. I routinely work with data sets of several hundred thousand to several million records. It would be unfortunate if such potential and capabilities were not realized because of (effective) data set size limitations.

Please tell me it ain't so.

Thanks for any help or suggestions.

Carl
The restriction is that R keeps its objects in memory, so if you have sufficient memory and your OS lets you access it, you should be OK. S-Plus is a commercial package similar to R, but it stores its objects in files and can handle larger data sets if you do run into trouble.

Given that R is free and, once downloaded, can be installed on Windows in a minute or so (I assume it's just as easy on other OSes), just install it, generate some test data, and see whether you have any problems. For example, I had no trouble running the following on my PC:

n <- 100000
p <- 20
x <- matrix(rnorm(n * p), n)
colnames(x) <- letters[1:p]

# regress column a against the rest
x.lm <- lm(a ~ ., as.data.frame(x))

plot(x.lm)   # click mouse to advance to successive plots
summary(x.lm)
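If you want a rough sense of how much memory a data set of a given size will need before committing to an analysis, base R's object.size() and gc() can report that directly. This is only a minimal sketch to go with the test data above (my own illustration, not part of the original advice):

n <- 100000
p <- 20
x <- matrix(rnorm(n * p), n)

# doubles take 8 bytes each, so expect roughly n * p * 8 / 2^20, i.e. about 15 Mb
as.numeric(object.size(x)) / 2^20   # size of this one object, in megabytes
gc()                                # summary of memory currently used by R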
If you need to analyze something bigger than memory can hold, one option is the biglm package, which fits linear regression models (and a lot of different analyses can be restructured as linear regression models) on blocks of data, so that the entire data set is never in memory at the same time.

I tested it out with a database of over 23 million rows and it worked great. It computed the exact same answers (to about 7 decimal places; I didn't bother to look beyond that) as a couple of other methods used on the same values.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
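To make the block-by-block idea concrete, here is a minimal sketch of how biglm is typically used: biglm() fits the model on the first chunk and update() folds in each subsequent chunk, so only one chunk is ever in memory. The file name "bigdata.csv", the chunk size, and the columns y, a, b are placeholders for illustration; they are not from the original post.

library(biglm)

chunk.size <- 100000
con <- file("bigdata.csv", open = "r")

# first chunk (with the header) initialises the model
chunk <- read.csv(con, nrows = chunk.size)
nms <- names(chunk)
fit <- biglm(y ~ a + b, data = chunk)

# feed the remaining chunks to update(); read.csv() errors when the
# connection is exhausted, so try() is used to detect the end of the file
repeat {
  chunk <- try(read.csv(con, header = FALSE, col.names = nms,
                        nrows = chunk.size), silent = TRUE)
  if (inherits(chunk, "try-error")) break
  fit <- update(fit, chunk)
}
close(con)

summary(fit)
coef(fit)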