Dear R experts,

Is it true that R generally cannot handle medium-sized data sets (a couple of hundred thousand observations), and therefore large data sets (a couple of million observations)?

I googled and found many questions about this issue, but curiously there were no straightforward answers about what can be done to make R capable of handling such data.

Is there something inherent in the structure of R that makes it impossible to work with, say, 100 000 observations or more? If so, is there any hope that R can be fixed in the future?

My experience is rather limited---I tried to load a Stata data set of about 150 000 observations (which Stata handles instantly) using the library "foreign". After half an hour R was still "thinking", so I stopped the attempt.

Thank you in advance,

Gueorgui Kolev
Department of Economics and Business
Universitat Pompeu Fabra
Hello,

It is not true that R cannot handle matrices with hundreds of thousands of observations... but:

- Importation (typically with read.table() and the like) "saturates" much faster. Solution: use scan() and fill a preallocated matrix, or better, use a database (a sketch follows below this message).
- Data frames are very nice objects, but if you handle only numeric data, prefer matrices: they consume less memory. Also, avoid row/column names for very large matrices/data frames.
- Finally, of course, your mileage varies greatly depending on the calculations you do on your data.

In general, the fairly widespread idea that R cannot handle large data sets originates from using read.table(), data frames, and non-optimized code.

As an example, I can create a matrix of 150 000 observations (you don't tell us how many variables, so I took 20 columns) filled with random numbers, and calculate the mean of each variable very easily. Here it is:

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 168994  4.6     350000  9.4   350000  9.4
Vcells  62415  0.5     786432  6.0   290343  2.3
> system.time(a <- matrix(runif(150000 * 20), ncol = 20))
[1] 0.48 0.05 0.55   NA   NA
> # Just a little bit more than half a second to create a table of
> # 3 million entries filled with random numbers (P IV, 3 GHz, Win XP)
> dim(a)
[1] 150000     20
> system.time(print(colMeans(a)))
 [1] 0.4998859 0.5004760 0.4994155 0.5000711 0.5005029
 [6] 0.4999672 0.5003233 0.5000419 0.4997827 0.5004858
[11] 0.5004905 0.4993428 0.4991187 0.5000143 0.5016212
[16] 0.4988943 0.4990586 0.5009718 0.4997235 0.5001220
[1] 0.03 0.00 0.03   NA   NA
> # 30 milliseconds to calculate the mean of all 20
> # variables over 150 000 observations
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  169514  4.6     350000  9.4   350000  9.4
Vcells 3062785 23.4    9317558 71.1  9062793 69.2
> # Less than 30 Mb used (with a peak at about 80 Mb)

Isn't that manageable?

Best,

Philippe Grosjean
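A minimal sketch of the scan() approach from the first point above (here the whole file is read in one call and reshaped, rather than filled row by row); the file name "big.txt" and the 150 000 x 20 dimensions are assumptions for illustration, not from the original post:

## Read a large, purely numeric table with scan() instead of read.table().
## scan() returns a plain numeric vector, which is then reshaped into a matrix.
n.obs  <- 150000   # assumed number of observations
n.vars <- 20       # assumed number of variables

x <- matrix(scan("big.txt", what = double(), n = n.obs * n.vars),
            nrow = n.obs, ncol = n.vars, byrow = TRUE)

colMeans(x)   # column-wise operations on a numeric matrix stay fast

Because the result is a matrix rather than a data frame, this also avoids the extra memory overhead mentioned in the second point.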
My experience is that 100 000 observations shouldn't be a problem. Of course, it also depends on your computer configuration.

--
WenSui Liu (http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children's Hospital Medical Center
On Tue, 24 Jan 2006, Gueorgui Kolev wrote:

> I googled and found many questions about this issue, but curiously
> there were no straightforward answers about what can be done to make
> R capable of handling data.

Because it depends on the situation.

> My experience is rather limited---I tried to load a Stata data set of
> about 150 000 observations (which Stata handles instantly) using the
> library "foreign". After half an hour R was still "thinking", so I
> stopped the attempt.

Like Stata, R prefers to store all the data in memory, but because of R's flexibility it takes more memory than Stata does, and for simple analyses is slower. For simple analyses Stata probably needs only 10-20% as much memory as R on a given data set.

If you have a 64-bit version of R it can handle quite large data sets, certainly millions of records. On the other hand, an ordinary PC might well start to slow down noticeably with a few tens of thousands of reasonably complex records.

Often it is not necessary to store all the data in memory at once, and there are database interfaces to make this easier.

R (and S before it) have generally assumed that increasing computer power will solve a lot of problems more easily than programming would, and have generally been correct. If you want Stata, you know where to find it (and it's a good choice for many problems).

        -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
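A minimal sketch of the database-interface idea Thomas mentions, assuming the RSQLite package (which uses the DBI interface) and a hypothetical SQLite file "big.db" containing a table "obs" with a numeric column "income"; the point is that the database does the heavy lifting and only a small summary is pulled into R's memory:

library(RSQLite)   # provides the DBI methods for SQLite

con <- dbConnect(SQLite(), dbname = "big.db")

## The aggregation happens inside the database; R receives a one-row result
res <- dbGetQuery(con,
    "SELECT AVG(income) AS mean_income, COUNT(*) AS n FROM obs")
print(res)

dbDisconnect(con)

The same pattern works with any DBI back end, so the raw records never have to fit in RAM all at once.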
Dear Gueorgui,

> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations), and therefore large data
> sets (a couple of million observations)?

It depends on what you want to do with the data sets. Loading the data sets shouldn't be a problem, I think. But using the data sets for analysis with self-written R code can get (very) slow, since R is an interpreted language (correct me if I'm wrong). To increase speed you will often need to experiment with the R code. For example, what I've noticed is that processing data sets as matrices works much faster than as data frames. Writing your code in C(++), compiling it and calling it from your R code is often the best way.

HTH,
Martin
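A small timing sketch of the matrix-versus-data-frame point above; the sizes and the element-wise access pattern are arbitrary examples, not from the original post:

n <- 100000
m  <- matrix(rnorm(n * 5), ncol = 5)   # 100 000 x 5 numeric matrix
df <- as.data.frame(m)                 # the same data as a data frame

## Whole-column operations are cheap in both cases, but repeated
## element-wise indexing exposes the extra overhead of a data frame
system.time(for (i in 1:10000) m[i, 3])
system.time(for (i in 1:10000) df[i, 3])

When such loops cannot be vectorised, moving the inner loop to compiled C code via .C() or .Call(), as suggested above, is one way to get rid of that overhead.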