Dear R experts,

Is it true that R generally cannot handle medium-sized data sets (a couple of hundred thousand observations), and therefore large data sets (a couple of million observations)?

I googled and found many questions about this issue, but curiously there were no straightforward answers about what can be done to make R capable of handling such data.

Is there something inherent in the structure of R that makes it impossible to work with, say, 100 000 observations or more? If so, is there any hope that R can be fixed in the future?

My experience is rather limited---I tried to load a Stata data set of about 150 000 observations (which Stata handles instantly) using the library "foreign". After half an hour R was still "thinking", so I stopped the attempt.

Thank you in advance,

Gueorgui Kolev
Department of Economics and Business
Universitat Pompeu Fabra
Hello,

It is not true that R cannot handle matrices with hundreds of thousands of observations... but:

- Importation (typically with read.table() and the like) "saturates" much faster. Solution: use scan() and fill a preallocated matrix, or better, use a database (a sketch follows below this message).
- Data frames are very nice objects, but if you handle only numeric data, prefer matrices: they consume less memory. Also, avoid row/column names for very large matrices/data frames.
- Finally, of course, your mileage varies greatly depending on the calculations you do on your data.

In general, the fairly widespread idea that R cannot handle large data sets originates from using read.table(), data frames, and non-optimized code.

As an example, I can create a matrix of 150 000 observations (you don't tell us how many variables, so I took 20 columns) filled with random numbers, and calculate the mean of each variable very easily. Here it is:

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 168994  4.6     350000  9.4   350000  9.4
Vcells  62415  0.5     786432  6.0   290343  2.3
> system.time(a <- matrix(runif(150000 * 20), ncol = 20))
[1] 0.48 0.05 0.55   NA   NA
> # Just a little bit more than half a second to create a table of
> # 3 million entries filled with random numbers (P IV, 3 GHz, Win XP)
> dim(a)
[1] 150000     20
> system.time(print(colMeans(a)))
 [1] 0.4998859 0.5004760 0.4994155 0.5000711 0.5005029
 [6] 0.4999672 0.5003233 0.5000419 0.4997827 0.5004858
[11] 0.5004905 0.4993428 0.4991187 0.5000143 0.5016212
[16] 0.4988943 0.4990586 0.5009718 0.4997235 0.5001220
[1] 0.03 0.00 0.03   NA   NA
> # 30 milliseconds to calculate the mean of all 20
> # variables over 150 000 observations
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  169514  4.6     350000  9.4   350000  9.4
Vcells 3062785 23.4    9317558 71.1  9062793 69.2
> # Less than 30 Mb used (with a peak at about 80 Mb)

Isn't that manageable?

Best,

Philippe Grosjean
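A minimal sketch of the scan() approach from the first point above (here the whole file is read in one call and reshaped, rather than filled row by row); the file name "big.txt" and the 150 000 x 20 dimensions are assumptions for illustration, not from the original post:

## Read a large, purely numeric table with scan() instead of read.table().
## scan() returns a plain numeric vector, which is then reshaped into a matrix.
n.obs  <- 150000   # assumed number of observations
n.vars <- 20       # assumed number of variables

x <- matrix(scan("big.txt", what = double(), n = n.obs * n.vars),
            nrow = n.obs, ncol = n.vars, byrow = TRUE)

colMeans(x)   # column-wise operations on a numeric matrix stay fast

Because the result is a matrix rather than a data frame, this also avoids the extra memory overhead mentioned in the second point.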
My experience is that 100 000 observations shouldn't be a problem. Of course, it also depends on your computer configuration.

--
WenSui Liu (http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children's Hospital Medical Center
On Tue, 24 Jan 2006, Gueorgui Kolev wrote:

> I googled and found many questions about this issue, but curiously
> there were no straightforward answers about what can be done to make
> R capable of handling data.

Because it depends on the situation.

> My experience is rather limited---I tried to load a Stata data set of
> about 150 000 observations (which Stata handles instantly) using the
> library "foreign". After half an hour R was still "thinking", so I
> stopped the attempt.

Like Stata, R prefers to store all the data in memory, but because of R's flexibility it takes more memory than Stata does, and for simple analyses is slower. For simple analyses Stata probably needs only 10-20% as much memory as R on a given data set.

If you have a 64-bit version of R it can handle quite large data sets, certainly millions of records. On the other hand, an ordinary PC might well start to slow down noticeably with a few tens of thousands of reasonably complex records.

Often it is not necessary to store all the data in memory at once, and there are database interfaces to make this easier.

R (and S before it) have generally assumed that increasing computer power will solve a lot of problems more easily than programming would, and have generally been correct. If you want Stata, you know where to find it (and it's a good choice for many problems).

        -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
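A minimal sketch of the database-interface idea Thomas mentions, assuming the RSQLite package (which uses the DBI interface) and a hypothetical SQLite file "big.db" containing a table "obs" with a numeric column "income"; the point is that the database does the heavy lifting and only a small summary is pulled into R's memory:

library(RSQLite)   # provides the DBI methods for SQLite

con <- dbConnect(SQLite(), dbname = "big.db")

## The aggregation happens inside the database; R receives a one-row result
res <- dbGetQuery(con,
    "SELECT AVG(income) AS mean_income, COUNT(*) AS n FROM obs")
print(res)

dbDisconnect(con)

The same pattern works with any DBI back end, so the raw records never have to fit in RAM all at once.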
Dear Gueorgui,

> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations), and therefore large data
> sets (a couple of million observations)?

It depends on what you want to do with the data sets. Loading the data sets shouldn't be a problem, I think. But using the data sets for analysis with self-written R code can get (very) slow, since R is an interpreted language (correct me if I'm wrong). To increase speed you will often need to experiment with the R code. For example, what I've noticed is that processing data sets as matrices works much faster than as data frames. Writing your code in C(++), compiling it and calling it from your R code is often the best way.

HTH,
Martin
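A small timing sketch of the matrix-versus-data-frame point above; the sizes and the element-wise access pattern are arbitrary examples, not from the original post:

n <- 100000
m  <- matrix(rnorm(n * 5), ncol = 5)   # 100 000 x 5 numeric matrix
df <- as.data.frame(m)                 # the same data as a data frame

## Whole-column operations are cheap in both cases, but repeated
## element-wise indexing exposes the extra overhead of a data frame
system.time(for (i in 1:10000) m[i, 3])
system.time(for (i in 1:10000) df[i, 3])

When such loops cannot be vectorised, moving the inner loop to compiled C code via .C() or .Call(), as suggested above, is one way to get rid of that overhead.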