Vadlamani, Satish {FLNA}
2009-Mar-04 23:07 UTC
[R] Question about the use of large datasets in R
Hi:

Sorry if this is a double post. I posted the same thing this morning and did
not see it. I just started using R and am asking the following questions so
that I can plan for the future when I may have to analyze large volumes of
data.

1) What are the limitations of R when it comes to handling large datasets?
Say, for example, a data frame of 200M rows and 15 columns (between 1.5 and
2 GB in size)? Will the limitation be based on the specifications of the
hardware or on R itself?

2) Is R compiled 32-bit or 64-bit (on, say, Windows and AIX)?

3) Are there any other points to note / things to keep in mind when handling
large datasets?

4) Should I be looking at SAS also, only for this reason (we do have SAS
in-house, but the problem is that I am still not sure what we have a license
for, etc.)?

Any pointers / thoughts will be appreciated.

Satish
On Wed, 4 Mar 2009, Vadlamani, Satish {FLNA} wrote:
> Hi:
> Sorry if this is a double post. I posted the same thing this morning and
> did not see it.
>
> I just started using R and am asking the following questions so that I can
> plan for the future when I may have to analyze large volumes of data.
>
> 1) What are the limitations of R when it comes to handling large datasets?
> Say, for example, a data frame of 200M rows and 15 columns (between 1.5 and
> 2 GB in size)? Will the limitation be based on the specifications of the
> hardware or on R itself?
It depends a lot on what you want to do. The default situation in R is that all
the data are loaded into memory, in which case the rule of thumb is that you
want data sets no larger than about 1/3 of memory. If you have, say, a system
with 8 GB of memory and a 64-bit version of R, you should be ok.

It is often possible to work with much larger data sets than this; you just
need to arrange for the whole thing not to be loaded simultaneously. The right
strategy depends on the problem.
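Roughly, the chunk-at-a-time pattern looks like this (the file name, chunk
size, and per-chunk work below are just placeholders, not taken from your
data):

con <- file("big.csv", open = "r")
chunk <- read.csv(con, nrows = 100000)        # first chunk, with header
nm <- names(chunk)
total <- nrow(chunk)                          # stand-in for the real per-chunk work
repeat {
  chunk <- tryCatch(read.csv(con, nrows = 100000, header = FALSE,
                             col.names = nm),
                    error = function(e) NULL) # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + nrow(chunk)
}
close(con)
total   # the whole file was processed without ever being in memory at once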
For example, linear and generalized linear models on large data sets can be
fitted with the biglm package. The various database interface packages and the
packages for netCDF and HDF5 allow subsets of a data set to be loaded easily.
Packages such as bigmemory and ff allow at least some operations to be carried
out on file-backed data objects.
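With biglm, for instance, the model is built with biglm() on the first chunk
and grown with update() on each later one; the file, formula, and column names
here are invented for illustration:

library(biglm)

con <- file("big.csv", open = "r")
chunk <- read.csv(con, nrows = 100000)
nm <- names(chunk)
fit <- biglm(y ~ x1 + x2, data = chunk)       # formula and columns are made up
repeat {
  chunk <- tryCatch(read.csv(con, nrows = 100000, header = FALSE,
                             col.names = nm),
                    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                   # folds the new rows into the fit
}
close(con)
coef(fit)

bigglm() follows the same pattern for generalized linear models.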
> 2) Is R compiled 32-bit or 64-bit (on, say, Windows and AIX)?
On AIX, 64-bit. On Windows, currently only 32-bit, although there is work
towards a 64-bit version.
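If you are ever unsure which build you are running, you can check from within
R itself:

.Machine$sizeof.pointer * 8   # 64 on a 64-bit build, 32 on a 32-bit one
R.version$arch                # e.g. "x86_64" for a 64-bit build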
> 4) Should I be looking at SAS also, only for this reason (we do have SAS
> in-house, but the problem is that I am still not sure what we have a license
> for, etc.)?
I would guess that it would be cheaper to buy hardware on which the problem can
be solved in R than to buy a SAS license (last time I looked, suitable
rack-mount Linux boxes were under USD 3000). If you already have SAS available,
it would be worth looking at; for some large-data problems it will be faster or
easier to use, but not for all.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle